Building an SRE Operations System: From Traditional Operations to Site Reliability Engineering

An in-depth look at the core ideas, implementation methods, tool stack, and best practices of Site Reliability Engineering (SRE), to help organizations build a highly available, scalable, modern operations system.


Introduction

Amid the wave of digital transformation, the traditional "firefighting" model of operations can no longer meet modern internet companies' demands for high availability, high performance, and rapid iteration. Site Reliability Engineering (SRE) emerged in response and has become a key methodology for building modern operations systems.

SRE was first proposed and practiced at Google. Its core idea is to apply software engineering methods to operations work, using automation, monitoring, and observability to ensure system availability and reliability. This article explores SRE's principles, implementation methods, tool stack, and best practices to help organizations build their own SRE capability.

Chapter 1: SRE Core Ideas and Value

1.1 What Is SRE

SRE (Site Reliability Engineering) is a practice that applies software engineering principles to operations work. SRE engineers use programming, automation, and monitoring to ensure that internet services provide:

  • High availability: the system keeps serving traffic, with failure rates held within an acceptable range
  • High performance: response times meet business requirements and the system handles the expected load
  • Scalability: the system scales elastically as the business grows
  • Fast recovery: failures are located and resolved quickly

1.2 How SRE Differs from Traditional Operations

| Dimension | Traditional Operations | SRE |
|---|---|---|
| Work style | Manual, reactive | Automation-first, proactive prevention |
| Goal orientation | System stability | Business reliability |
| Technical approach | Tool usage | Software engineering methods |
| Skill set | Manual technical operations | Programming + operations |
| Mindset | Avoid mistakes | Embrace failure, recover fast |

1.3 SRE Core Values

  1. Embrace risk: don't chase 100% availability; balance cost against risk
  2. Automation first: anything repeatable should be automated
  3. Monitoring is king: without monitoring there is no observability
  4. Continuous improvement: keep optimizing processes and tools to raise efficiency
  5. Business-driven: technical decisions must account for business impact

Chapter 2: Core SRE Practices

2.1 Service Level Objectives (SLO)

SLOs are the foundation of SRE; they define a system's reliability targets.

2.1.1 Key Concepts

  • SLI (Service Level Indicator): a measured indicator of service level, such as response time or error rate
  • SLO (Service Level Objective): a target value for an SLI that defines the system's reliability goal
  • SLA (Service Level Agreement): the service level promised externally, typically with contractual consequences
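
To make these concepts concrete, here is a minimal sketch (function names are illustrative, not from any standard library) that computes an availability SLI from raw request counts and checks it against an SLO target:

# Minimal sketch: compute an availability SLI and compare it to an SLO.
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests

def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """Check whether the measured SLI satisfies the SLO target."""
    return sli >= slo_target

# Example: 1,000,000 requests with 800 failures -> SLI = 0.9992, SLO met
sli = availability_sli(1_000_000, 800)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")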

2.1.2 SLO Design Principles

# Example SLO configuration
slo:
  api_service:
    latency:
      target: "99% requests < 200ms"
      time_window: "30d"
    availability:
      target: "99.9% uptime"
      time_window: "30d"
    error_rate:
      target: "< 0.1%"
      time_window: "30d"
  
  database:
    query_latency:
      target: "95% < 100ms"
      time_window: "30d"
    availability:
      target: "99.99% uptime"
      time_window: "30d"

2.1.3 Error Budget

The error budget is derived by inverting the SLO: it is the amount of unreliability (e.g., downtime) the service is allowed within the time window:

Error budget = (1 - SLO target) × time window

Example: 99.9% availability over a 30-day window (30 days = 43,200 minutes)
Error budget = (1 - 0.999) × 43,200 minutes = 0.001 × 43,200 minutes = 43.2 minutes
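
The same calculation as a small helper, a sketch with illustrative names:

# Minimal sketch: derive the error budget (in minutes) from an SLO target.
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime within the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

# 99.9% over 30 days -> 43.2 minutes; 99.99% over 30 days -> 4.32 minutes
print(error_budget_minutes(0.999, 30))   # 43.2
print(error_budget_minutes(0.9999, 30))  # 4.32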

2.2 Monitoring and Observability

2.2.1 The Three Pillars of Monitoring

  1. Metrics (a minimal exposure sketch follows this list)

    • System metrics: CPU, memory, disk, network
    • Business metrics: QPS, response time, error rate
    • Custom metrics: business-specific indicators
  2. Logs

    • Application logs
    • System logs
    • Security logs
    • Audit logs
  3. Traces

    • Request paths
    • Distributed calls
    • Performance analysis
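
As a concrete example of the metrics pillar, the sketch below uses the prometheus_client Python library to expose a request counter and a latency histogram for Prometheus to scrape; the metric names and the simulated handler are illustrative:

# Minimal sketch: expose metrics at http://localhost:8000/metrics
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['status'])
LATENCY = Histogram('http_request_duration_seconds', 'Request latency in seconds')

def handle_request():
    start = time.time()
    time.sleep(random.uniform(0.01, 0.2))  # simulated work
    status = '200' if random.random() > 0.01 else '500'
    REQUESTS.labels(status=status).inc()
    LATENCY.observe(time.time() - start)

if __name__ == '__main__':
    start_http_server(8000)  # serve /metrics for Prometheus to scrape
    while True:
        handle_request()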

2.2.2 Observability Platform Architecture

# Components of a modern observability platform
observability_stack:
  metrics:
    - Prometheus: metrics collection and storage
    - Grafana: metrics visualization
    - Alertmanager: alert management
  
  logs:
    - Fluentd: log collection
    - Elasticsearch: log storage and search
    - Kibana: log analysis UI
  
  traces:
    - Jaeger: distributed tracing
    - OpenTelemetry: unified telemetry collection
  
  alerting:
    - Alertmanager: alert routing and handling
    - PagerDuty: alert escalation
    - Slack/Email: alert notification

2.3 Capacity Planning and Autoscaling

2.3.1 Capacity Planning Process

  1. Baseline capacity assessment

    • Analyze historical data
    • Identify peak patterns
    • Build a capacity model
  2. Growth forecasting (a simple projection sketch follows this list)

    • Business growth projections
    • Architectural changes
    • Market factors
  3. Cost-benefit analysis

    • Cost of scaling out
    • Service quality improvement
    • Business value assessment
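
For step 2, a simple compound-growth projection is often enough to start. The sketch below estimates the instance count needed a few months out; the growth rate, per-instance throughput, and headroom factor are illustrative assumptions:

# Minimal sketch: project instance count from peak QPS and growth rate.
def forecast_capacity(current_peak_qps: float,
                      monthly_growth_rate: float,
                      months_ahead: int,
                      per_instance_qps: float,
                      headroom: float = 1.3) -> int:
    """Estimate the instance count needed months_ahead from now."""
    projected_qps = current_peak_qps * (1 + monthly_growth_rate) ** months_ahead
    # Keep 30% headroom above the projected peak
    required = projected_qps * headroom / per_instance_qps
    return int(required) + 1  # round up

# 5,000 QPS peak, 8% monthly growth, 6 months out, 500 QPS per instance
print(forecast_capacity(5000, 0.08, 6, 500))  # ~21 instances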

2.3.2 Autoscaling Strategy

# Example Kubernetes HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60

Chapter 3: SRE Tool Stack and Practice

3.1 Infrastructure as Code (IaC)

3.1.1 Infrastructure Management with Terraform

# Terraform configuration for AWS infrastructure
provider "aws" {
  region = "us-west-2"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name        = "sre-vpc"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_security_group" "web" {
  name_prefix = "web-"
  vpc_id      = aws_vpc.main.id
  
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

3.1.2 Configuration Management with Ansible

# Web server configuration playbook
---
- name: Configure web servers
  hosts: web_servers
  become: yes
  vars:
    nginx_version: "1.20.2"
    app_port: 8080
  
  tasks:
    - name: Install Nginx
      package:
        name: nginx
        state: present
    
    - name: Configure Nginx
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart Nginx
    
    - name: Start and enable Nginx
      service:
        name: nginx
        state: started
        enabled: yes
    
    - name: Configure monitoring
      template:
        src: node_exporter.service.j2
        dest: /etc/systemd/system/node_exporter.service
      notify: Reload systemd

  handlers:
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
    
    - name: Reload systemd
      systemd:
        daemon_reload: yes

3.2 Building the CI/CD Pipeline

3.2.1 GitLab CI/CD Configuration

# Example .gitlab-ci.yml
stages:
  - test
  - build
  - security-scan
  - deploy
  - verify

variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

# Test stage
unit-test:
  stage: test
  image: python:3.9
  script:
    - pip install -r requirements.txt
    - python -m pytest tests/ --junitxml=report.xml
  coverage: '/TOTAL.*\s+(\d+%)$/'
  artifacts:
    reports:
      junit: report.xml

# Security scanning
security-scan:
  stage: security-scan
  image: aquasec/trivy
  script:
    - trivy image --severity HIGH,CRITICAL $DOCKER_IMAGE
  allow_failure: true

# Container image build
build-image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $DOCKER_IMAGE .
    - docker push $DOCKER_IMAGE
  only:
    - main
    - develop

# Deploy to production
deploy-production:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/web-app web-app=$DOCKER_IMAGE -n production
    - kubectl rollout status deployment/web-app -n production
  environment:
    name: production
    url: https://app.example.com
  only:
    - main
  when: manual

# Post-deployment verification
verify-deployment:
  stage: verify
  image: curlimages/curl:latest
  script:
    - curl -f https://app.example.com/health || exit 1
  dependencies:
    - deploy-production

3.3 Disaster Recovery and Backup Strategy

3.3.1 Backup Strategy Design

# Example backup configuration
backup_strategy:
  databases:
    mysql:
      type: "logical"
      frequency: "daily"
      retention: "30d"
      encryption: true
      destination: "s3://backups/mysql/"
    redis:
      type: "rdb"
      frequency: "hourly"
      retention: "7d"
      destination: "s3://backups/redis/"
  
  filesystems:
    application_data:
      type: "incremental"
      frequency: "daily"
      retention: "90d"
      encryption: true
      destination: "s3://backups/app-data/"
  
  configurations:
    infrastructure:
      type: "git"
      frequency: "on-change"
      retention: "forever"
      destination: "git@github.com:company/infrastructure-configs.git"

3.3.2 Disaster Recovery Plan

#!/bin/bash
# Example disaster recovery script

# Restore a database
restore_database() {
    local backup_file=$1
    local target_db=$2
    
    echo "Starting database restore: $target_db"
    
    # Verify the backup file exists
    if [ ! -f "$backup_file" ]; then
        echo "Error: backup file not found: $backup_file"
        exit 1
    fi
    
    # Stop the application
    kubectl scale deployment web-app --replicas=0 -n production
    
    # Restore the database
    mysql -u root -p"$DB_PASSWORD" "$target_db" < "$backup_file"
    
    # Sanity-check data integrity
    mysql -u root -p"$DB_PASSWORD" -e "SELECT COUNT(*) FROM users;" "$target_db"
    
    # Restart the application
    kubectl scale deployment web-app --replicas=5 -n production
    
    echo "Database restore complete"
}

# Entry point
main() {
    case "$1" in
        "database")
            restore_database "$2" "$3"
            ;;
        "application")
            echo "Restoring application configuration..."
            kubectl apply -f /backups/app-config/
            ;;
        "infrastructure")
            echo "Restoring infrastructure configuration..."
            terraform apply -auto-approve
            ;;
        *)
            echo "Usage: $0 {database|application|infrastructure}"
            exit 1
            ;;
    esac
}

main "$@"

Chapter 4: SRE Implementation Path and Best Practices

4.1 SRE Implementation Roadmap

4.1.1 Phase 1: Infrastructure Monitoring (Months 1-3)

Goal: establish baseline monitoring capability

Key work

  1. Deploy monitoring tools (Prometheus + Grafana)
  2. Configure basic metric monitoring
  3. Build the alerting system
  4. Implement basic log collection

Success metrics

  • Monitoring coverage for every server
  • Alert accuracy above 90%
  • Monitoring data retained for at least 30 days

4.1.2 Phase 2: Operations Automation (Months 3-6)

Goal: reduce manual work and raise efficiency

Key work

  1. Adopt infrastructure as code
  2. Build CI/CD pipelines
  3. Automate deployment workflows
  4. Automate configuration management

Success metrics

  • Over 90% of deployments automated
  • 100% of infrastructure changes expressed as code
  • Deployment failure rate below 5%

4.1.3 Phase 3: Reliability Engineering (Months 6-12)

Goal: improve system reliability

Key work

  1. Establish an SLO framework
  2. Practice chaos engineering
  3. Build an observability platform
  4. Refine capacity planning

Success metrics

  • System availability reaches 99.9%
  • Mean time to recovery (MTTR) < 30 minutes
  • False-alert rate below 10%

4.1.4 Phase 4: Intelligent Operations (Month 12 and Beyond)

Goal: achieve intelligent, largely self-managing operations

Key work

  1. AI-driven anomaly detection
  2. Automated fault diagnosis
  3. Predictive capacity planning
  4. Self-healing systems

4.2 SRE Team Organization

4.2.1 Team Structure

sre_team_structure:
  team_lead:
    title: "SRE Team Lead"
    responsibilities:
      - SRE strategy and planning
      - Team management
      - Cross-team coordination
  
  senior_sre:
    title: "Senior SRE Engineer"
    count: 2
    responsibilities:
      - Architecture design
      - Technical decisions
      - Mentoring junior engineers
  
  sre_engineers:
    title: "SRE Engineer"
    count: 4
    responsibilities:
      - Building monitoring systems
      - Automation development
      - Incident response
  
  junior_sre:
    title: "Junior SRE Engineer"
    count: 2
    responsibilities:
      - Day-to-day operations
      - Monitoring maintenance
      - Learning and growth

4.2.2 Roles and Skill Requirements

SRE Engineer Skill Matrix

| Skill Area | Junior | Intermediate | Senior |
|---|---|---|---|
| Operating systems | Linux basics | System tuning | Kernel optimization |
| Programming | Shell scripting | Python/Go | Architecture design |
| Cloud computing | Basic usage | Cloud-native technologies | Multi-cloud management |
| Containers | Docker basics | Kubernetes | Service Mesh |
| Monitoring tools | Using Grafana | Prometheus development | Observability architecture |
| Security | Basic configuration | Security hardening | Security architecture |
| Business understanding | Knows the business | Deep understanding | Business-driven |

4.3 Key Performance Indicators (KPIs)

4.3.1 Technical Metrics

technical_kpis:
  availability:
    target: "99.9%"
    measurement: "uptime_percentage"
    period: "monthly"
  
  latency:
    target: "95% requests < 200ms"
    measurement: "p95_response_time"
    period: "daily"
  
  error_rate:
    target: "< 0.1%"
    measurement: "error_percentage"
    period: "daily"
  
  mttr:
    target: "< 30 minutes"
    measurement: "mean_time_to_recovery"
    period: "incident"
  
  mtbf:
    target: "> 30 days"
    measurement: "mean_time_between_failures"
    period: "monthly"

4.3.2 Operational Metrics

operational_kpis:
  deployment_frequency:
    target: "> 10 times per day"
    measurement: "deployments_per_day"
    period: "weekly"
  
  lead_time:
    target: "< 1 day"
    measurement: "time_from_commit_to_production"
    period: "weekly"
  
  change_failure_rate:
    target: "< 5%"
    measurement: "failed_deployments / total_deployments"
    period: "monthly"
  
  automation_coverage:
    target: "> 80%"
    measurement: "automated_tasks / total_tasks"
    period: "monthly"

Chapter 5: SRE Tool Stack in Practice

5.1 SRE Practice in a Kubernetes Environment

5.1.1 Deploying a Complete Monitoring Stack

#!/bin/bash
# Kubernetes monitoring stack deployment script

set -e

NAMESPACE="monitoring"
RELEASE_NAME="monitoring"

echo "Deploying the SRE monitoring stack..."

# Add the required Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Create the namespace
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -

# Deploy the full monitoring stack
helm upgrade --install $RELEASE_NAME prometheus-community/kube-prometheus-stack \
  --namespace $NAMESPACE \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.probeSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
  --set grafana.adminPassword="StrongAdminPassword123!" \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

# Deploy the Loki logging stack
helm upgrade --install loki grafana/loki-stack \
  --namespace $NAMESPACE \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi \
  --set promtail.enabled=true

# Deploy Jaeger distributed tracing
helm upgrade --install jaeger jaegertracing/jaeger \
  --namespace $NAMESPACE \
  --set collector.replicaCount=2 \
  --set query.replicaCount=1

echo "SRE monitoring stack deployed!"
echo "Access (e.g., via kubectl port-forward):"
echo "- Grafana: http://localhost:3000"
echo "- Prometheus: http://localhost:9090"
echo "- Jaeger: http://localhost:16686"

5.1.2 Application Monitoring Configuration

# Application monitoring configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-monitoring-config
  namespace: default
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - "/etc/prometheus/rules/*.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
      
      - job_name: 'my-app'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: my-app
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2

---
# ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: default
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s
    honorLabels: true

5.2 Automated Operations in Practice

5.2.1 Intelligent Alerting System

#!/usr/bin/env python3
# Intelligent alerting system

import asyncio
import json
import logging
import smtplib
from dataclasses import dataclass
from datetime import datetime
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Dict, List

import aioredis

@dataclass
class AlertRule:
    name: str
    metric: str
    condition: str
    threshold: float
    duration: int  # seconds
    severity: str  # critical, warning, info
    description: str

class IntelligentAlerting:
    def __init__(self, redis_url: str):
        self.redis = aioredis.from_url(redis_url)
        self.alert_rules = self._load_alert_rules()
        self.notification_channels = self._setup_notification_channels()
        
    def _load_alert_rules(self) -> List[AlertRule]:
        """Load alert rules."""
        return [
            AlertRule(
                name="high_cpu_usage",
                metric="cpu_usage_percent",
                condition=">",
                threshold=80.0,
                duration=300,  # 5 minutes
                severity="warning",
                description="CPU usage sustained above 80%"
            ),
            AlertRule(
                name="critical_cpu_usage",
                metric="cpu_usage_percent",
                condition=">",
                threshold=95.0,
                duration=120,  # 2 minutes
                severity="critical",
                description="CPU usage sustained above 95%"
            ),
            AlertRule(
                name="high_memory_usage",
                metric="memory_usage_percent",
                condition=">",
                threshold=85.0,
                duration=300,
                severity="warning",
                description="Memory usage sustained above 85%"
            ),
            AlertRule(
                name="service_down",
                metric="service_health",
                condition="==",
                threshold=0.0,
                duration=60,
                severity="critical",
                description="Service unavailable"
            )
        ]
    
    def _setup_notification_channels(self) -> List[Dict]:
        """Configure notification channels; in practice, load from config."""
        return []
    
    async def evaluate_alerts(self):
        """Evaluate all alert rules."""
        for rule in self.alert_rules:
            try:
                # Fetch the metric value
                metric_value = await self._get_metric_value(rule.metric)
                
                # Check whether the rule fires
                if self._evaluate_condition(metric_value, rule.condition, rule.threshold):
                    await self._check_alert_duration(rule, metric_value)
                else:
                    # Clear any existing alert state
                    await self._clear_alert(rule)
                    
            except Exception as e:
                logging.error(f"Failed to evaluate alert rule {rule.name}: {e}")
    
    async def _get_metric_value(self, metric: str) -> float:
        """Fetch a metric value."""
        # In production this would query Prometheus or another monitoring
        # system; simplified here to read directly from Redis.
        value = await self.redis.get(f"metric:{metric}")
        return float(value) if value else 0.0
    
    def _evaluate_condition(self, value: float, condition: str, threshold: float) -> bool:
        """Evaluate a threshold condition."""
        if condition == ">":
            return value > threshold
        elif condition == ">=":
            return value >= threshold
        elif condition == "<":
            return value < threshold
        elif condition == "<=":
            return value <= threshold
        elif condition == "==":
            return value == threshold
        elif condition == "!=":
            return value != threshold
        return False
    
    async def _clear_alert(self, rule: AlertRule):
        """Clear the stored alert state for a rule."""
        await self.redis.delete(f"alert:{rule.name}")
    
    async def _update_alert_severity(self, rule: AlertRule, alert_data: Dict):
        """Escalate (or downgrade) an existing alert's severity."""
        alert_data['severity'] = rule.severity
        await self.redis.setex(f"alert:{rule.name}", rule.duration + 300, json.dumps(alert_data))
    
    async def _check_alert_duration(self, rule: AlertRule, metric_value: float):
        """Track how long the rule has been firing."""
        alert_key = f"alert:{rule.name}"
        
        # Check whether an alert already exists
        existing_alert = await self.redis.get(alert_key)
        
        if existing_alert:
            # Alert exists; check whether it needs escalation
            alert_data = json.loads(existing_alert)
            if alert_data['severity'] != rule.severity:
                await self._update_alert_severity(rule, alert_data)
        else:
            # New alert: record the start time
            alert_data = {
                'start_time': datetime.now().isoformat(),
                'metric_value': metric_value,
                'severity': rule.severity,
                'triggered_at': datetime.now().isoformat()
            }
            await self.redis.setex(
                alert_key, 
                rule.duration + 300,  # extra 5-minute buffer
                json.dumps(alert_data)
            )
            
            # Check whether the duration threshold is already met
            if rule.duration == 0:  # fire immediately
                await self._trigger_alert(rule, metric_value)
            else:
                # Schedule a delayed re-check
                asyncio.create_task(self._check_alert_duration_async(rule, metric_value, rule.duration))
    
    async def _check_alert_duration_async(self, rule: AlertRule, metric_value: float, duration: int):
        """Re-check the metric after the required duration."""
        await asyncio.sleep(duration)
        
        # Re-check the metric value
        current_value = await self._get_metric_value(rule.metric)
        if self._evaluate_condition(current_value, rule.condition, rule.threshold):
            await self._trigger_alert(rule, current_value)
    
    async def _trigger_alert(self, rule: AlertRule, metric_value: float):
        """Fire an alert."""
        alert_data = {
            'name': rule.name,
            'metric': rule.metric,
            'value': metric_value,
            'threshold': rule.threshold,
            'condition': rule.condition,
            'severity': rule.severity,
            'description': rule.description,
            'triggered_at': datetime.now().isoformat()
        }
        
        # Record the alert
        await self.redis.lpush("active_alerts", json.dumps(alert_data))
        
        # Send notifications
        await self._send_notifications(rule, alert_data)
        
        logging.warning(f"Alert fired: {rule.name} - {rule.description}")
    
    async def _send_notifications(self, rule: AlertRule, alert_data: Dict):
        """Send alert notifications to all channels."""
        for channel in self.notification_channels:
            try:
                if channel['type'] == 'email':
                    await self._send_email_alert(rule, alert_data, channel)
                elif channel['type'] == 'slack':
                    await self._send_slack_alert(rule, alert_data, channel)
                elif channel['type'] == 'webhook':
                    await self._send_webhook_alert(rule, alert_data, channel)
            except Exception as e:
                logging.error(f"Failed to send alert notification: {e}")
    
    async def _send_slack_alert(self, rule: AlertRule, alert_data: Dict, channel: Dict):
        """Send a Slack alert (left as a stub)."""
        logging.info(f"Slack alert for {rule.name}: {alert_data}")
    
    async def _send_webhook_alert(self, rule: AlertRule, alert_data: Dict, channel: Dict):
        """Send a webhook alert (left as a stub)."""
        logging.info(f"Webhook alert for {rule.name}: {alert_data}")
    
    async def _send_email_alert(self, rule: AlertRule, alert_data: Dict, channel: Dict):
        """Send an email alert."""
        msg = MIMEMultipart()
        msg['From'] = channel['from']
        msg['To'] = ', '.join(channel['to'])
        msg['Subject'] = f"[{rule.severity.upper()}] {rule.name}"
        
        body = f"""
        Alert details:
        
        Name:        {rule.name}
        Severity:    {rule.severity}
        Description: {rule.description}
        
        Metric:      {rule.metric}
        Current:     {alert_data['value']}
        Threshold:   {rule.condition} {rule.threshold}
        
        Fired at:    {alert_data['triggered_at']}
        
        Please investigate promptly!
        """
        
        msg.attach(MIMEText(body, 'plain', 'utf-8'))
        
        server = smtplib.SMTP(channel['smtp_server'], channel['smtp_port'])
        server.starttls()
        server.login(channel['username'], channel['password'])
        server.send_message(msg)
        server.quit()

# Entry point
async def main():
    alerting = IntelligentAlerting("redis://localhost:6379")
    
    # Run the evaluation loop
    while True:
        await alerting.evaluate_alerts()
        await asyncio.sleep(30)  # evaluate every 30 seconds

if __name__ == "__main__":
    asyncio.run(main())

5.2.2 Automated Failure Recovery System

# Automated failure recovery configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: auto-healing-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: auto-healing-agent
  template:
    metadata:
      labels:
        app: auto-healing-agent
    spec:
      containers:
      - name: auto-healing-agent
        # NOTE: the image must provide kubectl and jq; plain busybox does not
        image: busybox
        command:
        - /bin/sh
        - -c
        - |
          #!/bin/sh
          while true; do
            # Check pod status
            kubectl get pods --all-namespaces -o json | \
            jq -r '.items[] | select(.status.phase != "Running") | "\(.metadata.namespace) \(.metadata.name)"' | \
            while read namespace pod; do
              echo "Detected unhealthy pod: $pod in $namespace"
              
              # Fetch pod details
              pod_info=$(kubectl get pod $pod -n $namespace -o json)
              restart_count=$(echo $pod_info | jq -r '.status.containerStatuses[0].restartCount')
              reason=$(echo $pod_info | jq -r '.status.containerStatuses[0].state.waiting.reason // "Unknown"')
              
              # If the pod has restarted too many times, recover it
              if [ "$restart_count" -gt 3 ]; then
                echo "Pod $pod has restarted too many times; auto-recovering..."
                
                # Delete the pod so Kubernetes recreates it
                kubectl delete pod $pod -n $namespace --grace-period=30
                
                # Record the recovery action
                echo "$(date): Auto-recovered pod $pod in $namespace" >> /var/log/auto-healing.log
              fi
            done
            
            # Check node status
            kubectl get nodes -o json | \
            jq -r '.items[] | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | .metadata.name' | \
            while read node; do
              echo "Detected unhealthy node: $node"
              
              # Evacuate pods from the node
              kubectl drain $node --ignore-daemonsets --delete-emptydir-data
              
              # Wait for the node to recover
              sleep 300  # wait 5 minutes
              
              # Bring the node back into scheduling
              kubectl uncordon $node
              
              echo "$(date): Auto-uncordoned node $node" >> /var/log/auto-healing.log
            done
            
            sleep 60  # check every minute
          done
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
          type: DirectoryOrCreate

5.3 Chaos Engineering in Practice

5.3.1 Chaos Experiment Framework

# Chaos engineering experiment configuration (Chaos Mesh)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-chaos
  namespace: production
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: "web-app"
  mode: one
  action: pod-failure
  duration: "30s"

---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-chaos
  namespace: production
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: "database"
  mode: all
  action: delay
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "10ms"
  duration: "60s"

---
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-chaos
  namespace: production
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: "api-gateway"
  mode: fixed
  value: "2"
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "120s"

5.3.2 Chaos Experiment Management Script

#!/usr/bin/env python3
# Chaos experiment management system

import asyncio
import json
import logging
import random
import yaml
from datetime import datetime, timedelta
from typing import Dict, List

class ChaosExperimentManager:
    def __init__(self):
        self.logger = self._setup_logging()
        self.experiments = []
    
    def _setup_logging(self):
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        return logging.getLogger(__name__)
    
    def load_experiments(self, config_file: str):
        """Load chaos experiment definitions."""
        with open(config_file, 'r') as f:
            config = yaml.safe_load(f)
            self.experiments = config.get('experiments', [])
        self.logger.info(f"Loaded {len(self.experiments)} chaos experiments")
    
    async def run_experiment(self, experiment_name: str):
        """Run the named experiment."""
        experiment = next((exp for exp in self.experiments if exp['name'] == experiment_name), None)
        if not experiment:
            self.logger.error(f"Experiment {experiment_name} does not exist")
            return
        
        self.logger.info(f"Starting experiment: {experiment_name}")
        
        try:
            # Pre-flight checks
            await self._pre_experiment_check(experiment)
            
            # Inject the fault
            await self._execute_chaos(experiment)
            
            # Observe the system under fault
            await self._monitor_experiment(experiment)
            
            # Clean up
            await self._cleanup_experiment(experiment)
            
            self.logger.info(f"Experiment {experiment_name} completed")
            
        except Exception as e:
            self.logger.error(f"Experiment {experiment_name} failed: {e}")
            await self._emergency_cleanup(experiment)
    
    async def _pre_experiment_check(self, experiment: Dict):
        """Run pre-flight checks."""
        self.logger.info("Running pre-flight health checks...")
        
        # Check overall system health
        health_status = await self._check_system_health()
        if not health_status['healthy']:
            raise Exception(f"System health check failed: {health_status}")
        
        # Check backup status
        backup_status = await self._check_backup_status()
        if not backup_status['recent_backup']:
            raise Exception("No recent backup found; create one before running experiments")
        
        # Set up experiment-specific alerts
        await self._setup_experiment_alerts(experiment)
        
        self.logger.info("Pre-flight checks passed")
    
    async def _check_system_health(self) -> Dict:
        """Check system health (stub; query your monitoring system in practice)."""
        return {'healthy': True}
    
    async def _check_backup_status(self) -> Dict:
        """Check backup freshness (stub; query your backup system in practice)."""
        return {'recent_backup': True}
    
    async def _setup_experiment_alerts(self, experiment: Dict):
        """Configure temporary alerts for the experiment (stub)."""
        self.logger.info(f"Setting up alerts for {experiment['name']}")
    
    async def _execute_chaos(self, experiment: Dict):
        """Dispatch the fault-injection step."""
        chaos_type = experiment['type']
        
        if chaos_type == 'pod_failure':
            await self._execute_pod_failure(experiment)
        elif chaos_type == 'network_chaos':
            await self._execute_network_chaos(experiment)
        elif chaos_type == 'resource_stress':
            await self._execute_resource_stress(experiment)
        else:
            raise Exception(f"Unsupported chaos experiment type: {chaos_type}")
    
    async def _execute_pod_failure(self, experiment: Dict):
        """Inject pod failures."""
        self.logger.info("Injecting pod failures...")
        
        # Delete the target pods with kubectl
        target_pods = experiment['target_pods']
        for pod in target_pods:
            self.logger.info(f"Deleting pod: {pod}")
            result = await asyncio.create_subprocess_exec(
                'kubectl', 'delete', 'pod', pod,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
            await result.communicate()
    
    async def _execute_network_chaos(self, experiment: Dict):
        """Inject network faults (stub; apply a NetworkChaos manifest in practice)."""
        self.logger.info("Injecting network faults...")
    
    async def _execute_resource_stress(self, experiment: Dict):
        """Inject resource stress (stub; apply a StressChaos manifest in practice)."""
        self.logger.info("Injecting resource stress...")
    
    async def _cleanup_experiment(self, experiment: Dict):
        """Remove injected faults and temporary alerts (stub)."""
        self.logger.info(f"Cleaning up experiment {experiment['name']}")
    
    async def _emergency_cleanup(self, experiment: Dict):
        """Roll back immediately after a failed experiment (stub)."""
        self.logger.warning(f"Emergency cleanup for experiment {experiment['name']}")
    
    async def _monitor_experiment(self, experiment: Dict):
        """Monitor the system while the fault is active."""
        duration = experiment['duration']
        self.logger.info(f"Monitoring experiment for {duration} seconds")
        
        start_time = datetime.now()
        end_time = start_time + timedelta(seconds=duration)
        
        metrics = {
            'response_times': [],
            'error_rates': [],
            'availability': []
        }
        
        while datetime.now() < end_time:
            # Collect metrics
            current_metrics = await self._collect_metrics()
            metrics['response_times'].append(current_metrics['response_time'])
            metrics['error_rates'].append(current_metrics['error_rate'])
            metrics['availability'].append(current_metrics['availability'])
            
            # Warn if the blast radius exceeds the threshold
            if current_metrics['error_rate'] > experiment.get('error_threshold', 0.1):
                self.logger.warning(f"Error rate too high: {current_metrics['error_rate']}")
            
            await asyncio.sleep(10)  # sample every 10 seconds
        
        # Analyze the results
        await self._analyze_experiment_results(metrics, experiment)
    
    async def _collect_metrics(self) -> Dict:
        """Collect system metrics."""
        # In production this would query Prometheus or another monitoring
        # system; randomized values stand in as a simplified example.
        return {
            'response_time': random.uniform(100, 500),  # ms
            'error_rate': random.uniform(0, 0.05),      # 0-5%
            'availability': random.uniform(0.95, 1.0)   # 95-100%
        }
    
    async def _analyze_experiment_results(self, metrics: Dict, experiment: Dict):
        """Analyze experiment results."""
        self.logger.info("Analyzing experiment results...")
        
        # Compute summary statistics
        avg_response_time = sum(metrics['response_times']) / len(metrics['response_times'])
        max_error_rate = max(metrics['error_rates'])
        min_availability = min(metrics['availability'])
        
        # Build the report
        report = {
            'experiment_name': experiment['name'],
            'start_time': datetime.now().isoformat(),
            'duration': experiment['duration'],
            'results': {
                'avg_response_time': avg_response_time,
                'max_error_rate': max_error_rate,
                'min_availability': min_availability
            },
            'status': 'success' if max_error_rate < 0.1 else 'warning'
        }
        
        # Persist the report
        await self._save_experiment_report(report)
        
        self.logger.info(f"Experiment results: {report}")
    
    async def _save_experiment_report(self, report: Dict):
        """Save the experiment report to disk."""
        filename = f"chaos_report_{report['experiment_name']}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        
        with open(filename, 'w') as f:
            json.dump(report, f, indent=2)
        
        self.logger.info(f"Experiment report saved: {filename}")

# Usage example
async def main():
    manager = ChaosExperimentManager()
    manager.load_experiments('chaos_experiments.yaml')
    
    # Run every experiment, 60 seconds apart
    for experiment in manager.experiments:
        await manager.run_experiment(experiment['name'])
        await asyncio.sleep(60)

if __name__ == "__main__":
    asyncio.run(main())

Conclusion

SRE is more than a set of tools and techniques; it is a culture and a mindset. It asks us to solve operations problems with software engineering methods, using automation, monitoring, and observability to build highly available, scalable modern systems.

Adopting SRE is a gradual process that requires organizational culture, team skills, and tooling to evolve together. The key is to start small, build capability step by step, and keep improving.

Successful SRE practice requires attention to the following points:

  1. Clear goals: establish well-defined SLOs and error budgets
  2. Tool enablement: choose an appropriate technology stack and tools
  3. Culture building: cultivate automation-first habits and thinking
  4. Continuous improvement: review and optimize processes regularly
  5. Talent development: build a technically strong SRE team

In the digital era, SRE will be a core IT operations capability, helping organizations keep systems stable while iterating quickly and growing the business.

Remember: reliability is not an accident; it is engineered.


Last updated: October 9, 2025
