549_Monitoring and Alerting System Design

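In One Sentence

A monitoring and alerting system is the "nervous system" of a production environment: it collects metrics in real time, detects anomalies, and automatically notifies the people responsible.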

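Core Concepts

The Four Golden Signals of Monitoring (from SRE practice)

1. Latency: how long a request takes to complete
   - Track the latency of successful and failed requests separately, since fast failures can otherwise hide slow successes.

2. Traffic: how much load the system is handling
   - QPS, number of concurrently running jobs

3. Errors: the proportion of requests that fail
   - HTTP 5xx ratio, job failure rate

4. Saturation: how "full" the system is
   - CPU/memory/disk utilization, queue depth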
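To make the four signals concrete, here is a minimal sketch that summarizes one observation window of request records. The `Request` record, the `golden_signals` function, and the queue-based saturation measure are illustrative assumptions, not part of any monitoring library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_s: float  # how long the request took
    ok: bool           # False for failed requests (e.g. HTTP 5xx)


def golden_signals(requests: List[Request], window_s: float,
                   queue_depth: int, queue_capacity: int) -> dict:
    """Summarize one observation window into the four golden signals."""
    ok_latencies = [r.duration_s for r in requests if r.ok]
    err_latencies = [r.duration_s for r in requests if not r.ok]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        # Latency, kept separate for successful and failed requests
        "latency_ok_s": mean(ok_latencies),
        "latency_err_s": mean(err_latencies),
        # Traffic: requests per second over the window
        "traffic_qps": len(requests) / window_s,
        # Errors: share of requests that failed
        "error_ratio": len(err_latencies) / len(requests) if requests else 0.0,
        # Saturation: how full the (hypothetical) job queue is
        "saturation": queue_depth / queue_capacity,
    }


sample = [Request(0.2, True), Request(0.4, True),
          Request(0.01, False), Request(0.3, True)]
signals = golden_signals(sample, window_s=2.0, queue_depth=30, queue_capacity=100)
print(signals)
```

In this sample the failed request returns in 10 ms; averaging it together with the successes would make latency look better than it is, which is why the two are reported separately.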

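Monitoring Layers

Infrastructure monitoring
  - Servers: CPU, memory, disk, network
  - Containers: Pod status, resource usage

Application monitoring
  - API latency, error rate, QPS
  - Job success/failure rate

Business monitoring
  - Jobs submitted per day
  - Trend of average analysis duration
  - User activity

Alerting policy
  - P1 (urgent): service down, risk of data loss → phone call immediately
  - P2 (severe): error rate > 5%, disk usage > 90% → respond within 10 minutes
  - P3 (warning): elevated latency, CPU sustained above 80% → respond within 1 hour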
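The alerting tiers above can be sketched as a small classifier that checks the most severe condition first. The function name, its parameters, and the `RESPONSE` table are hypothetical; the thresholds come from the list above. The "elevated latency" condition is left out because the notes give no number for it, and the "sustained" part of the CPU rule is assumed to be handled by whoever computes `cpu_used`.

```python
from typing import Optional


def classify_alert(service_up: bool, error_ratio: float,
                   disk_used: float, cpu_used: float) -> Optional[str]:
    """Return 'P1', 'P2', 'P3', or None, checking the most severe tier first."""
    if not service_up:
        return "P1"  # outage: phone call immediately
    if error_ratio > 0.05 or disk_used > 0.90:
        return "P2"  # respond within 10 minutes
    if cpu_used > 0.80:
        return "P3"  # respond within 1 hour
    return None      # nothing to alert on


# Hypothetical mapping from tier to expected response
RESPONSE = {
    "P1": "phone call immediately",
    "P2": "respond within 10 minutes",
    "P3": "respond within 1 hour",
}

level = classify_alert(service_up=True, error_ratio=0.06, disk_used=0.50, cpu_used=0.85)
print(level, RESPONSE.get(level))
```

Ordering the checks from P1 down means a host that is both CPU-bound and failing requests is reported once, at the higher tier.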

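Hands-On Code / Diagrams / Templates

Prometheus + Grafana Monitoring Architecture

[Monitored services]
  expose a /metrics endpoint (HTTP)
       │
[Prometheus Server]  ← scrapes metrics periodically (pull model)
  - stores time series data (TSDB)
  - evaluates alerting rules
       │
  ┌────┴────┐
  ▼         ▼
[Grafana]  [Alertmanager]
(charts)    │ alert routing
            ├── Email
            ├── DingTalk/Feishu
            └── PagerDuty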
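The pull model in the diagram can be illustrated with the standard library alone: a service exposes a /metrics page in the Prometheus text format, and a scraper fetches and parses it. This is a toy sketch, not how Prometheus itself is implemented; `MetricsHandler`, `scrape`, and the `demo_requests_total` metric are made-up names, and the parser only handles label-free samples.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 0  # hypothetical in-process counter


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Text exposition format: a "# TYPE" line, then "name value" samples
        body = (
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_SERVED}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def scrape(url: str) -> dict:
    """Fetch a /metrics page and parse its 'name value' sample lines."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # no proxy
    with opener.open(url, timeout=5) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

REQUESTS_SERVED = 42  # the service updates its counter as it works
result = scrape(f"http://127.0.0.1:{server.server_port}/metrics")
server.shutdown()
server.server_close()
print(result)
```

The service only has to keep its current values readable; how often they are collected is decided entirely on the scraper's side.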

Exposing Metrics from a Python Service

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
# Counter: only ever increases (total requests, total errors)
job_total = Counter(
    'bioinf_job_total',
    'Total jobs submitted',
    ['pipeline_type', 'status']  # labels: break down by pipeline and status
)

# Histogram: distribution of observations (latency)
job_duration = Histogram(
    'bioinf_job_duration_seconds',
    'Job execution time in seconds',
    ['pipeline_type'],
    buckets=[300, 900, 1800, 3600, 7200, 14400]  # 5 min to 4 h
)

# Gauge: can go up and down (number of currently running jobs)
active_jobs = Gauge(
    'bioinf_active_jobs',
    'Currently running jobs',
    ['pipeline_type']
)

# Expose the metrics endpoint
start_http_server(8001)  # http://localhost:8001/metrics

# Placeholder for the real pipeline runner, which these notes do not show
def execute_pipeline(job_id: str, pipeline: str):
    raise NotImplementedError

# Use the metrics in business code
def run_analysis_job(job_id: str, pipeline: str):
    active_jobs.labels(pipeline_type=pipeline).inc()
    start_time = time.time()

    try:
        result = execute_pipeline(job_id, pipeline)
        job_total.labels(pipeline_type=pipeline, status='success').inc()
        return result
    except Exception:
        job_total.labels(pipeline_type=pipeline, status='failed').inc()
        raise
    finally:
        # Runs on success and on failure, so duration and gauge stay consistent
        duration = time.time() - start_time
        job_duration.labels(pipeline_type=pipeline).observe(duration)
        active_jobs.labels(pipeline_type=pipeline).dec()

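Prometheus Alerting Rules

# alert_rules.yml
groups:
  - name: bioinf_platform_alerts
    rules:
      # Job failure rate too high (per pipeline type).
      # Both sides are aggregated with sum by (pipeline_type) so their label
      # sets match; dividing the raw series would only pair "failed" with itself.
      - alert: HighJobFailureRate
        expr: |
          sum by (pipeline_type) (rate(bioinf_job_total{status="failed"}[5m]))
          /
          sum by (pipeline_type) (rate(bioinf_job_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job failure rate is too high"
          description: "Job failure rate over the last 5 minutes = {{ $value | humanizePercentage }}"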

      # Disk usage above 85%
      - alert: DiskSpaceWarning
        expr: |
          (node_filesystem_size_bytes{mountpoint="/data"} -
           node_filesystem_free_bytes{mountpoint="/data"}) /
          node_filesystem_size_bytes{mountpoint="/data"} > 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space is running low"
          description: "Data disk usage = {{ $value | humanizePercentage }}"

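      # Job queue backlog
      # (bioinf_queue_depth is assumed to be exported as a gauge elsewhere;
      #  it is not defined in the Python example in these notes)
      - alert: JobQueueBacklog
        expr: bioinf_queue_depth > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Job queue is backing up"
          description: "Queue depth = {{ $value }}"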
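The effect of the `for:` clause in these rules can be sketched as a small state machine: a breach first puts the alert in a pending state, and it only fires once the condition has held across enough consecutive evaluations. The function below is a simplified model with made-up names; it counts evaluation steps rather than wall-clock time.

```python
from typing import List, Optional


def alert_states(values: List[float], threshold: float, for_steps: int) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each evaluation step."""
    states = []
    breach_started: Optional[int] = None
    for step, value in enumerate(values):
        if value > threshold:
            if breach_started is None:
                breach_started = step        # condition just became true
            held = step - breach_started     # how many steps it has held so far
            states.append("firing" if held >= for_steps else "pending")
        else:
            breach_started = None            # any recovery resets the timer
            states.append("inactive")
    return states


# Queue depth sampled once per minute; threshold 1000, "for" of 3 minutes
depths = [1200, 1500, 900, 1100, 1300, 1250, 1400]
states = alert_states(depths, threshold=1000, for_steps=3)
print(states)
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'pending', 'firing']
```

The two-minute spike at the start never fires because the dip to 900 resets the timer, which is how a `for:` duration filters out short-lived blips.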

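Simple Log-Based Alerting (Python)

import logging
import smtplib
from email.mime.text import MIMEText

class AlertHandler(logging.Handler):
    """Logging handler that emails every record at ERROR level or above."""

    def __init__(self):
        # Let the logging framework filter by level instead of checking in emit()
        super().__init__(level=logging.ERROR)

    def emit(self, record):
        try:
            # Header values must be a single line, so flatten the message
            message = record.getMessage().replace('\n', ' ')
            self.send_alert(
                subject=f"[ALERT] {record.funcName}: {message}",
                body=self.format(record)
            )
        except Exception:
            # A failed email must not crash the code that logged the error
            self.handleError(record)

    def send_alert(self, subject: str, body: str):
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = 'alert@example.com'
        msg['To'] = 'oncall@example.com'

        # Sent synchronously: the logging call blocks until SMTP finishes
        with smtplib.SMTP('smtp.example.com') as server:
            server.send_message(msg)

# Configuration
logger = logging.getLogger('bioinf')
logger.addHandler(AlertHandler())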
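A handler that emails on every ERROR record can flood the on-call inbox when one failure repeats in a loop. One possible mitigation, sketched here under assumptions, is a `logging.Filter` that lets a given message template through at most once per interval. `ThrottleFilter` and `ListHandler` are hypothetical names; `ListHandler` stands in for the email handler so the example runs without an SMTP server.

```python
import logging
import time


class ThrottleFilter(logging.Filter):
    """Pass at most one record per (function, message template) per interval."""

    def __init__(self, interval_s: float = 300.0, clock=time.monotonic):
        super().__init__()
        self.interval_s = interval_s
        self.clock = clock          # injectable so the demo can fake time
        self._last_sent = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.funcName, record.msg)  # msg is the unformatted template
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.interval_s:
            return False                     # repeat inside the window: drop it
        self._last_sent[key] = now
        return True


class ListHandler(logging.Handler):
    """Stand-in for an email handler: records what would have been sent."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.sent = []

    def emit(self, record):
        self.sent.append(record.getMessage())


fake_now = [0.0]
handler = ListHandler()
handler.addFilter(ThrottleFilter(interval_s=300, clock=lambda: fake_now[0]))
demo_logger = logging.getLogger("bioinf.throttle_demo")
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.error("pipeline %s failed", "rnaseq")  # first occurrence: sent
demo_logger.error("pipeline %s failed", "wgs")     # same template: dropped
fake_now[0] = 301.0
demo_logger.error("pipeline %s failed", "rnaseq")  # window elapsed: sent again
print(handler.sent)
```

Keying on the template rather than the formatted text groups "the same kind of failure" together. The trade-off is that the dropped record for "wgs" is never emailed, so a real setup would still want those records in the regular log file.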

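Common Interview Questions

Question	Reference answer
Push vs. pull?	Prometheus pulls: the server scrapes each target's /metrics endpoint on its own schedule, so collection is configured in one place; Graphite has clients push metrics to it
How do you reduce alert noise?	Set a for clause (duration) so only sustained conditions fire, and tier alerts by severity
How do you do capacity planning?	Forecast from usage trends and scale out ahead of time (e.g. 15 days before projected exhaustion)
What are SLO and SLA?	SLO = objective (internal target); SLA = agreement (commitment made to customers)
How do you monitor batch jobs?	Track job completion rate, average duration, and queue depth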
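The capacity-planning answer (forecast the trend, scale out early) can be sketched with an ordinary least-squares line fitted to daily usage samples. The function name, the sample numbers, and the 15-day lead time used in the check are illustrative assumptions; real usage is rarely this linear.

```python
from typing import List, Optional


def days_until_full(daily_usage: List[float], capacity: float) -> Optional[float]:
    """Fit usage = slope * day + intercept and project when capacity is reached.

    Returns the number of days remaining after the last sample, or None if
    there are too few samples or usage is not growing.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking usage: no projected exhaustion
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(day_full - (n - 1), 0.0)


usage_tb = [50, 52, 54, 56, 58]  # one sample per day, in TB (made-up data)
remaining = days_until_full(usage_tb, capacity=100)
should_scale_now = remaining is not None and remaining <= 15
print(remaining, should_scale_now)  # 21.0 False
```

With growth of 2 TB per day and 42 TB of headroom, the disk is projected to fill in 21 days, so the 15-day lead time has not been reached yet.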

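Cheat Sheet

Common Prometheus queries (PromQL):
  # QPS (per-second rate averaged over the last 5 minutes)
  sum(rate(http_requests_total[5m]))

  # Error ratio (aggregate both sides so the label sets match)
  sum(rate(http_requests_total{status=~"5.."}[5m])) /
  sum(rate(http_requests_total[5m]))

  # P99 latency
  histogram_quantile(0.99, sum by (le) (rate(http_duration_seconds_bucket[5m])))

  # CPU utilization (%)
  100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)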

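Common alert thresholds (for reference):
  CPU utilization > 80% sustained for 5 minutes
  Memory utilization > 85%
  Disk utilization > 85%
  Error rate > 1% sustained for 5 minutes
  Job failure rate > 10%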
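The P99 latency query in the cheat sheet relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. The sketch below is a simplified Python model of that idea, not the Prometheus source; the bucket bounds reuse the job-duration buckets from the Python example and the counts are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with a +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound  # cannot interpolate: report the lower edge
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts per job-duration bucket (seconds); counts are made up
buckets = [(300, 10), (900, 40), (1800, 70), (3600, 90),
           (7200, 98), (14400, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p50, p99)
```

Here the median lands a third of the way into the 900 to 1800 s bucket (1200 s) and P99 halfway into the 7200 to 14400 s bucket (about 10800 s). Because the estimate can only be as fine as the bucket it falls in, wide top buckets make high percentiles coarse.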