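549_Monitoring and Alerting System Design
In One Sentence
A monitoring and alerting system is the "nervous system" of a production environment: it collects metrics in real time, detects anomalies, and automatically notifies the people responsible.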
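Core Concepts
The Four Golden Signals of Monitoring (from SRE practice)
1. Latency: how long a request takes to complete
- Track latency for successful and failed requests separately; fast failures otherwise pull the average down and hide real slowness.
2. Traffic: how much demand is placed on the system
- QPS, number of concurrently running jobs
3. Errors: the proportion of requests that fail
- Share of HTTP 5xx responses, job failure rate
4. Saturation: how "full" the system is
- CPU/memory/disk utilization, queue depth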
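To make the four signals concrete, here is a minimal sketch that summarizes one observation window of request records. The `Request` record, the `golden_signals` function, and the idea of measuring saturation as queue depth over queue capacity are illustrative assumptions, not part of any monitoring library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to complete the request, in seconds
    status: int        # HTTP status code


def golden_signals(requests, window_s, queue_depth, queue_capacity):
    """Summarize one observation window into the four golden signals."""
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r.duration_s for r in requests if r.status >= 500]
    total = len(requests)
    return {
        # Latency is reported separately for successful and failed requests.
        "latency_ok_avg_s": sum(ok) / len(ok) if ok else 0.0,
        "latency_err_avg_s": sum(failed) / len(failed) if failed else 0.0,
        # Traffic: requests per second over the window.
        "traffic_qps": total / window_s,
        # Errors: share of requests that returned HTTP 5xx.
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation: how full the (hypothetical) work queue is.
        "saturation": queue_depth / queue_capacity,
    }


sample = [Request(0.2, 200), Request(0.4, 200), Request(0.01, 503), Request(0.3, 200)]
signals = golden_signals(sample, window_s=2.0, queue_depth=50, queue_capacity=200)
print(signals)
```

In this sample the single failed request is very fast, so averaging it together with the successful ones would understate the latency that users of working requests actually see.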
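Monitoring Layers
Infrastructure monitoring
- Servers: CPU, memory, disk, network
- Containers: Pod status, resource usage
Application monitoring
- API latency, error rate, QPS
- Job success/failure rate
Business monitoring
- Jobs submitted per day
- Trend of average analysis duration
- User activity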
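Alerting Strategy
- P1 (urgent): service outage, risk of data loss → phone call immediately
- P2 (severe): error rate > 5%, disk usage > 90% → respond within 10 minutes
- P3 (warning): elevated latency, CPU sustained above 80% → respond within 1 hour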
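The tiers above can be sketched as a small classification function. This is only an illustration: `classify_alert` and its parameters are hypothetical names, the "elevated latency" condition is left out because no numeric threshold is given for it, and the "sustained" part of the CPU rule is not modeled (a single sample is checked).

```python
def classify_alert(service_up, error_rate, disk_usage, cpu_usage, data_loss_risk=False):
    """Return 'P1', 'P2', 'P3', or None using the thresholds listed above.

    Ratios are expressed as fractions (0.05 means 5%).
    """
    if not service_up or data_loss_risk:
        return "P1"
    if error_rate > 0.05 or disk_usage > 0.90:
        return "P2"
    if cpu_usage > 0.80:
        return "P3"
    return None


# Expected response for each tier, taken from the list above.
RESPONSE = {
    "P1": "phone call immediately",
    "P2": "respond within 10 minutes",
    "P3": "respond within 1 hour",
}

level = classify_alert(service_up=True, error_rate=0.08, disk_usage=0.40, cpu_usage=0.30)
print(level, "->", RESPONSE[level])
```

Checking the most severe condition first means a host that is both down and out of disk is reported once, at the highest tier.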
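Hands-On Code / Design Diagrams / Templates
Prometheus + Grafana Monitoring Architecture
[Monitored service]
    exposes a /metrics endpoint (HTTP)
          │
[Prometheus Server]  ← scrapes metrics periodically (pull model)
    - stores time-series data (TSDB)
    - evaluates alerting rules
          │
     ┌────┴────┐
     ▼         ▼
[Grafana]   [Alertmanager]
(dashboards)     │ alert routing
                 ├── Email
                 ├── DingTalk / Feishu
                 └── PagerDuty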
Exposing Metrics from a Python Service
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define the metrics
# Counter: only ever increases (total requests, total errors)
job_total = Counter(
    'bioinf_job_total',
    'Total jobs submitted',
    ['pipeline_type', 'status']  # labels: split by pipeline and outcome
)
# Histogram: distribution statistics (latency)
job_duration = Histogram(
    'bioinf_job_duration_seconds',
    'Job execution time in seconds',
    ['pipeline_type'],
    buckets=[300, 900, 1800, 3600, 7200, 14400]  # 5 min to 4 h
)
# Gauge: can go up and down (number of jobs currently running)
active_jobs = Gauge(
    'bioinf_active_jobs',
    'Currently running jobs',
    ['pipeline_type']
)
# Expose the metrics port
start_http_server(8001)  # http://localhost:8001/metrics


# Use the metrics in business code
def run_analysis_job(job_id: str, pipeline: str):
    active_jobs.labels(pipeline_type=pipeline).inc()
    start_time = time.perf_counter()  # monotonic clock, unaffected by clock changes
    try:
        # execute_pipeline is the application's own pipeline runner (not defined here)
        result = execute_pipeline(job_id, pipeline)
        job_total.labels(pipeline_type=pipeline, status='success').inc()
        return result
    except Exception:
        job_total.labels(pipeline_type=pipeline, status='failed').inc()
        raise
    finally:
        # Runs on both success and failure, so every job is timed and the gauge never leaks
        duration = time.perf_counter() - start_time
        job_duration.labels(pipeline_type=pipeline).observe(duration)
        active_jobs.labels(pipeline_type=pipeline).dec()
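Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound (`le`), and an implicit `+Inf` bucket counts everything. The following standard-library sketch mimics that counting for the bucket boundaries used above; `cumulative_bucket_counts` is a hypothetical helper written for illustration, not a `prometheus_client` function.

```python
BUCKETS = [300, 900, 1800, 3600, 7200, 14400]  # same boundaries as job_duration


def cumulative_bucket_counts(durations, buckets=BUCKETS):
    """Count observations per cumulative bucket, keyed by upper bound (le)."""
    bounds = list(buckets) + [float("inf")]  # the +Inf bucket always exists
    return {le: sum(1 for d in durations if d <= le) for le in bounds}


# Five job durations in seconds: 2 min, 10 min, ~33 min, ~42 min, ~5.5 h
counts = cumulative_bucket_counts([120, 600, 2000, 2500, 20000])
for le, count in counts.items():
    print(f"le={le}: {count}")
```

The job that ran for about 5.5 hours lands only in the `+Inf` bucket, which is a hint that the largest boundary (4 h) may be too small if such jobs are common.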
Prometheus Alerting Rules
# alert_rules.yml
groups:
  - name: bioinf_platform_alerts
    rules:
      # Job failure rate too high
      # Both sides are summed per pipeline; dividing the raw series would match
      # the failed series only against itself and always yield 1.
      - alert: HighJobFailureRate
        expr: |
          sum by (pipeline_type) (rate(bioinf_job_total{status="failed"}[5m])) /
          sum by (pipeline_type) (rate(bioinf_job_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job failure rate too high"
          description: "Job failure rate over the past 5 minutes = {{ $value | humanizePercentage }}"
      # Disk usage above 85%
      - alert: DiskSpaceWarning
        expr: |
          (node_filesystem_size_bytes{mountpoint="/data"} -
           node_filesystem_free_bytes{mountpoint="/data"}) /
          node_filesystem_size_bytes{mountpoint="/data"} > 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low"
          description: "Data disk usage = {{ $value | humanizePercentage }}"
      # Job queue backlog
      - alert: JobQueueBacklog
        expr: bioinf_queue_depth > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Job queue backlog"
          description: "Queue depth = {{ $value }}"
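The `for` clause in the rules above delays an alert until its expression has held continuously for the given duration. Below is a simplified model of that behavior, assuming one sample per rule evaluation; `alert_states` is a hypothetical function for illustration and ignores other features of real Prometheus rule evaluation.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    samples: list of (timestamp_s, value), one per rule evaluation, in time order.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


# Queue depth evaluated every 5 minutes against "> 1000 for 15m".
queue_depth = [(0, 1200), (300, 1500), (600, 800), (900, 1100),
               (1200, 1300), (1500, 1400), (1800, 1600)]
states = alert_states(queue_depth, threshold=1000, for_seconds=15 * 60)
print(states)
```

The brief dip at the third evaluation resets the timer, so the alert only fires once the backlog has persisted for a full 15 minutes afterwards. This is how `for` filters out short spikes.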
Simple Log-Based Alerting (Python)
import logging
import smtplib
from email.mime.text import MIMEText


class AlertHandler(logging.Handler):
    """Log alert handler: emails every record at ERROR level or above."""

    def __init__(self):
        # Let the logging framework filter by level instead of checking in emit()
        super().__init__(level=logging.ERROR)

    def emit(self, record):
        try:
            self.send_alert(
                subject=f"[ALERT] {record.funcName}: {record.getMessage()}",
                body=self.format(record),
            )
        except Exception:
            # A mail failure must not crash the code that logged the error
            self.handleError(record)

    def send_alert(self, subject: str, body: str):
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = 'alert@example.com'
        msg['To'] = 'oncall@example.com'
        with smtplib.SMTP('smtp.example.com') as server:
            server.send_message(msg)


# Configuration
logger = logging.getLogger('bioinf')
logger.addHandler(AlertHandler())
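Common Interview Questions
| Question | Reference answer |
|---|---|
| Push vs. Pull? | Prometheus uses Pull: the server controls the scrape schedule, and a failed scrape reveals a down target. Graphite uses Push |
| How do you reduce alert noise? | Set `for` (a minimum duration) on rules and tier alerts by severity |
| How do you do capacity planning? | Forecast from usage trends and expand about 15 days before the projected limit |
| What are SLO and SLA? | SLO = a target (internal); SLA = an agreement (an external commitment) |
| How do you monitor batch jobs? | Track job completion rate, average duration, and queue depth |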
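The capacity-planning answer (trend forecasting, expand about 15 days ahead) can be sketched with a least-squares linear fit over daily usage samples. This is one simple way to forecast, shown under the assumption that growth is roughly linear; `days_until_full` and the sample figures are hypothetical.

```python
def days_until_full(daily_usage, capacity):
    """Fit a straight line to daily usage and estimate days until capacity is hit.

    daily_usage: one measurement per day, oldest first (at least two values).
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope  # day index where the line hits capacity
    return day_full - (n - 1)                  # measured from the latest sample


usage_gb = [700, 710, 720, 730, 740]  # hypothetical data disk usage, last 5 days
remaining = days_until_full(usage_gb, capacity=1000)
needs_expansion = remaining is not None and remaining <= 15
print(remaining, needs_expansion)
```

With growth of 10 GB per day, the disk has about 26 days left, so the 15-day expansion trigger has not been reached yet.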
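Quick Reference
Common Prometheus queries (PromQL):
# QPS (over the past 5 minutes)
rate(http_requests_total[5m])
# Error rate (aggregate both sides; dividing raw series would match each 5xx series with itself)
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# P99 latency
histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m]))
# CPU utilization (%)
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)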
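The P99 query above estimates a quantile from cumulative bucket counts rather than from raw samples. The sketch below is a simplified standard-library model of that estimation, assuming linear interpolation inside the bucket that contains the target rank and a lower bound of 0 for the first bucket. The function and the sample buckets are illustrative, not Prometheus code.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
             with the last entry being the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # The quantile lies beyond the last finite bound; report that bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Hypothetical request latencies in seconds: 100 observations in total.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p50 = histogram_quantile(0.50, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(p50, p95, p99)
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries bracket the latencies of interest. Here P99 falls into the `+Inf` bucket, so the estimate is capped at the largest finite bound.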
Common alert thresholds (for reference):
CPU utilization > 80% for 5 minutes
Memory utilization > 85%
Disk utilization > 85%
Error rate > 1% for 5 minutes
Job failure rate > 10%