PagerDuty Alerting

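Overview: PagerDuty is an enterprise incident-management and alerting platform. It takes alerts from monitoring systems (Prometheus, Datadog, New Relic, and others), routes them to the responsible on-call responder according to predefined rules, and notifies that person by phone call, SMS, email, or app push, so that critical problems are not missed and unacknowledged incidents are escalated on time.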

Core Concepts

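Concept	Plain-language explanation
Incident	A problem that needs human attention (opened by one or more alerts)
Service	The entity that receives alerts (for example "Database Service" or "API Service")
Escalation Policy	The rule that automatically notifies the next person when the first one does not respond
On-call Schedule	Defines who is responsible for responding to alerts during which time window
Integration	A connection to a third-party monitoring tool (Prometheus, Datadog, Grafana, etc.)
Alert	The raw notification received from a monitoring system
Dedup Key	Deduplication key: alerts with the same key are merged into one Incident, which prevents alert storms
Routing Key	The identifier that authenticates an Events API v2 request and routes it to a Service integration
Severity	One of four levels: critical > error > warning > info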
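
To make the Dedup Key row concrete, here is a minimal, purely local sketch of how alerts that share a key collapse into one incident. The function `group_by_dedup_key` and the sample alerts are hypothetical illustrations; this models the idea only and is not PagerDuty's implementation.

```python
from collections import defaultdict


def group_by_dedup_key(alerts):
    """Group alert dicts by their 'dedup_key' (hypothetical local model)."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["dedup_key"]].append(alert["summary"])
    return dict(incidents)


# Three alerts, two of which describe the same underlying problem.
alerts = [
    {"dedup_key": "db-01-cpu-high", "summary": "CPU 91%"},
    {"dedup_key": "db-01-cpu-high", "summary": "CPU 95%"},
    {"dedup_key": "web-01-disk-full", "summary": "Disk 98%"},
]
incidents = group_by_dedup_key(alerts)
for key, summaries in incidents.items():
    print(key, summaries)
```

Three alerts produce two groups here, which is the behaviour the table describes: one page per problem rather than one page per alert.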

In Plain Terms

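Imagine you run a hospital with doctors on call around the clock. Several monitors may alarm at the same time (blood pressure, heart rate, and blood oxygen all out of range), yet they all concern the same patient. You want three things:

1. The alarms are merged automatically into a single "patient in bed 1 is critical" event.
2. The on-call doctor is notified first; if nobody answers within 2 minutes, the department head is called.
3. If the department head does not answer either, the hospital director is called.

PagerDuty is that "smart alert dispatch system":

- It integrates with monitoring tools (the equivalent of the bedside monitors).
- It deduplicates and merges alerts (the same problem does not page people repeatedly).
- It notifies level by level according to the escalation policy (on-call engineer → lead → owner).
- It supports several notification channels (app push, phone call, SMS).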
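
The escalation chain in the hospital analogy can be sketched as a small local simulation. Everything here is illustrative: `who_to_notify` is a hypothetical helper, and the 2-minute timeouts for the second and third levels are assumptions (the text only gives 2 minutes for the first level).

```python
def who_to_notify(minutes_unacknowledged, levels):
    """Return the responders that have been notified so far.

    levels is a list of (responder, timeout_minutes) in escalation order;
    each timeout is how long that level has before the next one is notified.
    """
    notified = []
    elapsed = 0
    for responder, timeout in levels:
        if minutes_unacknowledged < elapsed:
            break  # this level has not been reached yet
        notified.append(responder)
        elapsed += timeout
    return notified


policy = [
    ("on-call doctor", 2),
    ("department head", 2),    # assumed timeout
    ("hospital director", 2),  # assumed timeout
]

for minutes in (0, 2, 5):
    print(minutes, who_to_notify(minutes, policy))
```

Acknowledging the incident is what stops this walk down the list, which is why the acknowledge action matters as much as resolve.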

Installation and Configuration

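Step 1: Sign up for PagerDuty

1. Go to https://www.pagerduty.com/sign-up/
2. Start the 14-day free trial (a paid plan is needed afterwards; the free tier has limited features)
3. After creating the account, open the Operations Console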

Step 2: Create a Service (the alert receiver)

UI path:
Services → Service Directory → + New Service

Settings:
- Service Name: "Bioinfo API"
- Description: "Bioinformatics analysis API monitoring"
- Escalation Policy: choose an existing policy or create a new one
- Alert Grouping: enabled (merges related alerts)
- Integrations: choose the integration source (e.g. Custom Event Transformer)

Step 3: Create an On-call Schedule

UI path:
People → On-Call Schedules → + New On-Call Schedule

Example configuration:
- Monday to Friday, 09:00-18:00: Engineer A
- Monday to Friday, 18:00 to 09:00 the next day: Engineer B
- Weekends: Engineer C
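
The example schedule can be written down as a lookup function, which also exposes two edge cases the bullet list leaves open. This is a hypothetical local sketch, not a PagerDuty API call. It assumes the weekend shift covers all of Saturday and Sunday starting at 00:00, and that Monday 00:00-09:00 falls to Engineer B.

```python
from datetime import datetime


def on_call(at: datetime) -> str:
    """Return who is on call at the given local time (illustrative only)."""
    if at.weekday() >= 5:        # Saturday=5, Sunday=6
        return "Engineer C"      # assumption: weekend wins over Friday's night shift
    if 9 <= at.hour < 18:
        return "Engineer A"      # weekday business hours
    return "Engineer B"          # weekday evenings, nights, and early mornings


print(on_call(datetime(2026, 5, 13, 10, 30)))  # a Wednesday morning
print(on_call(datetime(2026, 5, 13, 22, 0)))   # the same Wednesday, late evening
print(on_call(datetime(2026, 5, 16, 12, 0)))   # a Saturday
```

When building the real schedule in the UI, it is worth checking the same boundaries (Friday night into Saturday, Sunday night into Monday) so that no window is left uncovered.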

Step 4: Get the Routing Key (integration key)

Path: Services → your service → Integrations → Add integration
Select: Events API v2
After saving you receive the Routing Key (a 32-character key)
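
Because the Routing Key is described as a 32-character alphanumeric string, a cheap local format check can catch copy-paste mistakes before any request is sent. The helper name `looks_like_routing_key` is hypothetical, and the check only validates the shape of the string, not whether PagerDuty recognises the key.

```python
import re

# Shape check only: exactly 32 ASCII letters or digits.
ROUTING_KEY_RE = re.compile(r"[0-9A-Za-z]{32}")


def looks_like_routing_key(value: str) -> bool:
    """Return True if value has the expected routing-key shape."""
    return ROUTING_KEY_RE.fullmatch(value) is not None


print(looks_like_routing_key("0123456789abcdef0123456789abcdef"))
print(looks_like_routing_key("your_32_char_routing_key"))  # placeholder text, wrong shape
```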

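Installing the Python SDK

# Install the official Python SDK (recommended for new projects)
pip install pagerduty

# Note: the older pdpyras package is deprecated; use pagerduty instead.
# pip install pdpyras               # legacy, deprecated, do not use

# Verify the installation
python -c "import pagerduty; print('PagerDuty SDK installed')"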

Basic Usage

Option 1: Send alerts with the Events API v2 (plain HTTP, no SDK required)

from typing import Optional

import requests


def trigger_alert(
    routing_key: str,                 # from the PagerDuty Service integration
    summary: str,                     # alert summary
    severity: str,                    # critical / error / warning / info
    source: str,                      # origin (host name or service name)
    dedup_key: Optional[str] = None,  # alerts with the same key merge into one Incident
    details: Optional[dict] = None,   # extra details
):
    """Send a trigger event to PagerDuty."""

    payload = {
        "routing_key": routing_key,       # authenticates and routes the event
        "event_action": "trigger",        # action type: trigger (open an alert)
        "payload": {
            "summary": summary,           # required, shown in notifications
            "severity": severity,         # required
            "source": source,             # required
            "custom_details": details or {},  # free-form details
        },
        # Optional: attach links
        "links": [
            {
                "href": "http://grafana.example.com/dashboard",
                "text": "Grafana dashboard",
            }
        ],
    }
    if dedup_key is not None:
        # Send dedup_key only when the caller supplied one;
        # otherwise PagerDuty generates a key and returns it.
        payload["dedup_key"] = dedup_key

    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",  # Events API v2 endpoint
        json=payload,
        timeout=10,
    )
    response.raise_for_status()              # raises on a 4xx/5xx response
    return response.json()                   # the response includes the dedup_key


# Usage example: database CPU alert
result = trigger_alert(
    routing_key="your_32_char_routing_key",
    summary="Database CPU usage above 90%",
    severity="critical",
    source="db-server-01",
    dedup_key="db-01-cpu-high",              # fixed key: repeats of this problem reuse the same Incident
    details={
        "cpu_percent": 93.5,
        "process_count": 248,
        "top_process": "mysql",
        "timestamp": "2026-05-13T10:30:00Z",
    }
)
print(f"Alert sent, status: {result['status']}")  # expected: success
print(f"Dedup Key: {result['dedup_key']}")        # your key, or a generated one
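
Since a trigger event needs summary, severity, and source, and severity must be one of four values, it can help to check an event locally before sending it. The sketch below is a hypothetical pre-flight check (`validate_event` is not part of any PagerDuty library) and covers only the rules stated in this document.

```python
VALID_SEVERITIES = {"critical", "error", "warning", "info"}
REQUIRED_FIELDS = ("summary", "severity", "source")


def validate_event(event: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not event.get("routing_key"):
        problems.append("routing_key is missing")
    body = event.get("payload") or {}
    for field in REQUIRED_FIELDS:
        if not body.get(field):
            problems.append(f"payload.{field} is missing")
    severity = body.get("severity")
    if severity and severity not in VALID_SEVERITIES:
        problems.append(f"payload.severity {severity!r} is not valid")
    return problems


good_event = {
    "routing_key": "0123456789abcdef0123456789abcdef",
    "event_action": "trigger",
    "payload": {"summary": "CPU high", "severity": "critical", "source": "db-server-01"},
}
bad_event = {
    "routing_key": "0123456789abcdef0123456789abcdef",
    "event_action": "trigger",
    "payload": {"summary": "CPU high", "severity": "fatal"},
}
print(validate_event(good_event))
print(validate_event(bad_event))
```

Running a check like this before the HTTP call turns a remote 400 response into a local, descriptive error.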

Option 2: Acknowledge and resolve alerts

import requests

def resolve_alert(routing_key: str, dedup_key: str):
    """Resolve an alert (the matching Incident is closed automatically)."""

    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": routing_key,
            "event_action": "resolve",   # resolve the alert
            "dedup_key": dedup_key,      # must match the dedup_key used when triggering
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

def acknowledge_alert(routing_key: str, dedup_key: str):
    """Acknowledge an alert (someone is aware of the problem; escalation stops)."""

    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": routing_key,
            "event_action": "acknowledge",  # acknowledge the alert
            "dedup_key": dedup_key,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


# Usage example: the problem is fixed, so clear the alert
resolve_alert("your_routing_key", "db-01-cpu-high")
print("Alert resolved; the matching Incident is closed automatically")

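Option 3: Use the official Python SDK

import pagerduty

# Initialize the Events API client (for sending alerts).
# The key is passed positionally; it is the Service integration (routing) key.
events_client = pagerduty.EventsApiV2Client("your_32_char_routing_key")

# Trigger an alert and keep the returned dedup_key
dedup_key = events_client.trigger(
    summary="API response time above 5 seconds",  # alert summary
    source="api-server-01",                        # origin
    severity="critical",                           # severity
    # dedup_key="api-latency-high",                # optional; generated if omitted
)
print(f"Alert sent, Dedup Key: {dedup_key}")

# Acknowledge the alert (seen, being worked on)
events_client.acknowledge(dedup_key)
print("Alert acknowledged")

# Resolve the alert (problem fixed)
events_client.resolve(dedup_key)
print("Alert resolved")

# -----

# Initialize the REST API client (for managing the account)
rest_client = pagerduty.RestApiV2Client(
    "your_user_api_key"  # note: this is a REST API key, not a Routing Key
)
# With an account-level API key, incident updates also need a From address:
# pagerduty.RestApiV2Client(api_key, default_from="user@example.com")

# List Incidents that are currently triggered and high urgency
incidents = rest_client.list_all(
    'incidents',
    params={
        'statuses[]': ['triggered'],   # triggered only
        'urgencies[]': ['high'],        # high urgency only
    }
)
for incident in incidents:
    print(f"Incident ID: {incident['id']}, title: {incident['title']}")

# Acknowledge the Incidents in bulk
for i in incidents:
    i['status'] = 'acknowledged'       # change the status
updated = rest_client.rput('incidents', json=incidents)  # bulk update
print(f"Acknowledged {len(updated)} Incident(s)")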

Advanced Usage

Integrating Prometheus Alertmanager

# prometheus/alertmanager.yml
# Route Prometheus alerts to PagerDuty

global:
  resolve_timeout: 5m            # an alert with no end time is treated as resolved if it is not re-sent within 5 minutes

route:
  receiver: 'pagerduty-critical' # default receiver
  routes:
    - match:
        severity: critical        # critical alerts go to the high-urgency service
      receiver: 'pagerduty-critical'
    - match:
        severity: warning         # warning alerts go to the low-urgency service
      receiver: 'pagerduty-warning'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'your_routing_key_for_critical'  # Routing Key of the critical service
        send_resolved: true        # send a resolve event when the alert clears
        description: '{{ .CommonAnnotations.summary }}'  # alert description
        details:
          summary: '{{ .CommonAnnotations.summary }}'
          description: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-warning'
    pagerduty_configs:
      - routing_key: 'your_routing_key_for_warning'
        send_resolved: true
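
The routing section of the configuration above follows a first-match rule: an alert goes to the first child route whose labels match, and falls back to the default receiver otherwise. The sketch below is a deliberately simplified Python model of that rule. It ignores Alertmanager features such as `continue`, nested routes, and regex matchers, and the function `pick_receiver` is hypothetical.

```python
def pick_receiver(labels, routes, default_receiver):
    """Return the receiver of the first route whose 'match' labels all equal the alert's labels."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            return route["receiver"]
    return default_receiver  # no child route matched


routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "pagerduty-warning"},
]

for severity in ("critical", "warning", "info"):
    print(severity, "->", pick_receiver({"severity": severity}, routes, "pagerduty-critical"))
```

Note the consequence for alerts with any other severity: with this configuration they fall through to the default receiver, which is the critical service.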

Integrating Grafana

To configure PagerDuty notifications in Grafana:
1. Menu → Alerting → Contact points → New contact point
2. Type: PagerDuty
3. Integration Key: enter the Routing Key
4. Severity: dynamic (derived from the alert's level)
5. Bind this contact point in the Alert Rule

Automated Operations Script

import time

import pagerduty
import requests

def monitor_and_alert(check_func, routing_key: str, alert_key: str):
    """
    Monitoring loop: check continuously, alert when a problem appears,
    and resolve automatically when it clears.

    Parameters:
    - check_func: a check function returning an (is_ok, message) tuple
    - routing_key: PagerDuty Service Routing Key
    - alert_key: unique identifier of the alert (used as the dedup_key)
    """
    events_client = pagerduty.EventsApiV2Client(routing_key)

    dedup_key = alert_key    # a fixed key gives deduplication
    is_alerting = False       # whether an alert is currently open

    while True:
        is_ok, message = check_func()   # run the check

        if not is_ok and not is_alerting:
            # Problem appeared: send the alert (severity defaults to critical)
            events_client.trigger(
                summary=message,
                source="auto-monitor",
                dedup_key=dedup_key,
            )
            is_alerting = True
            print(f"[ALERT] {message}")

        elif is_ok and is_alerting:
            # Problem cleared: send resolve
            events_client.resolve(dedup_key)
            is_alerting = False
            print("[RESOLVED] Problem resolved")

        time.sleep(60)    # check every 60 seconds


# Usage example: monitor API availability
def check_api():
    try:
        resp = requests.get("http://localhost:8000/health", timeout=5)
        if resp.status_code == 200:
            return True, "API healthy"
        else:
            return False, f"API returned unexpected status code: {resp.status_code}"
    except requests.RequestException as e:
        return False, f"API unreachable: {e}"

# Run the monitor (loops forever)
if __name__ == "__main__":
    monitor_and_alert(check_api, "your_routing_key", "api-health-check")
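
The decision logic of the monitoring loop can be separated from the network calls, which makes it easy to test without contacting PagerDuty. The following sketch shows one way to do that; `next_action` and `replay` are hypothetical helpers introduced here for illustration.

```python
def next_action(is_ok: bool, is_alerting: bool):
    """Return (action, new_is_alerting); action is 'trigger', 'resolve', or None."""
    if not is_ok and not is_alerting:
        return "trigger", True       # healthy -> failing: open an alert
    if is_ok and is_alerting:
        return "resolve", False      # failing -> healthy: close it
    return None, is_alerting         # no state change, nothing to send


def replay(results):
    """Feed a sequence of check results through next_action and collect the actions."""
    actions = []
    alerting = False
    for ok in results:
        action, alerting = next_action(ok, alerting)
        if action is not None:
            actions.append(action)
    return actions


# Healthy, two failed checks in a row, then recovery.
print(replay([True, False, False, True, True]))
```

Only the first failure and the first recovery produce an event, so a long outage sends one trigger rather than one per check interval.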

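Common Errors

Error	Cause	Fix
`HTTP 400 Bad Request`	Malformed request	Check the JSON and make sure payload contains summary, severity, and source
`HTTP 401 Unauthorized`	Wrong or missing credential	Fetch the key again, and do not confuse the User API Key (REST API) with the Routing Key (Events API)
`HTTP 429 Too Many Requests`	Rate limit exceeded	Add retry logic with exponential backoff and avoid alert storms
`Invalid routing_key`	Malformed Routing Key	The Routing Key should be a 32-character alphanumeric string
`Incident not created`	The alert was merged or deduplicated	Check whether the dedup_key matches an existing Incident
`Service disabled`	The Service has been disabled	Re-enable the Service in the UI
`Connection timeout`	Network connectivity problem	Check the firewall; the sending host needs outbound access to events.pagerduty.com:443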
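
The retry-with-exponential-backoff advice for HTTP 429 can be sketched as a small wrapper. This is an illustration under stated assumptions: `RateLimited` is a hypothetical exception standing in for a 429 response, `send` is any zero-argument callable that performs the request, and the delays (1 s, 2 s, 4 s, ...) are arbitrary starting values rather than PagerDuty recommendations.

```python
import time


class RateLimited(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def send_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(); on RateLimited wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: give up
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake sender that is rate-limited twice, then succeeds.
calls = {"n": 0}


def flaky_send():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return "ok"


delays = []                                # record delays instead of sleeping
result = send_with_backoff(flaky_send, sleep=delays.append)
print(result, delays)
```

Injecting the `sleep` function keeps the example fast to run and makes the delay sequence easy to inspect; in real use the default `time.sleep` applies.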

Quick Reference

# ===== Events API v2: the three event actions =====
# trigger     open a new alert (creates or updates an Incident)
# acknowledge acknowledge the alert (stops escalation notifications; someone is on it)
# resolve     resolve the alert (closes the Incident)

# ===== Severity =====
# critical    critical, act immediately (usually triggers a phone call)
# error       error, needs handling
# warning     warning, keep an eye on it
# info        informational, recorded only

# ===== API endpoints =====
# Events API v2: https://events.pagerduty.com/v2/enqueue
# REST API v2:   https://api.pagerduty.com/
# EU region:     https://events.eu.pagerduty.com/v2/enqueue

# ===== Common REST API calls =====
# GET  /incidents              list Incidents
# GET  /incidents/{id}         get one Incident
# PUT  /incidents/{id}         update an Incident (acknowledge/resolve)
# GET  /services               list Services
# GET  /schedules              list on-call schedules
# GET  /users                  list users

# ===== Python SDK quick reference =====
# pip install pagerduty
# events_client = pagerduty.EventsApiV2Client(routing_key)
# dedup_key = events_client.trigger(summary, source)  # trigger
# events_client.acknowledge(dedup_key)                # acknowledge
# events_client.resolve(dedup_key)                    # resolve
# rest_client = pagerduty.RestApiV2Client(api_key)
# rest_client.list_all('incidents')                   # list all Incidents

# ===== dedup_key naming convention =====
# {service}-{check-type}: db-01-cpu-high
# {host}-{metric}: web-server-01-disk-full
# Alerts with the same dedup_key are merged into one Incident (deduplication)
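
A small helper can keep dedup keys consistent with the naming convention above. This is a hypothetical sketch: `make_dedup_key` is not part of the PagerDuty SDK, and the normalisation rules (lowercase, runs of non-alphanumeric characters replaced by a hyphen) are assumptions chosen to reproduce the two example keys.

```python
import re


def make_dedup_key(*parts: str) -> str:
    """Join parts into a lowercase, hyphen-separated key such as 'db-01-cpu-high'."""
    cleaned = []
    for part in parts:
        # Lowercase, then collapse anything that is not a-z or 0-9 into one hyphen.
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            cleaned.append(slug)
    return "-".join(cleaned)


print(make_dedup_key("db-01", "CPU High"))
print(make_dedup_key("web-server-01", "disk_full"))
```

Generating keys through one function means the same host and check always map to the same key, which is what lets a later resolve event find the Incident opened by the trigger.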

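Feature	PagerDuty	OpsGenie	VictorOps	Grafana OnCall
Pricing	Paid (14-day trial)	Paid (free tier available)	Paid	Free (Grafana Cloud)
On-call management	Full-featured	Full-featured	Full-featured	Basic
Escalation policies	Flexible	Flexible	Flexible	Limited
Incident analytics	AI-assisted analysis	Basic	None	None
Integrations	700+	200+	100+	100+
Mobile app	iOS/Android	iOS/Android	iOS/Android	iOS/Android
API completeness	Very complete	Complete	Complete	Limited
Best suited for	Mid-size to large enterprises	Mid-size companies	Mid-size companies	Small teams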