773. 生信流程基准测试方法¶

一句话概述：用已知"标准答案"来测试你的分析流程准不准、快不快、用多少资源——就像考试有标准答案一样，基准测试用"标准数据集"来给你的流程打分。

核心知识点速查表¶

概念	白话解释	关键工具
Benchmark	基准测试（用标准数据评估方法）	"考试"
Ground truth	已知的正确答案	"标准答案"
灵敏度/召回率	找到多少真阳性	TP/(TP+FN)
精确度	找到的里有多少是对的	TP/(TP+FP)
F1分数	灵敏度和精确度的调和平均	综合评分
GIAB	基因组参考样本	NA12878/HG001

一、原理（白话版）¶

1.1 为什么需要基准测试？¶

问题：你开发/选择了一个分析流程，怎么知道它好不好？

类比：
  你做了一道数学题，怎么知道做对了？
  → 对答案！
  基准测试就是"对答案"的过程

基准测试的三个维度：
  ① 准确性：结果有多准？（灵敏度、特异性、精确度）
  ② 速度：跑多快？（wall time、CPU time）
  ③ 资源：用多少资源？（内存、磁盘、CPU核数）

常见基准数据集：
  变异检测 → GIAB（Genome in a Bottle）NA12878
  RNA-seq → SEQC/MAQC参考样本
  宏基因组 → CAMI挑战赛模拟数据
  单细胞 → Splatter模拟数据
  比对 → 模拟reads（已知来源位置）

1.2 评估指标¶

基本指标：
  TP(真阳性)：找到了，确实存在
  FP(假阳性)：找到了，实际不存在（误报）
  FN(假阴性)：没找到，实际存在（漏报）
  TN(真阴性)：没找到，确实不存在

衍生指标：
  灵敏度(Recall/Sensitivity) = TP/(TP+FN) → 找全了多少？
  精确度(Precision) = TP/(TP+FP) → 找到的有多少是对的？
  F1 = 2×Precision×Recall/(Precision+Recall) → 综合评分
  特异性(Specificity) = TN/(TN+FP) → 排除了多少假的？

  ROC曲线：Sensitivity vs (1-Specificity)
  PR曲线：Precision vs Recall（类不平衡时更好）

二、变异检测基准测试¶

2.1 使用GIAB标准数据¶

# ===== GIAB(Genome in a Bottle)基准测试 =====
# GIAB提供7个人类参考基因组的"标准答案"
# 最常用：NA12878 (HG001)

# Step 1: 下载GIAB标准数据
# 标准VCF（已知变异）
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz

# 高置信区域BED
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed

# Step 2: 用你的流程检测NA12878的变异
# （假设你已经得到了callset.vcf.gz）

# Step 3: 用hap.py评估
# hap.py是GIAB官方推荐的变异评估工具
# 安装：conda install -c bioconda hap.py

hap.py \
  HG001_benchmark.vcf.gz \          # 标准答案(truth)
  my_callset.vcf.gz \               # 你的结果(query)
  -f HG001_benchmark.bed \          # 高置信区域
  -r GRCh38.fa \                    # 参考基因组
  -o evaluation_results \           # 输出前缀
  --threads 8 \                     # 线程数
  --engine vcfeval                   # 比较引擎(推荐vcfeval)

# hap.py输出解读：
# evaluation_results.summary.csv
# Type   Filter  TRUTH.TP  TRUTH.FN  QUERY.FP  RECALL  PRECISION  F1_SCORE
# SNP    PASS    3200000   50000     30000     0.9846  0.9907     0.9876
# INDEL  PASS    500000    20000     15000     0.9615  0.9709     0.9662

2.2 模拟数据基准测试¶

# ===== 模拟reads基准测试 =====
# 用已知变异生成模拟数据，测试检测能力

# Step 1: 用NEAT/ART模拟reads
# 安装：conda install -c bioconda neat-genreads
neat-genreads \
  -r reference.fa \                  # 参考基因组
  -R 150 \                          # 读段长度150bp
  -c 30 \                           # 30x覆盖度
  -M 0.001 \                        # 突变率0.1%
  --vcf known_variants.vcf \        # 植入已知变异
  -o simulated \                    # 输出前缀
  --pe 300 30                       # 双端测序，插入300bp±30

# Step 2: 用你的流程分析模拟数据
# bwa mem → samtools sort → GATK HaplotypeCaller

# Step 3: 与已知变异比较
bcftools isec \
  -p comparison_dir \               # 输出目录
  known_variants.vcf.gz \           # 标准答案
  my_calls.vcf.gz                    # 你的结果
# 输出：
# 0000.vcf = 只在标准答案中(假阴性FN)
# 0001.vcf = 只在你的结果中(假阳性FP)
# 0002.vcf = 两者共有(真阳性TP)

2.3 Python评估脚本¶

# ===== Python计算基准测试指标 =====
import pandas as pd  # 导入pandas
import numpy as np  # 导入numpy
import matplotlib.pyplot as plt  # 导入matplotlib
from sklearn.metrics import precision_recall_curve, roc_curve, auc  # 导入指标

def calculate_metrics(tp, fp, fn):
    """计算基准测试核心指标"""
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0  # 灵敏度
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0  # 精确度
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0  # F1
    return {
        "TP": tp, "FP": fp, "FN": fn,
        "Recall": round(recall, 4),
        "Precision": round(precision, 4),
        "F1": round(f1, 4)
    }

# 解析hap.py结果
def parse_happy_results(csv_file):
    """解析hap.py的summary.csv"""
    df = pd.read_csv(csv_file)  # 读取CSV
    results = {}  # 存储结果
    for _, row in df.iterrows():  # 遍历每行
        vtype = row["Type"]  # 变异类型(SNP/INDEL)
        filt = row["Filter"]  # 过滤状态
        if filt == "PASS":  # 只看PASS的
            results[vtype] = {
                "Recall": row["METRIC.Recall"],       # 灵敏度
                "Precision": row["METRIC.Precision"],  # 精确度
                "F1": row["METRIC.F1_Score"],          # F1分数
                "TP": row["TRUTH.TP"],                 # 真阳性
                "FP": row["QUERY.FP"],                 # 假阳性
                "FN": row["TRUTH.FN"]                  # 假阴性
            }
    return results

# 多工具对比可视化
def plot_benchmark_comparison(results_dict, output="benchmark_comparison.png"):
    """绘制多工具基准测试对比图"""
    tools = list(results_dict.keys())  # 工具列表
    metrics = ["Recall", "Precision", "F1"]  # 指标列表

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))  # 创建3个子图

    for i, metric in enumerate(metrics):  # 遍历指标
        values = [results_dict[t]["SNP"][metric] for t in tools]  # 提取值
        axes[i].bar(tools, values, color="steelblue")  # 柱状图
        axes[i].set_title(f"SNP {metric}")  # 标题
        axes[i].set_ylim(0.9, 1.0)  # y轴范围
        axes[i].set_ylabel(metric)  # y轴标签
        for j, v in enumerate(values):  # 标注数值
            axes[i].text(j, v + 0.001, f"{v:.4f}", ha="center", fontsize=8)

    plt.tight_layout()  # 紧凑布局
    plt.savefig(output, dpi=300)  # 保存
    plt.close()  # 关闭

三、RNA-seq基准测试¶

# ===== RNA-seq差异表达基准测试 =====
# 使用已知差异表达的模拟数据

# 方法一：用polyester模拟RNA-seq
# R代码：
# library(polyester)
# simulate_experiment("transcripts.fa", reads_per_transcript=300,
#     num_reps=c(3,3), fold_changes=fold_change_matrix)

# 方法二：使用SEQC/MAQC数据
# MAQC提供两个已知浓度比的样本(A和B)
# 混合比例已知 → 可以计算预期差异

# Python评估差异表达结果
import pandas as pd  # 导入pandas

def evaluate_de_results(de_results, truth):
    """评估差异表达分析结果"""
    # de_results: 你的分析结果(基因名 + padj + log2FC)
    # truth: 标准答案(基因名 + 真实差异)

    merged = de_results.merge(truth, on="gene", how="outer")  # 合并

    # 定义阳性：padj < 0.05 且 |log2FC| > 1
    predicted_de = set(merged[
        (merged["padj"] < 0.05) & (abs(merged["log2FC"]) > 1)
    ]["gene"])  # 预测的差异基因

    true_de = set(merged[merged["is_de"] == True]["gene"])  # 真实差异基因

    tp = len(predicted_de & true_de)  # 真阳性
    fp = len(predicted_de - true_de)  # 假阳性
    fn = len(true_de - predicted_de)  # 假阴性

    metrics = calculate_metrics(tp, fp, fn)  # 计算指标
    print(f"差异表达评估: {metrics}")
    return metrics

四、宏基因组基准测试¶

# ===== CAMI挑战赛基准数据 =====
# CAMI(Critical Assessment of Metagenome Interpretation)
# 提供模拟的宏基因组数据集

# 使用OPAL评估分类准确性
# 安装：pip install cami-opal
opal.py \
  -g gold_standard.profile \        # 标准答案(真实物种组成)
  tool1_result.profile \             # 工具1结果
  tool2_result.profile \             # 工具2结果
  tool3_result.profile \             # 工具3结果
  -o opal_output/                    # 输出目录

# OPAL评估指标：
# - 加权UniFrac距离（越小越好）
# - L1 norm误差
# - 各分类层级的Recall/Precision
# - Shannon多样性估计偏差

# ===== Python评估物种组成估计 =====
import pandas as pd  # 导入pandas
import numpy as np  # 导入numpy

def evaluate_composition(predicted, truth, level="genus"):
    """评估物种组成估计的准确性"""
    # Bray-Curtis距离
    merged = pd.merge(
        predicted, truth, on="taxon", how="outer", suffixes=("_pred", "_true")
    ).fillna(0)  # 合并并填充0

    pred_vals = merged["abundance_pred"].values  # 预测丰度
    true_vals = merged["abundance_true"].values  # 真实丰度

    # 归一化
    pred_vals = pred_vals / pred_vals.sum()  # 归一化预测
    true_vals = true_vals / true_vals.sum()  # 归一化真实

    # Bray-Curtis
    bc = np.sum(np.abs(pred_vals - true_vals)) / np.sum(pred_vals + true_vals)

    # 检测到的真实物种数
    true_taxa = set(truth[truth["abundance"] > 0]["taxon"])  # 真实存在的
    pred_taxa = set(predicted[predicted["abundance"] > 0]["taxon"])  # 检测到的

    recall = len(true_taxa & pred_taxa) / len(true_taxa)  # 灵敏度
    precision = len(true_taxa & pred_taxa) / len(pred_taxa) if len(pred_taxa) > 0 else 0

    print(f"{level}水平: BC距离={bc:.4f}, Recall={recall:.4f}, Precision={precision:.4f}")
    return {"bray_curtis": bc, "recall": recall, "precision": precision}

五、性能基准测试¶

# ===== 运行时间和资源监控 =====
import time  # 导入time
import psutil  # 导入psutil（系统资源监控）
import subprocess  # 导入subprocess

def benchmark_tool(command, name="tool"):
    """基准测试一个命令的运行时间和资源使用"""
    # 记录开始状态
    start_time = time.time()  # 开始时间
    start_mem = psutil.virtual_memory().used  # 开始内存

    # 运行命令
    process = subprocess.Popen(
        command, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )

    # 监控峰值内存
    peak_mem = 0  # 峰值内存
    while process.poll() is None:  # 进程还在运行
        try:
            proc = psutil.Process(process.pid)  # 获取进程信息
            mem = proc.memory_info().rss  # 物理内存
            peak_mem = max(peak_mem, mem)  # 更新峰值
        except psutil.NoSuchProcess:
            break
        time.sleep(0.1)  # 每0.1秒检查一次

    # 记录结束
    end_time = time.time()  # 结束时间
    wall_time = end_time - start_time  # 总耗时

    result = {
        "tool": name,                         # 工具名
        "wall_time_sec": round(wall_time, 2),  # 总时间(秒)
        "peak_memory_GB": round(peak_mem / 1e9, 2),  # 峰值内存(GB)
        "exit_code": process.returncode        # 退出码
    }

    print(f"{name}: {wall_time:.1f}s, {peak_mem/1e9:.2f}GB RAM")
    return result

# 多工具性能对比
tools = {
    "BWA-MEM2": "bwa-mem2 mem -t 8 ref.fa reads.fq > bwa.sam",
    "Minimap2": "minimap2 -t 8 -a ref.fa reads.fq > mm2.sam",
    "STAR": "STAR --runThreadN 8 --genomeDir star_idx --readFilesIn reads.fq"
}

results = []  # 结果列表
for name, cmd in tools.items():  # 遍历工具
    r = benchmark_tool(cmd, name)  # 运行基准测试
    results.append(r)  # 收集结果

# 汇总表格
benchmark_df = pd.DataFrame(results)  # 创建DataFrame
print(benchmark_df.to_string(index=False))  # 打印表格

六、常见报错与解决¶

问题	原因	解决方案
`hap.py报错`	VCF格式不标准	用bcftools norm规范化VCF
`F1异常低`	坐标系不一致	确认都用hg38或hg19
`内存监控不准`	子进程未跟踪	用/usr/bin/time -v更准确
`模拟数据太简单`	均匀覆盖	加入GC偏好和PCR重复模拟
`基准数据过时`	老版本标准答案	用GIAB v4.2.1最新版
`多线程结果不一致`	浮点累加顺序	固定线程数确保一致

七、面试高频问题¶

Q1: 什么是好的基准测试？¶

A: ①使用公认的标准数据集(GIAB/CAMI/SEQC)而非自造数据；②评估多维指标(不只看灵敏度，还看精确度和F1)；③测试边缘情况(低覆盖、高GC、重复区)；④报告运行时间和资源消耗；⑤与现有方法公平对比(相同数据、相同评估标准)。

Q2: 灵敏度和精确度哪个更重要？¶

A: 取决于应用场景。临床诊断(不能漏)：灵敏度优先，宁可多报也不能漏报；科研筛选(不想做无用实验)：精确度优先，报的要准。F1分数是两者的平衡指标。ROC-AUC适合阈值选择，PR-AUC适合类不平衡情况。

Q3: GIAB样本有什么局限？¶

A: ①只有少数族裔代表(以欧洲血统为主)；②难变异区域(串联重复、segdup)没有标准答案；③不包含体细胞变异(只有胚系)；④不代表真实临床样本的复杂性(肿瘤纯度、FFPE降解等)。GIAB正在扩展到更多族裔样本。

八、速查表¶

# ===== 基准测试速查 =====

# 变异检测评估(hap.py)
hap.py truth.vcf.gz query.vcf.gz -f conf.bed -r ref.fa -o eval

# 核心指标
# Recall = TP/(TP+FN)       找全了多少
# Precision = TP/(TP+FP)    找到的有多少是对的
# F1 = 2*P*R/(P+R)          综合评分

# 标准数据集
# 变异检测: GIAB NA12878 (HG001)
# RNA-seq: SEQC/MAQC
# 宏基因组: CAMI challenge
# 比对: 模拟reads (ART/NEAT)

# 性能监控
/usr/bin/time -v command 2> timing.txt  # 记录时间和内存

# 最佳实践
# 多指标评估（不只看一个指标）
# 多数据集测试（不只在一个样本上测）
# 与现有方法对比（公平比较）
# 报告软件版本和参数