627_基因组结构变异检测¶

一句话概述： 结构变异（SV）是≥50bp的基因组变异（包括缺失、重复、倒位、易位、插入），Delly、Manta和SvABA是三大主流SV检测工具——Manta综合F1最高（74%），Delly对倒位检测最优，SvABA对小SV灵敏度最高，实践中推荐多工具联合+取交集。

核心知识点速查表¶

概念	白话解释
Structural Variant (SV)	结构变异——基因组中≥50bp的大段DNA变化
Deletion (DEL)	缺失——一段DNA丢失了
Duplication (DUP)	重复——一段DNA多了一份拷贝
Inversion (INV)	倒位——一段DNA翻转了方向
Translocation (TRA/BND)	易位——DNA片段从一条染色体跳到另一条
Insertion (INS)	插入——在基因组中插入了新的DNA片段
Split-read	分裂reads——一条read的两端比对到不同位置
Discordant read pair	不协调读对——双端reads的距离或方向异常
Read depth	读深度——覆盖某区域的reads数量
Breakend (BND)	断点——SV的连接点

一、安装¶

# === 安装三大SV检测工具 ===
conda install -c bioconda delly      # Delly安装
conda install -c bioconda manta      # Manta安装
conda install -c bioconda svaba      # SvABA安装

# === 辅助工具 ===
conda install -c bioconda survivor   # SV合并和比较
conda install -c bioconda samtools   # BAM处理
conda install -c bioconda bcftools   # VCF处理

# 验证安装
delly --version      # 查看Delly版本
configManta.py --version  # 查看Manta版本
svaba --version      # 查看SvABA版本

二、Delly分析¶

2.1 基本SV检测¶

# === Delly SV检测 ===
# 输入：排序好的BAM文件 + 参考基因组

# 第1步：检测SV
delly call \
    -g reference.fa \              # 参考基因组
    -o delly_sv.bcf \              # 输出BCF格式
    -x delly_exclude.tsv \         # 排除区域（如着丝粒、端粒）
    tumor.bam \                    # 肿瘤样本BAM
    normal.bam                     # 配对正常样本BAM

# 第2步：过滤
delly filter \
    -f somatic \                   # 体细胞过滤模式
    -o delly_filtered.bcf \        # 过滤后的输出
    -s samples.tsv \               # 样本信息（样本名\ttumor/control）
    delly_sv.bcf                   # 输入BCF

# samples.tsv格式：
# tumor_sample   tumor
# normal_sample  control

# 第3步：BCF转VCF
bcftools view delly_filtered.bcf > delly_filtered.vcf  # 转换格式

2.2 排除区域文件¶

# === 下载Delly排除区域文件 ===
# 这些区域的SV检测不可靠（重复序列、着丝粒等）
wget https://github.com/dellytools/delly/blob/main/excludeTemplates/human.hg38.excl.tsv

# 查看排除区域
head human.hg38.excl.tsv  # 格式：chr\tstart\tend

2.3 Germline SV检测¶

# === 种系SV检测（群体遗传学/遗传病）===
# 第1步：每个样本独立检测
for bam in samples/*.bam; do
    name=$(basename "$bam" .bam)
    delly call \
        -g reference.fa \
        -o ${name}_delly.bcf \
        -x human.hg38.excl.tsv \
        ${bam}
done

# 第2步：合并所有样本
delly merge \
    -o sites.bcf \                 # 合并的SV位点
    *_delly.bcf                    # 所有样本的BCF

# 第3步：重新基因分型
for bam in samples/*.bam; do
    name=$(basename "$bam" .bam)
    delly call \
        -g reference.fa \
        -v sites.bcf \             # 使用合并的位点
        -o ${name}_genotyped.bcf \
        ${bam}
done

# 第4步：合并基因分型结果
bcftools merge -m id -O b -o merged_genotyped.bcf *_genotyped.bcf

# 第5步：过滤
delly filter \
    -f germline \                  # 种系过滤模式
    -o germline_filtered.bcf \
    merged_genotyped.bcf

三、Manta分析¶

3.1 体细胞SV检测¶

# === Manta：配置+运行 两步走 ===

# 第1步：配置分析
configManta.py \
    --tumorBam tumor.bam \         # 肿瘤BAM
    --normalBam normal.bam \       # 正常BAM
    --referenceFasta reference.fa \  # 参考基因组
    --runDir manta_output           # 输出目录

# 第2步：运行分析
python2 manta_output/runWorkflow.py \
    -j 16 \                        # 16线程
    -m local                       # 本地运行模式

# === 输出文件说明 ===
# manta_output/results/variants/
#   candidateSmallIndels.vcf.gz    — 小indel候选
#   candidateSV.vcf.gz             — SV候选（未过滤）
#   diploidSV.vcf.gz               — 种系SV（二倍体）
#   somaticSV.vcf.gz               — 体细胞SV（关注这个）

# 解压查看结果
zcat manta_output/results/variants/somaticSV.vcf.gz | grep -v "^#" | wc -l
# 查看体细胞SV数量

3.2 仅肿瘤分析（无配对正常样本）¶

# === 仅肿瘤样本（无配对正常）===
configManta.py \
    --tumorBam tumor.bam \         # 只有肿瘤BAM
    --referenceFasta reference.fa \
    --runDir manta_tumor_only

python2 manta_tumor_only/runWorkflow.py -j 16 -m local

# 注意：没有配对正常样本时，假阳性会更多
# 需要额外过滤步骤（如与gnomAD-SV数据库比对去除常见多态性）

3.3 种系SV检测¶

# === 种系SV检测（遗传病分析）===
configManta.py \
    --bam sample.bam \             # 单个样本（不加--tumorBam）
    --referenceFasta reference.fa \
    --runDir manta_germline

python2 manta_germline/runWorkflow.py -j 16 -m local

# 种系结果在 diploidSV.vcf.gz 中

四、SvABA分析¶

# === SvABA：局部组装+比对 ===
svaba run \
    -t tumor.bam \                 # 肿瘤BAM
    -n normal.bam \                # 正常BAM
    -p 16 \                        # 16线程
    -G reference.fa \              # 参考基因组
    -a svaba_output \              # 输出前缀
    -D dbsnp.vcf.gz                # dbSNP数据库（过滤已知变异）

# === 输出文件 ===
# svaba_output.svaba.somatic.sv.vcf       — 体细胞SV
# svaba_output.svaba.somatic.indel.vcf    — 体细胞indel
# svaba_output.svaba.germline.sv.vcf      — 种系SV
# svaba_output.svaba.germline.indel.vcf   — 种系indel

# SvABA的优势：对小SV（50-300bp）检测灵敏度最高
# 因为它使用局部de novo组装重建断点序列

五、多工具合并（SURVIVOR）¶

# === 用SURVIVOR合并多个工具的结果 ===

# 第1步：准备VCF文件列表
ls delly_filtered.vcf manta_somatic.vcf svaba_somatic.vcf > vcf_list.txt

# 第2步：合并
SURVIVOR merge \
    vcf_list.txt \                 # VCF文件列表
    1000 \                         # 断点距离阈值（bp）
    2 \                            # 至少2个工具支持
    1 \                            # 考虑SV类型
    1 \                            # 考虑SV方向（strand）
    0 \                            # 不考虑距离估计
    50 \                           # 最小SV长度（bp）
    merged_sv.vcf                  # 输出合并VCF

# 统计合并结果
echo "=== 合并后SV统计 ==="
grep -v "^#" merged_sv.vcf | awk '{print $5}' | sort | uniq -c | sort -rn
# 输出各类型SV数量

# 第3步：统计每个工具的贡献
SURVIVOR stats merged_sv.vcf -1 -1 -1 stats_output

六、SV注释与可视化¶

6.1 SV注释¶

# === 用AnnotSV注释SV ===
conda install -c bioconda annotsv  # 安装

AnnotSV \
    -SVinputFile merged_sv.vcf \   # 输入SV VCF
    -outputFile annotated_sv \     # 输出前缀
    -genomeBuild GRCh38 \          # 基因组版本
    -SVminSize 50                  # 最小SV大小

# AnnotSV注释信息包括：
# 受影响的基因、基因功能区域、已知SV数据库比对
# OMIM疾病关联、DGV频率、临床分级

6.2 R语言可视化¶

# === R语言SV统计可视化 ===
library(ggplot2)  # 加载ggplot2
library(dplyr)    # 数据处理

# 读取VCF
library(VariantAnnotation)  # VCF处理
vcf <- readVcf("merged_sv.vcf", "hg38")  # 读取VCF

# 提取SV信息
sv_info <- data.frame(
    type = info(vcf)$SVTYPE,               # SV类型
    length = abs(info(vcf)$SVLEN),          # SV长度
    chr = as.character(seqnames(rowRanges(vcf))),  # 染色体
    support = info(vcf)$SUPP                # 工具支持数
)

# 柱状图：各类型SV数量
ggplot(sv_info, aes(x = type, fill = type)) +
    geom_bar() +
    labs(x = "SV Type", y = "Count",
         title = "Structural Variant Types") +
    theme_minimal() +
    scale_fill_brewer(palette = "Set2")

ggsave("sv_type_distribution.pdf", width = 8, height = 5)

# SV长度分布
ggplot(sv_info[sv_info$length > 0 & sv_info$length < 1e6, ],
       aes(x = log10(length), fill = type)) +
    geom_histogram(bins = 50, alpha = 0.7) +
    facet_wrap(~type, scales = "free_y") +
    labs(x = "log10(SV Length)", y = "Count",
         title = "SV Size Distribution") +
    theme_minimal()

ggsave("sv_size_distribution.pdf", width = 12, height = 8)

七、三工具对比¶

特性	Delly	Manta	SvABA
算法	分裂reads+不协调读对	图论组装+分裂reads	局部de novo组装
速度	中等	最快	最慢
综合F1	~65%	~74%（最高）	~60%
缺失检测	好	最好	好
倒位检测	最好	好	中等
易位检测	好	好	好
小SV（<300bp）	中等	好	最好
输出格式	BCF/VCF	VCF.gz	VCF
配对分析	支持	支持	支持
种系分析	支持	支持	支持

八、常见报错与解决¶

报错信息	原因	解决方案
`BAM not indexed`	BAM文件没有索引	`samtools index sample.bam`
`Reference mismatch`	BAM和参考基因组不匹配	确认用的同一版本参考基因组
`No SV candidates`	覆盖度太低	WGS至少需要30x覆盖度
`Manta configManta.py error`	Python版本问题	Manta需要Python 2.7
`SURVIVOR merge error`	VCF格式不统一	统一VCF header和格式
`Delly out of memory`	样本太多/基因组太大	按染色体分别运行

九、面试高频题¶

Q1：SV检测有哪些主要方法？¶

答：四种信号：(1) 不协调读对（discordant read pairs）——双端reads的插入片段大小或方向异常，说明中间有SV；(2) 分裂reads（split reads）——一条read的两部分比对到不同位置，精确定位断点；(3) 读深度（read depth）——某区域覆盖度异常升高（重复）或降低（缺失）；(4) 局部组装（local assembly）——对断点附近的reads做de novo组装重建SV结构。Delly主要用前两种信号，Manta用图论组装+分裂reads，SvABA用局部de novo组装。最可靠的做法是多工具联合，取至少2个工具一致的结果。

Q2：为什么SV检测比SNV检测难？¶

答：几个原因：(1) SV涉及大段DNA变化，短reads（150bp）难以跨越SV断点——一条read可能全部落在SV内部看不到边界；(2) 基因组中充满重复序列，SV经常发生在重复区域，导致reads比对模糊；(3) SV类型多样（缺失/重复/倒位/易位/插入），每种需要不同的检测策略；(4) 假阳性率高——比对错误、文库构建artifact都容易被误判为SV。长reads测序（PacBio/ONT）能大幅改善SV检测，因为单条read可以跨越整个SV。

Q3：实际项目中怎么选择SV检测策略？¶

答：取决于应用场景：(1) 体细胞SV（肿瘤）——用Manta+Delly联合检测，SURVIVOR合并取至少2个工具支持的SV，这样灵敏度和特异性兼顾；(2) 种系SV（遗传病）——Manta种系模式通常足够，用gnomAD-SV过滤常见多态性；(3) 群体SV分型——Delly的群体分析流程最成熟（merge→genotype→filter）；(4) 小SV（50-300bp）——SvABA最灵敏。数据要求：WGS至少30x覆盖度，WES不适合SV检测（外显子捕获会丢失大部分SV信号）。

Q4：Manta为什么综合表现最好？¶

答： Manta用了"图论组装"方法——先用不协调读对和分裂reads构建断点证据图（breakend evidence graph），然后在图上找最可能的SV路径。优势在于：(1) 自适应——不需要针对每种SV类型单独处理；(2) 断点精度高——通过局部组装精确定位断点到碱基级别；(3) 速度快——并行化做得好，比Delly和SvABA都快；(4) 从小indel到大SV都能检测。基准测试（DREAM challenge等）中Manta的综合F1得分最高（约74%），特别是在缺失检测上表现突出。

参考资料：Delly: Rausch et al., Bioinformatics 2012 | Manta: Chen et al., Bioinformatics 2016 | SvABA: Wala et al., Genome Research 2018 | SURVIVOR: Jeffares et al., Nat Commun 2017 | SV benchmark: Kosugi et al., Genome Biol 2019