PopLDdecay — 群体基因组连锁不平衡衰减分析工具¶

一句话说明¶

PopLDdecay 读取 VCF 文件，计算 SNP 之间的 r² 值（连锁不平衡），然后画出 LD 随基因组距离衰减的曲线，用于评估群体的重组率和历史。白话理解：看基因组里相邻 SNP "捆绑程度"随距离怎么变弱。

安装与配置¶

# 方法1：conda 安装（推荐）
conda install -c bioconda popldecay  # 安装 PopLDdecay

# 方法2：从源码编译
git clone https://github.com/BGI-shenzhen/PopLDdecay.git  # 克隆仓库
cd PopLDdecay/
chmod 755 bin/PopLDdecay  # 给执行权限
./bin/PopLDdecay  # 测试运行

# 确认版本（当前 v3.42）
PopLDdecay --version  # 查看版本

# 可视化需要 R 和 ggplot2
conda install -c conda-forge r-ggplot2 r-dplyr  # 安装 R 绘图包

核心用法¶

基本 LD 分析¶

# 输入：VCF 文件（需先过滤低质量 SNP）
# 基本运行（单个群体）
PopLDdecay \
  -InVCF filtered_snps.vcf.gz \    # 输入 VCF（支持 gzip 压缩）
  -OutStat LD_stat.gz \            # 输出统计文件（自动 gzip 压缩）
  -MaxDist 500 \                   # 最大计算距离（kb），500kb
  -Het 0.88 \                      # 最大杂合度过滤阈值（去掉高杂合SNP）
  -Miss 0.1 \                      # 最大缺失率（去掉10%缺失的SNP）
  -MAF 0.05 \                      # 最小等位基因频率（去掉稀有SNP）
  -Thread 8                        # 使用8个线程

# 输出结果可视化（使用自带的 R 脚本）
perl PopLDdecay/bin/Plot_OnePop.pl \  # 调用 Perl 可视化脚本
  -inFile LD_stat.gz \               # 输入统计文件
  -output LD_decay_plot \            # 输出图片前缀
  -keepR                             # 保留 R 脚本（方便自定义）

参数详解¶

参数	含义	推荐值
`-InVCF`	输入 VCF 文件（.vcf 或 .vcf.gz）	必填
`-OutStat`	输出统计文件前缀	必填
`-MaxDist`	最大计算距离（kb）	200-1000（视基因组大小）
`-Het`	杂合度过滤上限（0-1）	0.88
`-Miss`	缺失率过滤上限（0-1）	0.1
`-MAF`	最小等位基因频率	0.05
`-Thread`	并行线程数	CPU 核心数
`-SubPop`	子群体列表文件	用于分群体分析
`-Bin1`	窗口分组大小1（bp）	100
`-Bin2`	窗口分组大小2（bp）	500
`-r2`	r² 阈值（只输出大于此值的对）	0（默认，输出全部）

实战案例¶

案例1：多个群体 LD 衰减对比¶

# 准备各群体样本 ID 列表文件
# pop1_samples.list: 每行一个样本 ID
# pop2_samples.list: 每行一个样本 ID

# 分别计算各群体的 LD
for pop in pop1 pop2 pop3; do
    PopLDdecay \
      -InVCF all_samples_filtered.vcf.gz \  # 总 VCF 文件
      -SubPop ${pop}_samples.list \          # 该群体的样本列表
      -OutStat ${pop}_LD_stat.gz \           # 各群体输出文件
      -MaxDist 500 \                         # 500kb 范围
      -MAF 0.05 \                            # MAF 过滤
      -Miss 0.1 \                            # 缺失率过滤
      -Thread 16                             # 16 线程
done

# 多群体合并可视化
perl PopLDdecay/bin/Plot_MultiPop.pl \
  -inList pop_list.txt \   # 格式：统计文件路径\t群体名称
  -output multi_LD_plot \  # 输出前缀
  -keepR                   # 保留 R 脚本

案例2：用 Python + R 自定义 LD 衰减图¶

import subprocess  # 运行系统命令
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 读取 PopLDdecay 输出的统计文件
# 文件格式：Dist\tr2\t样本量
ld_data = pd.read_csv(
    "LD_stat.gz",           # gzip 压缩文件
    sep="\t",               # 制表符分隔
    comment="#",            # 跳过注释行
    names=["dist", "r2", "n"]  # 列名
)

# 按距离分窗口（bin）计算平均 r²
ld_data["dist_kb"] = ld_data["dist"] / 1000  # 转换为 kb

# 100 bp 一个 bin
bins = np.arange(0, ld_data["dist_kb"].max() + 1, 0.1)  # 0.1kb 一个 bin
ld_data["bin"] = pd.cut(ld_data["dist_kb"], bins=bins,
                         labels=bins[:-1])             # 分组
ld_binned = ld_data.groupby("bin")["r2"].mean()        # 每组平均 r²

# 绘图
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(ld_binned.index.astype(float),  # x轴：距离
        ld_binned.values,               # y轴：平均 r²
        color="#1f77b4",                # 蓝色
        linewidth=2,
        label="Population")
ax.axhline(y=0.1, color="red",          # 画 r²=0.1 参考线
           linestyle="--", label="r²=0.1")
ax.set_xlabel("Distance (kb)")
ax.set_ylabel("Mean r²")
ax.set_title("LD Decay")
ax.legend()
plt.tight_layout()
plt.savefig("LD_decay.pdf", dpi=300)   # 保存为 PDF
print("图片已保存：LD_decay.pdf")

常见报错与解决¶

报错	原因	解决方法
`VCF file not found`	路径错误或文件未解压	检查路径；bgzip+tabix 格式
LD 值全为 1	SNP 过少或样本量太小	至少需要 20 个样本；检查过滤条件
输出为空	MAF 过滤太严或 VCF 无 GT 字段	放宽 MAF 阈值；确认 VCF 含 GT 格式字段
R 脚本报错	ggplot2 未安装	`install.packages("ggplot2")`
计算极慢	MaxDist 设置太大	先用小 MaxDist（100kb）测试

速查表¶

# 安装
conda install -c bioconda popldecay

# 基本命令
PopLDdecay -InVCF snps.vcf.gz -OutStat ld.gz -MaxDist 500 -MAF 0.05 -Miss 0.1

# 多群体分析用 -SubPop 传入样本列表文件（每行一个样本ID）

# 可视化脚本（自带）
perl bin/Plot_OnePop.pl    # 单群体
perl bin/Plot_MultiPop.pl  # 多群体对比

# r² 含义
# 1.0 = 完全连锁不平衡（两SNP总是一起出现）
# 0.0 = 完全独立（随机组合）
# 通常 r² 半衰距离（降至 0.5 的距离）反映重组率

# LD 衰减快 → 群体历史上重组多，或经历瓶颈后扩张
# LD 衰减慢 → 近交系/驯化品种，选择扫荡区域