群体间基因流检测 — TreeMix¶

一句话概述：TreeMix 在群体分化树的基础上添加"迁移边"（migration edges），同时推断群体分裂历史和基因流事件，是群体遗传学中检测历史基因交流的标准工具。

核心知识点速查表¶

概念	白话解释
基因流（Gene Flow）	不同群体之间通过杂交/迁移交换遗传物质
迁移边（Migration Edge）	TreeMix 图中表示基因流的箭头
混合图（Admixture Graph）	既有分裂又有混合的群体关系图
等位基因频率漂变	群体分化导致等位基因频率随机改变
残差矩阵	模型拟合不好的群体对（残差大=需要加迁移边）
OptM	确定最佳迁移边数量的 R 包
D-统计量	ABBA-BABA 检验，检测基因流的互补方法
f3/f4 统计量	更精确的基因流检验统计量

一、TreeMix 原理（白话版）¶

比喻：想象一棵家谱树，正常情况下每代只有"分支"（一个族群分成两个）。但现实中不同家族之间可能会通婚（基因流）。TreeMix 就是先画出这棵"分支树"，然后检查哪些"家族"之间的遗传相似度高得不正常——如果高得不正常，就在它们之间画一条"通婚线"（迁移边）。

二、分析流程¶

第 1 步：安装¶

# conda 安装
conda install -c bioconda treemix  # 安装TreeMix

# 或源码编译
# 需要依赖：GSL, Boost
sudo apt-get install libgsl-dev libboost-all-dev  # 安装依赖
wget https://bitbucket.org/nygcresearch/treemix/downloads/treemix-1.13.tar.gz
tar -xzf treemix-1.13.tar.gz
cd treemix-1.13
./configure && make && sudo make install

第 2 步：准备输入文件¶

# TreeMix 需要特殊的等位基因频率输入格式
# 每行一个 SNP，每列一个群体，格式为 "ref_count,alt_count"

# 从 VCF 生成 TreeMix 输入
# 方法1：用 plink + 自定义脚本

# 先生成 PLINK 频率文件
plink --vcf input.vcf.gz \
      --freq \                     # 计算等位基因频率
      --family \                    # 按群体（FID）分组
      --out allele_freq             # 输出前缀

# 方法2：用 stacks 的 populations 模块（如果用的是RAD数据）
# populations -P stacks_dir -M popmap.txt --treemix

# 方法3：直接从 VCF 用 Python 脚本转换
cat << 'PYEOF' > vcf_to_treemix.py
import gzip
import sys
from collections import defaultdict

# 读取群体分配文件（样本名 群体名）
pop_map = {}                                        # 样本->群体映射
with open(sys.argv[2]) as f:                        # 打开群体文件
    for line in f:
        sample, pop = line.strip().split()          # 解析样本和群体
        pop_map[sample] = pop

# 读取VCF并计算频率
with gzip.open(sys.argv[1], 'rt') as f:            # 打开VCF
    for line in f:
        if line.startswith('##'):                   # 跳过注释行
            continue
        if line.startswith('#CHROM'):               # 表头行
            samples = line.strip().split('\t')[9:]  # 获取样本名
            pops = sorted(set(pop_map.values()))    # 获取群体列表
            print(' '.join(pops))                   # 打印群体名表头
            continue
        fields = line.strip().split('\t')
        counts = defaultdict(lambda: [0, 0])        # 每个群体的ref/alt计数
        for i, sample in enumerate(samples):
            gt = fields[9+i].split(':')[0]          # 获取基因型
            pop = pop_map.get(sample, None)
            if pop is None: continue
            alleles = gt.replace('|', '/').split('/')
            for a in alleles:
                if a == '0': counts[pop][0] += 1     # ref计数
                elif a == '1': counts[pop][1] += 1   # alt计数
        # 输出格式：ref,alt  ref,alt  ...
        out = []
        for pop in pops:
            out.append(f"{counts[pop][0]},{counts[pop][1]}")
        print(' '.join(out))
PYEOF

python3 vcf_to_treemix.py input.vcf.gz popmap.txt > treemix_input.txt
gzip treemix_input.txt  # 压缩（TreeMix支持gz输入）

第 3 步：运行 TreeMix（不同迁移边数量）¶

# 运行 m=0 到 m=5 的模型（0到5条迁移边）
for m in $(seq 0 5); do
    treemix -i treemix_input.txt.gz \  # 输入文件
            -m $m \                      # 迁移边数量
            -o treemix_m${m} \          # 输出前缀
            -root outgroup_pop \         # 指定外群（重要！）
            -bootstrap \                 # 做bootstrap
            -k 500 \                     # SNP block大小（考虑LD）
            -seed 12345                  # 随机种子（可重复）
done

# 输出文件说明：
# treemix_m0.treeout.gz   — 树拓扑（Newick格式）
# treemix_m0.vertices.gz  — 节点信息
# treemix_m0.edges.gz     — 边信息
# treemix_m0.covse.gz     — 协方差矩阵标准误
# treemix_m0.modelcov.gz  — 模型拟合的协方差矩阵
# treemix_m0.llik         — 对数似然值

第 4 步：确定最佳迁移边数量（OptM）¶

# ========== R 脚本：用 OptM 确定最佳 m ==========
# install.packages("OptM")
library(OptM)                                       # 加载OptM包

# 读取所有TreeMix运行的似然值
optm_result <- optM("treemix_m",                   # 输出文件前缀
                    method="Evanno")                # 使用Evanno方法

# 画图
pdf("OptM_result.pdf", width=8, height=6)
plot_optM(optm_result)                              # 画Δm图
dev.off()

# Δm 最高点对应的 m 就是最佳迁移边数量
# 类似于 ADMIXTURE 的 CV error 选 K

第 5 步：可视化 TreeMix 结果¶

# ========== R 脚本：画 TreeMix 图 ==========
# TreeMix 自带R脚本
source("treemix-1.13/src/plotting_funcs.R")        # 加载画图函数

# 画迁移树（最常用的图）
pdf("treemix_m3.pdf", width=10, height=8)
plot_tree("treemix_m3")                             # 画带迁移边的树
# 迁移边的颜色表示迁移比例（从黄色到红色）
# 箭头方向表示基因流方向
dev.off()

# 画残差矩阵（检查模型拟合）
pdf("treemix_m3_residuals.pdf", width=8, height=8)
plot_resid("treemix_m3",                            # 画残差图
           pop_order="pop_order.txt")               # 群体排列顺序
dev.off()
# 正残差（蓝色）= 实际比模型更相似 → 可能缺少迁移边
# 负残差（红色）= 实际比模型更不同 → 模型可能过拟合

三、互补方法：D-统计量（ABBA-BABA检验）¶

# D-统计量是检测基因流的另一种方法
# 比 TreeMix 更简单直接，常用于验证

# 用 ADMIXTOOLS 计算 D-统计量
# 安装
conda install -c bioconda admixtools  # 安装ADMIXTOOLS

# 准备参数文件
# D(P1, P2; P3, O) 检验 P3 是否与 P1 或 P2 存在基因流
# D > 0 → P3 与 P2 基因流
# D < 0 → P3 与 P1 基因流
# D ≈ 0 → 无基因流

cat << 'EOF' > d_stat_params.txt
genotypename: mydata.geno
snpname: mydata.snp
indivname: mydata.ind
popfilename: pop_combinations.txt
EOF

# pop_combinations.txt 格式：
# P1  P2  P3  Outgroup
echo "PopA PopB PopC Outgroup" > pop_combinations.txt

# 运行 D-统计量
qpDstat -p d_stat_params.txt > d_stat_results.txt

# 结果解读：
# |Z-score| > 3 → 基因流显著

四、常见报错与解决¶

报错信息	原因	解决方法
`ERROR: no data`	输入文件格式错误	检查空格分隔和 ref,alt 格式
`no root specified`	没有指定外群	用 `-root` 指定外群
迁移边方向不合理	TreeMix 已知限制	用 D-stats 验证方向
加迁移边后似然值不增加	迁移边数量已够	选择较少的 m
残差图仍有大残差	群体历史太复杂	考虑用 qpGraph 或 AdmixtureBayes
`Segmentation fault`	群体数太多	合并近似群体或减少群体数

五、面试高频问题¶

Q1：TreeMix 和 D-统计量的区别？

TreeMix 推断整体群体关系图（所有群体同时分析），D-统计量检验特定四个群体间的基因流。D-统计量更简单、假设更少、统计检验更可靠；TreeMix 给出更完整的图景但可能有局部最优问题。

Q2：如何判断 TreeMix 结果可靠？

(1) 看残差矩阵是否干净（无大正残差）；(2) 多次运行检查一致性；(3) 用 OptM 选择 m；(4) 用 D-stats、f3/f4 等独立方法验证关键迁移事件。

Q3：TreeMix 的 -k 参数是什么意思？

-k 指定 SNP block 大小（默认 1），用于处理 LD。相邻 SNP 不独立（有LD），用 block jackknife 估计标准误。k=500 表示每 500 个 SNP 为一个 block。

六、速查表¶

# === TreeMix 速查 ===

# 运行（m条迁移边）
treemix -i input.txt.gz -m 3 -o out_m3 -root outgroup -bootstrap -k 500

# 批量运行 m=0~5
for m in $(seq 0 5); do treemix -i in.txt.gz -m $m -o out_m${m} -root OG -k 500; done

# 选择最佳m（R）
library(OptM); optM("out_m")

# 可视化（R）
source("plotting_funcs.R")
plot_tree("out_m3")
plot_resid("out_m3", pop_order="order.txt")

# D-统计量验证
qpDstat -p params.txt > d_results.txt
# |Z| > 3 = 显著基因流

参考资料：Pickrell & Pritchard 2012 (PLoS Genetics)、OptM (Fitak 2021)、TreeSwirl 2024 (MBE)、Speciation Genomics Tutorial