多物种比较转录组¶

一句话概述¶

跨物种RNA-seq比较分析——通过直系同源基因（orthogroup）映射，比较不同物种间基因表达模式的保守性与分歧，揭示表达进化规律和物种特异性适应。

核心知识点总览¶

知识点	关键内容	重要程度
直系同源鉴定	OrthoFinder/OMA/eggNOG同源基因识别	⭐⭐⭐⭐⭐
表达量可比性	跨物种归一化(TPM/FPKM对齐)	⭐⭐⭐⭐⭐
表达保守性	Spearman相关/Tau指数跨物种	⭐⭐⭐⭐
差异表达进化	物种特异性上调/下调基因	⭐⭐⭐⭐
组织特异性保守	同源组织间表达模式比较	⭐⭐⭐
表达进化速率	表达散度与序列散度关系	⭐⭐⭐
系统发育方法	表达树/进化模型	⭐⭐⭐
基因复制后功能分化	Paralog表达分歧	⭐⭐⭐

各步骤详解¶

第一步：直系同源基因鉴定¶

白话解释： 跨物种比较表达量之前，必须先确定哪些基因是"同一个基因"在不同物种中的对应版本——即直系同源基因（orthologs）。OrthoFinder是目前最准确的工具，它从蛋白序列相似性出发，综合利用基因树和物种树来判断同源关系。

技术细节：

# === OrthoFinder直系同源鉴定 ===

# 准备各物种的蛋白序列文件
mkdir proteomes
cp human.pep.fa proteomes/Human.fa
cp mouse.pep.fa proteomes/Mouse.fa
cp zebrafish.pep.fa proteomes/Zebrafish.fa
cp chicken.pep.fa proteomes/Chicken.fa

# 运行OrthoFinder
orthofinder -f proteomes/ -t 16 -a 8

# 关键输出文件：
# Results/Orthogroups/Orthogroups.tsv          - 所有orthogroup
# Results/Orthogroups/Orthogroups_SingleCopyOrthologues.txt - 1:1直系同源
# Results/Orthologues/Human__v__Mouse.tsv      - 成对同源关系
# Results/Gene_Trees/                          - 每个OG的基因树

# 提取1:1同源对（最适合跨物种比较）
# 1:1同源：每个物种只有一个基因的orthogroup
cat Results/Orthogroups/Orthogroups_SingleCopyOrthologues.txt | wc -l

# 解析OrthoFinder结果，构建跨物种基因映射
import pandas as pd

# 读取orthogroups
og = pd.read_csv("Orthogroups.tsv", sep="\t", index_col=0)

# 提取1:1:1:1同源基因（所有物种各一个）
one_to_one = og.dropna()  # 没有缺失物种
one_to_one = one_to_one[one_to_one.map(lambda x: ',' not in str(x)).all(axis=1)]
# 排除含多个paralogs的行
# 注意：pandas >= 2.1 中 applymap 已弃用，改用 map

print(f"1:1:1:1 orthologs: {len(one_to_one)}")
# 通常在脊椎动物间有~10000-13000个1:1同源

# 创建基因ID映射表
gene_mapping = pd.DataFrame({
    'Human': one_to_one['Human'].str.strip(),
    'Mouse': one_to_one['Mouse'].str.strip(),
    'Zebrafish': one_to_one['Zebrafish'].str.strip(),
    'Chicken': one_to_one['Chicken'].str.strip()
})
gene_mapping.to_csv("ortho_gene_mapping.csv", index=True)

第二步：跨物种表达量归一化¶

白话解释： 不同物种的基因组大小、基因长度、测序策略不同，直接比较TPM值不太合理。需要做额外的归一化来让跨物种表达量具有可比性。常用方法包括：quantile归一化、中位数对齐、或者只比较排名（非参方法）。

技术细节：

# === 跨物种表达量准备与归一化 ===
library(dplyr)

# 读取各物种的TPM矩阵
human_tpm <- read.csv("human_tpm.csv", row.names = 1)
mouse_tpm <- read.csv("mouse_tpm.csv", row.names = 1)
zebrafish_tpm <- read.csv("zebrafish_tpm.csv", row.names = 1)

# 读取基因映射
mapping <- read.csv("ortho_gene_mapping.csv", row.names = 1)

# 提取1:1同源基因的表达量
human_ortho <- human_tpm[mapping$Human, , drop = FALSE]
mouse_ortho <- mouse_tpm[mapping$Mouse, , drop = FALSE]
zebrafish_ortho <- zebrafish_tpm[mapping$Zebrafish, , drop = FALSE]

# 统一行名为orthogroup ID
rownames(human_ortho) <- rownames(mapping)
rownames(mouse_ortho) <- rownames(mapping)
rownames(zebrafish_ortho) <- rownames(mapping)

# log2转换
human_log <- log2(human_ortho + 1)
mouse_log <- log2(mouse_ortho + 1)
zebrafish_log <- log2(zebrafish_ortho + 1)

# 方法1：Quantile归一化跨物种
library(preprocessCore)
# 合并后统一quantile normalize
all_expr <- cbind(human_log, mouse_log, zebrafish_log)
all_qn <- normalize.quantiles(as.matrix(all_expr))
colnames(all_qn) <- colnames(all_expr)
rownames(all_qn) <- rownames(all_expr)

# 方法2：中位数对齐
# 每个物种的每个样本减去中位数，加上全局中位数
global_median <- median(c(as.matrix(human_log), as.matrix(mouse_log)))
human_aligned <- sweep(human_log, 2, apply(human_log, 2, median) - global_median)
mouse_aligned <- sweep(mouse_log, 2, apply(mouse_log, 2, median) - global_median)

第三步：表达保守性分析¶

白话解释： 核心问题：同源基因在不同物种的同源组织中，表达模式有多相似？通过计算跨物种表达相关性，可以量化表达进化的保守程度。一般来说，管家基因（housekeeping）高度保守，而免疫/生殖相关基因进化较快。

技术细节：

# === 表达保守性分析 ===

# 假设有多组织数据：brain, liver, heart, kidney
# 计算同源组织间的表达相关性

# 人与鼠brain表达相关
cor_brain <- cor(human_log[, "brain"], mouse_log[, "brain"],
                  method = "spearman")
cat(sprintf("Human-Mouse brain Spearman r = %.3f\n", cor_brain))

# 全组织表达谱相关矩阵
tissues <- c("brain", "liver", "heart", "kidney")
cor_matrix <- matrix(NA, length(tissues), length(tissues))
rownames(cor_matrix) <- paste0("Human_", tissues)
colnames(cor_matrix) <- paste0("Mouse_", tissues)

for (i in seq_along(tissues)) {
  for (j in seq_along(tissues)) {
    cor_matrix[i, j] <- cor(human_log[, tissues[i]], mouse_log[, tissues[j]],
                             method = "spearman", use = "complete.obs")
  }
}

# 热图可视化
library(pheatmap)
pheatmap(cor_matrix, display_numbers = TRUE,
         main = "Cross-species tissue expression correlation")
# 期望：对角线最高（同源组织最相似）

# 每个基因的跨物种保守性评分
gene_conservation <- sapply(rownames(human_log), function(gene) {
  h_expr <- as.numeric(human_log[gene, tissues])
  m_expr <- as.numeric(mouse_log[gene, tissues])
  cor(h_expr, m_expr, method = "spearman")
})

# 保守基因 vs 快进化基因
highly_conserved <- names(sort(gene_conservation, decreasing = TRUE)[1:500])
rapidly_evolving <- names(sort(gene_conservation)[1:500])

# 功能富集
library(clusterProfiler)
ego_conserved <- enrichGO(gene = mapping$Human[rownames(mapping) %in% highly_conserved],
                           OrgDb = org.Hs.eg.db, keyType = "SYMBOL", ont = "BP")
ego_divergent <- enrichGO(gene = mapping$Human[rownames(mapping) %in% rapidly_evolving],
                           OrgDb = org.Hs.eg.db, keyType = "SYMBOL", ont = "BP")

第四步：表达进化速率与选择压力¶

白话解释： 可以类比序列进化的Ka/Ks（非同义/同义替换比），定义"表达散度"来衡量基因表达进化的速率。表达变化快的基因可能经历了表达水平的正选择（如适应新环境），而表达高度保守的基因可能受到纯化选择约束。

技术细节：

# === 表达进化速率 ===

# 方法1：表达散度（Expression Divergence）
# 使用1-Spearman r作为表达距离
expr_divergence <- 1 - gene_conservation

# 方法2：EVE模型（Expression Variance and Evolution）
# EVE区分种内变异（noise）和种间进化（真实变化）
# 类似于McDonald-Kreitman检验的表达版本

# 方法3：Ornstein-Uhlenbeck模型
# 使用phylogenetic方法建模表达进化
# 检验是否有表达最优值的约束（稳定化选择）

# 表达散度 vs 序列散度的关系
# 读取Ka/Ks值
kaks <- read.csv("pairwise_kaks.csv", row.names = 1)
merged_evol <- data.frame(
  gene = rownames(kaks),
  ka_ks = kaks$Ka_Ks,
  expr_div = expr_divergence[rownames(kaks)]
)

# 散点图
library(ggplot2)
ggplot(merged_evol, aes(x = ka_ks, y = expr_div)) +
  geom_point(alpha = 0.3, size = 0.5) +
  geom_smooth(method = "lm") +
  labs(x = "Ka/Ks (sequence divergence)",
       y = "Expression divergence (1 - Spearman r)") +
  theme_minimal()

# 通常正相关但弱——表达和序列进化部分解耦
cor.test(merged_evol$ka_ks, merged_evol$expr_div, method = "spearman")

第四步补充：正选择检测（PAML/codeml）¶

白话解释： 除了表达水平的进化，编码序列本身的进化也很重要。PAML（Phylogenetic Analysis by Maximum Likelihood）中的codeml程序通过计算非同义替换率(dN)与同义替换率(dS)的比值omega(dN/dS)来检测正选择：omega>1表示正选择（蛋白功能改变被选择保留），omega<1表示纯化选择（保守），omega=1表示中性进化。

技术细节：

# === PAML/codeml 正选择检测 ===

# 前提：需要多物种CDS比对和物种树

# 1. 准备CDS比对（codon-aware alignment）
# 使用PRANK或MACSE做保留密码子结构的比对
prank -d=orthogroup.cds.fa -t=species_tree.nwk -o=aligned -codon

# 2. 转换为PAML格式
# codeml需要phylip格式的比对文件

# 3. 编写codeml控制文件（位点模型: M7 vs M8）
cat > codeml_site.ctl << 'EOF'
      seqfile = aligned.phy       * 序列比对文件
     treefile = species_tree.nwk  * 物种树
      outfile = site_results.txt  * 输出文件
        noisy = 3
      verbose = 1
      runmode = 0                 * 0=用户提供树
      seqtype = 1                 * 1=密码子
    CodonFreq = 2                 * 2=F3x4模型
        model = 0                 * 0=所有分支同一omega
      NSsites = 7 8              * 7=M7(beta), 8=M8(beta+正选择)
        icode = 0                 * 0=标准遗传密码
    fix_omega = 0                 * 0=估计omega
        omega = 0.5               * 初始omega值
EOF

# 运行codeml
codeml codeml_site.ctl

# 4. 分支-位点模型（检测特定谱系的正选择）
cat > codeml_branchsite.ctl << 'EOF'
      seqfile = aligned.phy
     treefile = species_tree_marked.nwk  * 前景分支用#1标记
      outfile = branchsite_results.txt
      runmode = 0
      seqtype = 1
    CodonFreq = 2
        model = 2                 * 2=分支-位点模型
      NSsites = 2                 * 2=正选择
    fix_omega = 0                 * 备选假说：omega自由估计
        omega = 1.5
EOF
# 零假说：fix_omega=1, omega=1（不允许正选择）
# 用似然比检验(LRT)比较两个模型：2*(lnL1-lnL0)服从卡方分布(df=1)
# BEB(Bayes Empirical Bayes)方法识别具体受正选择的位点

codeml codeml_branchsite.ctl

# === 解析codeml结果 ===
# M7 vs M8比较
# 从输出文件提取lnL值
# LRT statistic = 2 * (lnL_M8 - lnL_M7)
# 自由度 = 2（M8比M7多2个参数）
# p = pchisq(LRT, df=2, lower.tail=FALSE)

# BEB结果中 posterior probability > 0.95 的位点为正选择位点

第五步：物种特异性表达变化¶

白话解释： 哪些基因在某个物种中"独特地"改变了表达？比如人类特有的大脑高表达基因可能与人类认知进化相关。通过比较系统发育分支上的表达变化，可以识别物种特异性的表达适应。

技术细节：

# === 物种特异性表达变化检测 ===

# 方法：在系统发育背景下识别某物种偏离祖先状态的基因

# 简单方法：Z-score
# 对每个基因，计算人类表达在所有物种中的Z-score
# Z > 2: 人类特异性高表达
# Z < -2: 人类特异性低表达

species_mean_expr <- data.frame(
  Human = rowMeans(human_log),
  Mouse = rowMeans(mouse_log),
  Zebrafish = rowMeans(zebrafish_log)
)

# 对每个基因计算人类的Z-score
species_mean_expr$human_zscore <- scale(species_mean_expr$Human)

# 人类特异性高表达基因（相对于其他物种）
for (i in 1:nrow(species_mean_expr)) {
  vals <- as.numeric(species_mean_expr[i, 1:3])
  species_mean_expr$human_zscore[i] <- (vals[1] - mean(vals)) / sd(vals)
}

human_specific_up <- rownames(species_mean_expr)[species_mean_expr$human_zscore > 2]
human_specific_down <- rownames(species_mean_expr)[species_mean_expr$human_zscore < -2]

cat(sprintf("Human-specific upregulated: %d\n", length(human_specific_up)))
cat(sprintf("Human-specific downregulated: %d\n", length(human_specific_down)))

第六步：组织表达树构建¶

白话解释： 有趣的问题是：相同组织在不同物种间更相似，还是同物种的不同组织更相似？如果把每个"物种-组织"组合当作一个样本做聚类，是按组织还是按物种先聚在一起？这揭示了组织身份vs物种身份哪个主导转录组模式。

技术细节：

# === 表达距离树 ===

# 合并所有物种-组织的平均表达谱
all_profiles <- cbind(
  human_log[, tissues],
  mouse_log[, tissues],
  zebrafish_log[, tissues]
)
colnames(all_profiles) <- c(
  paste0("Human_", tissues),
  paste0("Mouse_", tissues),
  paste0("Zebrafish_", tissues)
)

# 计算距离矩阵
dist_matrix <- as.dist(1 - cor(all_profiles, method = "spearman"))

# 层次聚类
hc <- hclust(dist_matrix, method = "average")
plot(hc, main = "Expression-based clustering of species-tissues",
     xlab = "", sub = "")

# 观察聚类模式：
# 如果Brain_Human先与Brain_Mouse聚，说明脑的表达谱跨物种保守
# 如果Human_Brain先与Human_Liver聚，说明物种效应>组织效应
# 通常结果：高度保守组织（脑、睾丸）跨物种聚类，其他按物种聚类

第七步：基因复制后的表达分化¶

白话解释： 基因复制（duplication）产生paralogs后，两个拷贝可能经历表达分化——一个保留原始表达模式，另一个获得新的表达模式（如组织特异性变化）。这是新功能产生的重要机制。

技术细节：

# === Paralog表达分化 ===

# 提取含paralogs的orthogroups
# 从OrthoFinder结果中找人类有两个基因的orthogroup
og_all <- read.table("Orthogroups.tsv", sep = "\t", header = TRUE, row.names = 1)
human_paralogs <- og_all[grepl(",", og_all$Human), ]

# 对每对paralog计算表达相关性
paralog_cors <- numeric()
for (i in 1:nrow(human_paralogs)) {
  genes <- strsplit(human_paralogs$Human[i], ", ")[[1]]
  if (length(genes) == 2 && all(genes %in% rownames(human_tpm))) {
    r <- cor(as.numeric(human_tpm[genes[1], ]),
             as.numeric(human_tpm[genes[2], ]),
             method = "spearman")
    paralog_cors <- c(paralog_cors, r)
  }
}

# 与1:1同源基因的保守性比较
hist(paralog_cors, breaks = 50, main = "Paralog Expression Correlation",
     xlab = "Spearman r", col = "lightblue")
abline(v = median(gene_conservation), col = "red", lty = 2,
       lwd = 2)  # 同源基因保守性作为参考

# 表达高度分化的paralog（neo-functionalization的候选）

实战命令速查¶

# 同源基因鉴定
orthofinder -f proteomes/ -t 16 -a 8

# 各物种RNA-seq标准分析
hisat2 -x species_index -1 R1.fq.gz -2 R2.fq.gz | samtools sort -o aligned.bam
stringtie aligned.bam -G species.gtf -e -B -o quant.gtf
# 或salmon定量
salmon quant -i species_index -l A -1 R1.fq.gz -2 R2.fq.gz -o salmon_out

面试常问点¶

Q1: 跨物种比较RNA-seq时，为什么不能直接比较TPM值？¶

A: TPM在单物种内可比（归一化了基因长度和库大小），但跨物种存在多重不可比因素：(1) 基因组注释完整度不同（注释差的物种TPM被高估）；(2) 基因长度定义不同（UTR注释差异）；(3) 转录组复杂度不同；(4) 同源基因可能有不同数量的paralog分散表达。需要额外的跨物种归一化步骤。

Q2: 1:1同源基因和多对多同源有什么区别？如何处理？¶

A: 1:1同源（无复制事件）最适合直接表达比较。多对多关系（如人有2个基因对应鼠的3个基因）通常因基因复制产生。处理方法：(1) 只用1:1同源做严格比较；(2) 对多对多关系取所有paralogs的总表达量或最高表达量代表该orthogroup；(3) 分别分析每个paralog的表达分化。

Q3: 跨物种比较中如何控制组织采样差异？¶

A: 不同物种的"肝脏"不可能完全对应（细胞组成、发育阶段差异）。应对策略：(1) 使用发育时间匹配的样本（如Theiler stage）；(2) 使用单细胞数据对齐细胞类型后比较；(3) 使用大量样本取平均减少个体差异；(4) 使用marker基因验证组织对应关系。

Q4: 基因表达进化遵循什么模型？¶

A: 多数基因表达进化遵循类似布朗运动的随机漂变+稳定化选择（Ornstein-Uhlenbeck模型）——表达在一个最优值附近波动。管家基因的OU约束强（表达高度保守），而调控复杂的基因（如转录因子）约束弱、进化快。少数基因表现出方向性选择（正选择），如人类大脑相关基因。

Q5: 表达进化速率与什么因素相关？¶

A: 与表达进化速率正相关的因素：(1) 蛋白质序列进化速率（Ka/Ks）；(2) 基因组位置的重组率；(3) 组织特异性（特异性强的进化快）；(4) 基因年龄（新基因进化快）。负相关因素：(1) 表达水平（高表达基因保守）；(2) 蛋白互作伙伴数（hub基因保守）；(3) 发育必需性。

易错点¶

1. 同源关系判断错误¶

使用简单的reciprocal best BLAST（RBH）而非OrthoFinder等现代方法，会遗漏many-to-many orthology和in-paralogs。特别是在全基因组复制（WGD）发生的物种对中，RBH效果很差。

2. 批次效应伪装成物种效应¶

不同物种的RNA-seq如果在不同实验室/批次完成，批次效应可能与物种效应完全混淆。最佳方案是同一实验条件处理所有物种样本，或使用spike-in进行跨批次校正。

3. 基因组注释质量差异被忽略¶

注释不完整的物种会有更多"未检测到"的基因，导致表达保守性被低估。应只使用在所有比较物种中都有可靠注释的基因。

4. 忽略基因复制对比较的影响¶

如果某基因在一个物种中有2个paralogs各表达50%，在另一物种中只有1个表达100%，直接比较单基因的表达量会得出"表达降低"的错误结论。应将paralogs的总表达量作为比较单位。

5. 进化距离过大时的分析假设违反¶

比较进化距离极远的物种（如人与果蝇）时，组织同源性难以确定，序列同源关系也可能有误。应限制在合理的进化距离内进行比较。

补充知识¶

跨物种比较资源¶

Bgee：跨物种基因表达数据库（标准化条件）
Expression Atlas：EBI多物种表达数据
OMA Browser：高质量同源关系数据库
PhyloPro：基因表达进化可视化

前沿方向¶

单细胞跨物种比较（SAMap/scAlign）
基因调控网络的跨物种保守性
增强子活性的跨物种比较
非编码RNA表达的进化保守性

引用推荐¶

OrthoFinder: Emms & Kelly, Genome Biology, 2019
跨物种表达进化: Brawand et al., Nature, 2011
表达进化模型: Bedford & Hartl, PNAS, 2009
PAML/codeml: Yang, Molecular Biology and Evolution, 2007
PAML正选择入门: Álvarez-Carretero et al., Molecular Biology and Evolution, 2023

多物种比较转录组¶

一句话概述¶

核心知识点总览¶

各步骤详解¶

第一步：直系同源基因鉴定¶

第二步：跨物种表达量归一化¶

第三步：表达保守性分析¶

第四步：表达进化速率与选择压力¶

第四步补充：正选择检测（PAML/codeml）¶

第五步：物种特异性表达变化¶

第六步：组织表达树构建¶

第七步：基因复制后的表达分化¶

实战命令速查¶

面试常问点¶

Q1: 跨物种比较RNA-seq时，为什么不能直接比较TPM值？¶

Q2: 1:1同源基因和多对多同源有什么区别？如何处理？¶

Q3: 跨物种比较中如何控制组织采样差异？¶

Q4: 基因表达进化遵循什么模型？¶

Q5: 表达进化速率与什么因素相关？¶

易错点¶

1. 同源关系判断错误¶

2. 批次效应伪装成物种效应¶

3. 基因组注释质量差异被忽略¶

4. 忽略基因复制对比较的影响¶

5. 进化距离过大时的分析假设违反¶

补充知识¶

跨物种比较资源¶

前沿方向¶

引用推荐¶

📚 相关文章推荐