380_单细胞基因集评分方法比较¶

一句话说明¶

基因集评分是把"一组基因"的集体表达信号压缩成"一个数"，不同方法各有优劣，选对方法很重要。

核心知识点¶

要点1：为什么需要基因集评分¶

单个基因表达在单细胞中噪声极大（dropout问题）
一组相关基因的集体信号比单基因更稳健
把高维基因空间压缩为低维的生物学概念（通路、细胞状态）

要点2：六种主流方法对比¶

方法	原理	优点	缺点
AddModuleScore	目标基因均值-背景均值	快速简单	不考虑基因协调性
AUCell	基因在Top排名中的AUC	对dropout稳健	需调参（Top比例）
ssGSEA	排序累积和的面积	统计严格	慢，不太适合单细胞
VISION	消除批次后的局部评分	考虑空间结构	复杂
UCell	修正的Mann-Whitney统计量	单细胞专优化	新工具，文献较少
decoupler	多方法框架（ULM/MLM）	统一接口，方法多	需要网络数据库

要点3：UCell——单细胞专属评分¶

2021年发表于 Bioinformatics
基于 Mann-Whitney U 统计量，不受细胞总read数影响
对稀疏矩阵优化，速度快
推荐作为 AddModuleScore 的升级替代

要点4：GSVA 适合伪批量（pseudo-bulk）分析¶

GSVA原本为RNA-seq矩阵设计
单细胞数据不建议直接用GSVA（稀疏性问题）
如果先做伪批量（把同细胞类型的reads合并），再用GSVA效果更好

要点5：选择指南¶

场景A：快速探索 → AddModuleScore（Seurat内置，最省事）
场景B：稀疏性问题严重 → AUCell 或 UCell
场景C：发表级别严格分析 → UCell + 多方法交叉验证
场景D：有完整调控网络数据 → decoupler（ULM/MLM）
场景E：空间转录组联合 → VISION

实战代码¶

四种方法同时计算并比较¶

library(Seurat)      # 单细胞分析
library(UCell)       # 单细胞优化的基因集评分
library(AUCell)      # AUC评分
library(ggplot2)
library(dplyr)

# ============================================================
# 准备测试基因集（肿瘤相关，典型应用场景）
# ============================================================
gene_sets <- list(
  Hypoxia     = c("HIF1A", "VEGFA", "SLC2A1", "LDHA", "PGK1", "PGAM1", "ENO1"),
  Proliferation = c("MKI67", "TOP2A", "PCNA", "MCM2", "CDK4", "CCND1"),
  EMT         = c("VIM", "CDH2", "SNAI1", "SNAI2", "ZEB1", "FN1", "COL1A1")
)

# ============================================================
# 方法1：AddModuleScore
# ============================================================
seurat_obj <- AddModuleScore(seurat_obj, features = gene_sets["Hypoxia"],
                              name = "AMS_Hypoxia", seed = 42)
seurat_obj <- AddModuleScore(seurat_obj, features = gene_sets["Proliferation"],
                              name = "AMS_Prolif", seed = 42)
seurat_obj <- AddModuleScore(seurat_obj, features = gene_sets["EMT"],
                              name = "AMS_EMT", seed = 42)

# ============================================================
# 方法2：UCell（推荐替代）
# ============================================================
seurat_obj <- AddModuleScore_UCell(
  seurat_obj,
  features = gene_sets,     # 一次性输入所有基因集
  maxRank = 1500,           # 参与排名的最大基因数（影响敏感度）
  ncores = 4
)
# 结果存储在 meta.data 中，列名格式：基因集名_UCell

# ============================================================
# 方法3：AUCell
# ============================================================
# 提取表达矩阵
expr_matrix <- GetAssayData(seurat_obj, layer = "counts")  # 使用raw counts

# 计算基因排名（每个细胞内基因按表达量排序）
cells_rankings <- AUCell_buildRankings(
  expr_matrix,
  nCores = 4,
  plotStats = FALSE
)

# 计算 AUC 分数
cells_AUC <- AUCell_calcAUC(
  gene_sets,               # 基因集
  cells_rankings,          # 细胞内基因排名
  aucMaxRank = ceiling(0.05 * nrow(cells_rankings))  # 用前5%基因计算AUC
)

# 提取分数矩阵
aucell_scores <- as.data.frame(t(getAUC(cells_AUC)))   # 细胞×基因集

# 将AUCell分数添加到元数据
seurat_obj$AUCell_Hypoxia <- aucell_scores[rownames(seurat_obj@meta.data), "Hypoxia"]
seurat_obj$AUCell_EMT     <- aucell_scores[rownames(seurat_obj@meta.data), "EMT"]

# ============================================================
# 方法比较：相关性分析
# ============================================================
method_compare <- seurat_obj@meta.data %>%
  select(
    AMS_Hypoxia1,     # AddModuleScore结果（注意有数字后缀）
    Hypoxia_UCell,    # UCell结果
    AUCell_Hypoxia    # AUCell结果
  ) %>%
  rename(
    AddModuleScore = AMS_Hypoxia1,
    UCell = Hypoxia_UCell,
    AUCell = AUCell_Hypoxia
  )

# 计算三种方法之间的相关性
cor_matrix <- cor(method_compare, method = "spearman", use = "pairwise.complete.obs")
print(round(cor_matrix, 3))    # 通常三种方法高度相关（r>0.8）

# 散点图比较
library(GGally)
GGally::ggpairs(method_compare,
  upper = list(continuous = wrap("cor", method = "spearman")),
  title = "三种基因集评分方法比较（缺氧通路）"
)

Scanpy/Python 版本（decoupler统一框架）¶

import decoupler as dc   # pip install decoupler
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

adata = sc.read_h5ad("preprocessed.h5ad")

# 从 MSigDB 获取通路基因集（需要 pydeseq2 或直接下载gmt）
# 简单起见，手动定义基因集
gene_sets_dict = {
    "Hypoxia":       ["HIF1A", "VEGFA", "SLC2A1", "LDHA", "PGK1"],
    "Proliferation": ["MKI67", "TOP2A", "PCNA", "MCM2", "CDK4"],
    "EMT":           ["VIM", "CDH2", "SNAI1", "ZEB1", "FN1"]
}

# 转换为 decoupler 需要的 net DataFrame
net_rows = []
for pathway, genes in gene_sets_dict.items():
    for gene in genes:
        if gene in adata.var_names:    # 只保留数据中存在的基因
            net_rows.append({"source": pathway, "target": gene, "weight": 1.0})
net_df = pd.DataFrame(net_rows)

# ============================================================
# 方法1：ULM（单变量线性模型）
# ============================================================
dc.run_ulm(mat=adata, net=net_df, source="source", target="target",
           weight="weight", use_raw=False)
ulm_scores = pd.DataFrame(adata.obsm["ulm_estimate"],
                           index=adata.obs_names,
                           columns=adata.uns["ulm_estimate_names"])

# ============================================================
# 方法2：MLM（多变量线性模型）
# ============================================================
dc.run_mlm(mat=adata, net=net_df, source="source", target="target",
           weight="weight", use_raw=False)
mlm_scores = pd.DataFrame(adata.obsm["mlm_estimate"],
                           index=adata.obs_names,
                           columns=adata.uns["mlm_estimate_names"])

# ============================================================
# 方法比较图
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ULM 分数
axes[0].scatter(ulm_scores["Hypoxia"], ulm_scores["Proliferation"],
                alpha=0.3, s=5, c="steelblue")
axes[0].set_xlabel("缺氧评分 (ULM)")
axes[0].set_ylabel("增殖评分 (ULM)")
axes[0].set_title("ULM 基因集评分")

# MLM 分数
axes[1].scatter(mlm_scores["Hypoxia"], mlm_scores["Proliferation"],
                alpha=0.3, s=5, c="coral")
axes[1].set_xlabel("缺氧评分 (MLM)")
axes[1].set_ylabel("增殖评分 (MLM)")
axes[1].set_title("MLM 基因集评分")

plt.tight_layout()
plt.savefig("method_comparison.pdf", dpi=300)

# 方法间相关性
from scipy.stats import spearmanr
r, p = spearmanr(ulm_scores["Hypoxia"], mlm_scores["Hypoxia"])
print(f"ULM vs MLM Spearman r = {r:.3f}, p = {p:.2e}")

面试常问点¶

AddModuleScore的背景基因是怎么选的？ 按表达水平分100个bins，从目标基因所在的同一bin中随机采样，目的是消除表达量偏差
UCell相比AUCell的优势？ UCell用Mann-Whitney统计量，在极稀疏的单细胞数据中比AUCell更稳定
基因集太小（<10个基因）时哪种方法更好？ UCell，因为AUCell在基因集小时AUC估计不准

速查表¶

场景	首选方法	备选方法
快速探索	AddModuleScore	UCell
稀疏数据优化	UCell	AUCell
有调控网络	decoupler ULM	decoupler MLM
伪批量后	GSVA	ssGSEA
空间数据	VISION	UCell+空间坐标