SCENIC transcription factor networks: a tool for single-cell gene regulatory network inference

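In one sentence

SCENIC+ (v1.0) is a Python tool for gene regulatory network inference that combines scRNA-seq and scATAC-seq data to identify the key transcription factors (TFs) that drive cell fate and their target-gene regulatory networks (regulons). The classic pySCENIC workflow used in the examples below needs only scRNA-seq data.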

Installation and setup

# Create a dedicated conda environment (Python 3.11, to avoid dependency conflicts)
conda create -n scenicplus python=3.11 -y
conda activate scenicplus

# Install SCENIC+ (latest version, from GitHub)
pip install git+https://github.com/aertslab/scenicplus

# Install pySCENIC (the classic scRNA-seq-only SCENIC, more stable)
pip install pyscenic

# Install the required dependencies
pip install scanpy anndata loompy dask

# Install arboreto (distributed GRN inference)
pip install arboreto

# Verify the installation
python -c "import pyscenic; print(pyscenic.__version__)"

Core usage

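# ── Classic pySCENIC three-step workflow (needs only scRNA-seq data) ──────
# Keep comments off the continued lines: text after a trailing "\" ends the command early.

# Step 1: GRN inference (GENIE3/GRNBoost2) to find candidate target genes for each TF.
#   --num_workers     8 parallel workers
#   --output          adjacency table (TF-gene pairs)
#   --method          GRNBoost2 is roughly 10x faster than GENIE3
#   expr_matrix.loom  input expression matrix (loom format)
#   tf_list.txt       transcription factor list (human or mouse TF names)
pyscenic grn \
  --num_workers 8 \
  --output adj.csv \
  --method grnboost2 \
  expr_matrix.loom \
  tf_list.txt

# Step 2: regulon identification (RcisTarget, based on cis-regulatory motif enrichment).
#   adj.csv                 adjacency table from step 1
#   database.feather        cis-regulatory ranking database (download from the SCENIC website)
#   --annotations_fname     motif annotation file
#   --expression_mtx_fname  expression matrix (used to validate target-gene expression)
#   --mode                  dask_multiprocessing runs the analysis in multiple processes
#   --output                regulon table
pyscenic ctx \
  adj.csv \
  database.feather \
  --annotations_fname motif_annotations.tbl \
  --expression_mtx_fname expr_matrix.loom \
  --mode "dask_multiprocessing" \
  --output regulons.csv \
  --num_workers 8

# Step 3: AUCell (computes a regulon activity score for every cell).
#   expr_matrix.loom  expression matrix
#   regulons.csv      regulon table from step 2
#   --output          AUC matrix (cells × regulons)
pyscenic aucell \
  expr_matrix.loom \
  regulons.csv \
  --output auc_mtx.csv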
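Before moving on to `ctx`, it can help to look at the adjacency table produced by step 1. The sketch below assumes the table has the columns `TF`, `target`, and `importance`; the toy values and the helper name `top_targets` are made up for illustration and are not part of pySCENIC.

```python
import io
import pandas as pd

# Toy stand-in for adj.csv (assumed columns: TF, target, importance).
TOY_ADJ = """TF,target,importance
FOXP3,IL2RA,12.5
FOXP3,CTLA4,9.1
FOXP3,ACTB,0.3
MYC,NCL,15.0
MYC,NPM1,7.2
"""

def top_targets(adj: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Return the n highest-importance targets for each TF (hypothetical helper)."""
    ranked = adj.sort_values(['TF', 'importance'], ascending=[True, False])
    return ranked.groupby('TF').head(n).reset_index(drop=True)

adj = pd.read_csv(io.StringIO(TOY_ADJ))   # for real data: pd.read_csv('adj.csv')
top = top_targets(adj, n=2)
print(top)
```

Low-importance pairs such as FOXP3/ACTB in the toy table are the kind of link that the motif-based pruning in step 2 is meant to remove.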

Parameter reference

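Parameter	Step	Description
`--num_workers`	grn/ctx	Number of parallel processes (typically the number of CPU cores)
`--method`	grn	GRN inference method (`grnboost2` is recommended; `genie3` is more accurate but slower)
`--thresholds`	ctx	Percentile thresholds used to select targets when building TF modules from the adjacencies (motif enrichment itself is filtered with `--nes_threshold`)
`--auc_threshold`	aucell	Fraction of the ranked genes used when computing the AUC (default 0.05)
`--num_workers`	aucell	Number of parallel workers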
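To make `--auc_threshold` concrete: AUCell ranks the genes of each cell by expression and measures how strongly a regulon's genes are concentrated within the top fraction of that ranking. The following is a simplified NumPy sketch of that idea for a single cell, not pySCENIC's implementation; `toy_aucell`, the normalization, and the toy data are assumptions made for illustration.

```python
import numpy as np

def toy_aucell(expr, gene_names, regulon_genes, auc_threshold=0.05):
    """Simplified recovery-curve AUC for one cell, scaled to [0, 1]."""
    order = np.argsort(-np.asarray(expr), kind='stable')      # highest expression first
    cutoff = max(1, int(round(auc_threshold * len(order))))   # number of top-ranked genes considered
    top_genes = np.asarray(gene_names)[order[:cutoff]]
    in_regulon = np.isin(top_genes, list(regulon_genes))
    recovery = np.cumsum(in_regulon)                           # regulon genes recovered up to each rank
    # Best possible curve: every regulon gene sits at the very top of the ranking.
    n_hits_max = min(len(regulon_genes), cutoff)
    max_auc = np.sum(np.minimum(np.arange(1, cutoff + 1), n_hits_max))
    return recovery.sum() / max_auc

# Toy cell: 20 genes, g0 is the most expressed and g19 the least.
gene_names = [f'g{i}' for i in range(20)]
expr = np.arange(20, 0, -1)

score_a = toy_aucell(expr, gene_names, {'g0', 'g1'}, auc_threshold=0.25)    # both genes at the top
score_b = toy_aucell(expr, gene_names, {'g18', 'g19'}, auc_threshold=0.25)  # both genes at the bottom
score_c = toy_aucell(expr, gene_names, {'g0', 'g4'}, auc_threshold=0.25)    # one top, one at the cutoff
print(score_a, score_b, round(score_c, 3))
```

A larger `auc_threshold` lets lower-ranked genes contribute to the score, while a smaller one only rewards regulons whose genes are among the most highly expressed in the cell.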

Worked example

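import scanpy as sc             # Scanpy: single-cell analysis
import pandas as pd             # data handling
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns           # heatmap plotting
import loompy                   # loom file I/O

# ── Data preparation ──────────────────────────────────
adata = sc.read_h5ad('annotated.h5ad')

# Export the AnnData object to loom, the input format used by pySCENIC.
# The pyscenic CLI reads the "Gene" row attribute and the "CellID" column attribute by default,
# whereas adata.write_loom stores them as var_names/obs_names, so write them explicitly.
row_attrs = {'Gene': np.array(adata.var_names)}
col_attrs = {'CellID': np.array(adata.obs_names)}
loompy.create('expr_matrix.loom', adata.X.T, row_attrs, col_attrs)   # loom layout is genes × cells

print("loom file written; ready to run pySCENIC!")
print(f"  cells: {adata.n_obs}")
print(f"  genes: {adata.n_vars}")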

# Download the required database files (human, hg38); run these in a shell
# Source: the cisTarget database site, https://resources.aertslab.org/cistarget/

# Download the TF list
wget https://raw.githubusercontent.com/aertslab/pySCENIC/master/resources/hs_hgnc_tfs.txt

# Download a cisTarget database (hg38; one database is enough)
# Note: the file is large (~7GB), so a stable connection is needed
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather

# Download the motif annotation file
wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
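A quick sanity check before running `pyscenic grn` is to confirm that the TF list and the expression matrix use the same gene naming. The sketch below uses made-up gene names; `tf_overlap` is a hypothetical helper, and with real data the two inputs would be the lines of `hs_hgnc_tfs.txt` and `adata.var_names`.

```python
def tf_overlap(tf_names, gene_names):
    """Return (TFs found in the matrix, TFs that match only when case is ignored)."""
    genes = set(gene_names)
    genes_upper = {g.upper() for g in genes}
    tfs = [t.strip() for t in tf_names if t.strip()]   # drop blank lines
    found = [t for t in tfs if t in genes]
    case_only = [t for t in tfs if t not in genes and t.upper() in genes_upper]
    return found, case_only

# Toy inputs standing in for the TF list file and adata.var_names.
tf_names = ['FOXP3', 'MYC', 'GATA1', '']
gene_names = ['Foxp3', 'MYC', 'ACTB']

found, case_only = tf_overlap(tf_names, gene_names)
print(f"{len(found)} TFs found; {len(case_only)} match only after ignoring case: {case_only}")
```

A long `case_only` list may point to a species or naming mismatch, for example mouse-style symbols such as `Foxp3` checked against a human TF list.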

# ── Load and analyze the SCENIC results ───────────────

# Load the AUCell output (cells × regulons activity matrix)
auc_mtx = pd.read_csv('auc_mtx.csv', index_col=0)

# Attach the AUC matrix to the AnnData object.
# First normalize the column names by dropping suffixes such as "(+)".
auc_mtx.columns = [col.replace('(+)', '') for col in auc_mtx.columns]

# Store it in AnnData (each regulon column is the activity score of one TF)
adata.obsm['X_aucell'] = auc_mtx.loc[adata.obs_names].values  # rows reordered to the AnnData cell order
adata.uns['regulon_names'] = auc_mtx.columns.tolist()

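# UMAP of TF activity scores
sc.pp.neighbors(adata, use_rep='X_aucell')    # build the neighbor graph from regulon activity scores
sc.tl.umap(adata)                              # UMAP embedding

# Copy the regulon scores to plot into adata.obs. A bare color='FOXP3' is looked up in
# adata.obs and adata.var_names, so it would show gene expression (or fail), not regulon activity.
# Examples: FOXP3 (Treg) and MYC (proliferation).
for tf in ['FOXP3', 'MYC']:
    adata.obs[f'{tf}_regulon'] = auc_mtx.loc[adata.obs_names, tf].values

sc.pl.umap(
    adata,
    color=['FOXP3_regulon', 'MYC_regulon', 'cell_type'],   # regulon activity next to cell type
    vmin=0,
    frameon=False
)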

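# Heatmap: top TF activity scores per cell type
# Compute the mean AUC score per cell type. Reindex auc_mtx to the AnnData cell order first
# so that scores and cell-type labels refer to the same cells.
ct_auc = auc_mtx.loc[adata.obs_names].copy()
ct_auc['cell_type'] = adata.obs['cell_type'].values
ct_mean = ct_auc.groupby('cell_type').mean()   # mean regulon activity per cell type

# Keep the 30 TFs with the highest variance across cell types (the most discriminative regulons)
top_regulons = ct_mean.var().nlargest(30).index.tolist()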

# Draw the heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(
    ct_mean[top_regulons].T,        # transposed: rows = TFs, columns = cell types
    cmap='viridis',                 # color map
    xticklabels=True,
    yticklabels=True,
    cbar_kws={'label': 'AUC score'}
)
plt.title('Top Regulon Activity per Cell Type')
plt.tight_layout()
plt.savefig('scenic_heatmap.pdf', dpi=150, bbox_inches='tight')
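Raw AUC scores are not on a common scale across regulons, so a heatmap of means can be dominated by a few regulons with high baseline activity. One optional variation, sketched here with made-up numbers in place of the real `ct_mean`, is to z-score each regulon across cell types before plotting. This is an illustrative choice, not a step prescribed by pySCENIC.

```python
import pandas as pd

# Toy stand-in for ct_mean (rows = cell types, columns = regulons).
ct_mean = pd.DataFrame(
    {'FOXP3': [0.02, 0.20, 0.02], 'MYC': [0.30, 0.35, 0.40]},
    index=['B cell', 'Treg', 'Monocyte'],
)

# z-score every regulon (column) across cell types.
# A regulon with identical means in all cell types has zero spread and becomes NaN.
ct_z = (ct_mean - ct_mean.mean(axis=0)) / ct_mean.std(axis=0, ddof=0)

# Cell type in which each regulon is relatively most active.
top_cell_type = ct_z.idxmax(axis=0)
print(ct_z.round(2))
print(top_cell_type)
```

The resulting `ct_z` table can be passed to `sns.heatmap` in place of the raw means, ideally with a diverging color map centered at zero.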

Common errors and fixes

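Error	Cause	Fix
`FeatherError: file not found`	Wrong path to the database file	Check that the database file path is correct
`No TF found`	The TF list does not match the gene names	Make sure the genes are HGNC symbols (upper case)
`grn` step is very slow (>24 h)	Running single-threaded	Raise the worker count, e.g. `--num_workers 16`
Cell names in the AUC matrix do not match	Barcode formats differ	Normalize the barcode format before merging
Out of memory (OOM)	The database is too large	Lower `--chunk_size` for `ctx`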
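For the barcode mismatch row, the sketch below shows one way to normalize cell names before merging and to report which cells are still missing. The normalization rule (dropping a trailing `-<digits>` suffix such as the `-1` that Cell Ranger appends) and the helper name `normalize_barcode` are assumptions; the right rule depends on how the barcodes in your own files differ.

```python
import re
import pandas as pd

def normalize_barcode(barcode: str) -> str:
    """Hypothetical rule: strip a trailing '-<digits>' suffix, e.g. 'AAAC-1' -> 'AAAC'."""
    return re.sub(r'-\d+$', '', barcode)

# Toy stand-ins for auc_mtx and adata.obs_names.
auc = pd.DataFrame({'FOXP3': [0.1, 0.3]}, index=['AAAC-1', 'TTTG-1'])
obs_names = pd.Index(['TTTG', 'AAAC', 'GGGA'])

auc.index = auc.index.map(normalize_barcode)

# Keep the AnnData order and list cells without AUCell scores.
shared = obs_names[obs_names.isin(auc.index)]
missing = obs_names[~obs_names.isin(auc.index)]
aligned = auc.loc[shared]

print(aligned)
print(f"cells without scores: {list(missing)}")
```

Checking `missing` before calling `auc_mtx.loc[adata.obs_names]` turns a `KeyError` into an explicit list of the cells that need attention.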

Cheat sheet

# The three steps
pyscenic grn --num_workers 8 --output adj.csv expr.loom tf_list.txt
pyscenic ctx adj.csv db.feather --annotations_fname motifs.tbl \
          --expression_mtx_fname expr.loom --output regulons.csv --num_workers 8
pyscenic aucell expr.loom regulons.csv --output auc_mtx.csv

# Output files
# adj.csv      → TF-target association strength
# regulons.csv → validated regulon list
# auc_mtx.csv  → activity score of every regulon in every cell (the main result)
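The three commands can also be chained from Python, for instance inside a notebook or a pipeline script. The sketch below only uses the flags shown in the cheat sheet; `build_pyscenic_commands` and `run_pipeline` are hypothetical helpers, and the default `dry_run=True` prints the commands without executing anything.

```python
import subprocess

def build_pyscenic_commands(loom, tfs, db, motifs, workers=8):
    """Build the grn, ctx and aucell command lines as argument lists (hypothetical helper)."""
    return [
        ['pyscenic', 'grn', '--num_workers', str(workers),
         '--output', 'adj.csv', loom, tfs],
        ['pyscenic', 'ctx', 'adj.csv', db,
         '--annotations_fname', motifs,
         '--expression_mtx_fname', loom,
         '--output', 'regulons.csv', '--num_workers', str(workers)],
        ['pyscenic', 'aucell', loom, 'regulons.csv', '--output', 'auc_mtx.csv'],
    ]

def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(' '.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)   # check=True stops at the first failing step

commands = build_pyscenic_commands('expr.loom', 'tf_list.txt', 'db.feather', 'motifs.tbl')
run_pipeline(commands, dry_run=True)
```

Passing the arguments as lists avoids shell quoting problems with long database file names; set `dry_run=False` once the printed commands look right.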