CellPhoneDB ligand-receptor analysis: a permutation-test-driven tool for cell-cell communication

In one sentence

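CellPhoneDB v5 is a Python-based tool for cell-cell communication analysis. It uses a permutation test to identify ligand-receptor interactions that are significantly enriched between pairs of cell types. Its database covers more than 2,000 protein interactions, and it is particularly good at detecting signalling mediated by multi-subunit complexes.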
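To make the permutation test concrete, here is a minimal pure-Python sketch of the idea. It is not CellPhoneDB's implementation: the expression values are toy data, the helper names (`cluster_mean`, `interaction_score`, `permutation_pvalue`, `complex_expression`) are hypothetical, and only one simple ligand-receptor pair is tested. The score is the average of the ligand mean in one cluster and the receptor mean in the other; the p-value is the fraction of label shuffles whose score is at least as high as the observed one. For a multi-subunit complex, the least expressed subunit stands in for the whole complex.

```python
import random
from statistics import mean


def cluster_mean(expr, labels, cluster):
    """Mean expression of one gene over the cells labelled `cluster`."""
    return mean(e for e, lab in zip(expr, labels) if lab == cluster)


def complex_expression(subunit_means):
    """A complex is only as expressed as its least expressed subunit."""
    return min(subunit_means)


def interaction_score(ligand, receptor, labels, sender, receiver):
    """Average of the ligand mean in `sender` and the receptor mean in `receiver`."""
    return (cluster_mean(ligand, labels, sender)
            + cluster_mean(receptor, labels, receiver)) / 2


def permutation_pvalue(ligand, receptor, labels, sender, receiver,
                       iterations=1000, seed=0):
    """Fraction of label shuffles scoring at least as high as the real labels."""
    rng = random.Random(seed)
    observed = interaction_score(ligand, receptor, labels, sender, receiver)
    shuffled = list(labels)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # break the link between cells and cell types
        if interaction_score(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    return hits / iterations


# Toy data: 10 T cells followed by 10 tumor cells
labels = ["T_cell"] * 10 + ["Tumor"] * 10
ligand = [5.0] * 10 + [0.0] * 10      # expressed only in T cells
receptor = [0.0] * 10 + [5.0] * 10    # expressed only in tumor cells
flat = [1.0] * 20                     # same expression everywhere

p_enriched = permutation_pvalue(ligand, receptor, labels, "T_cell", "Tumor")
p_flat = permutation_pvalue(flat, flat, labels, "T_cell", "Tumor")
print("cell-type-specific pair:", p_enriched)
print("uniformly expressed pair:", p_flat)
```

A pair that is expressed uniformly can never beat its own shuffles, so it is not called significant however high its expression is; this is what distinguishes the permutation approach from a plain expression cutoff.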

Installation and setup

# Create a dedicated conda environment (recommended, avoids dependency conflicts)
conda create -n cellphonedb python=3.10 -y
conda activate cellphonedb

# Install CellPhoneDB v5 (latest stable release)
pip install cellphonedb

# Install the visualization dependencies
pip install ktplotspy plotnine matplotlib pandas numpy

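# Verify the installation
# Note: the `cellphonedb` command-line entry point appears to ship only with the 2.x/3.x releases;
# v4 and later are driven from Python, where `python -c "import cellphonedb"` is the equivalent check.
cellphonedb --help

# Download the database (required before first use)
cellphonedb database download   # fetches the latest CellPhoneDB database into ~/.cpdb/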

Core usage

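# ── Method 1: run from the command line (most common) ──────────────────────────

# Prepare two input files:
# 1. normalized_counts.tsv  →  rows = genes, columns = cells, normalized expression matrix
# 2. meta.tsv               →  two columns: Cell, cell_type (cell barcode and cell type)

# Run the statistical analysis (permutation test).
# Positional arguments: the cell type annotation file, then the normalized expression matrix.
# Options (a comment placed after a trailing backslash would break the line continuation):
#   --output-path    directory for the result files
#   --threads        number of parallel threads
#   --iterations     number of permutations (1000 is usually enough; more gives finer p-value resolution)
#   --threshold      minimum fraction of cells in a cluster that must express a gene
#   --output-format  result file format (tsv or csv)
cellphonedb method statistical_analysis \
  meta.tsv \
  normalized_counts.tsv \
  --output-path cpdb_results/ \
  --threads 8 \
  --iterations 1000 \
  --threshold 0.1 \
  --output-format tsv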

# ── Method 2: Squidpy interface (suited to spatial transcriptomics) ────────────────
# See the 338_Squidpy tutorial
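For readers on CellPhoneDB v4 or later, the same analysis is started from Python instead of the shell. The following is a hedged sketch of how that call might look. The module path `cellphonedb.src.core.methods.cpdb_statistical_analysis_method` and the keyword names reflect my understanding of the v5 API and should be checked against the official documentation; the database path `cellphonedb.zip` and the helpers `build_cpdb_kwargs` and `run_cpdb` are hypothetical names introduced for this example.

```python
def build_cpdb_kwargs(meta_path, counts_path, db_path, out_dir):
    """Collect the run settings in one place so they can be inspected before a long run."""
    return {
        "cpdb_file_path": db_path,        # assumed: path to the downloaded database archive
        "meta_file_path": meta_path,      # two columns: Cell, cell_type
        "counts_file_path": counts_path,  # normalized expression matrix
        "counts_data": "hgnc_symbol",     # identifier type used for the genes
        "iterations": 1000,               # number of permutations
        "threshold": 0.1,                 # minimum fraction of expressing cells per cluster
        "threads": 8,
        "pvalue": 0.05,
        "output_path": out_dir,
    }


def run_cpdb(kwargs):
    """Run the permutation analysis; requires CellPhoneDB to be installed."""
    # Imported lazily so that the settings can be built without the package present.
    from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
    return cpdb_statistical_analysis_method.call(**kwargs)


kwargs = build_cpdb_kwargs("meta.tsv", "normalized_counts.tsv",
                           "cellphonedb.zip", "cpdb_results/")
for name, value in kwargs.items():
    print(f"{name:>18} = {value}")

# results = run_cpdb(kwargs)   # uncomment once the input files and database exist
```

The settings mirror the command-line flags one to one, so the parameter table in the next section applies to both ways of running the tool.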

Parameter reference

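Parameter	Description	Suggested value
`--iterations`	Number of permutations (more gives finer p-values but runs slower)	1000
`--threshold`	Minimum fraction of cells in a cluster expressing a gene; genes below it are filtered out	0.1
`--threads`	Number of parallel threads	8
`--pvalue`	p-value cutoff used to call interactions significant	0.05
`--database`	Path to a custom database	Default `~/.cpdb/`
`--subsampling`	Subsample large datasets	Use when there are more than 50,000 cells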
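As an illustration of what `--threshold` measures, the sketch below computes the fraction of cells in one cluster that express a gene and compares it with the cutoff. The helper names `fraction_expressing` and `passes_threshold` are hypothetical, the values are toy data, and treating "expressed" as "greater than zero" and the boundary as inclusive are assumptions of this example.

```python
def fraction_expressing(values):
    """Fraction of cells in a cluster with non-zero expression of one gene."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > 0) / len(values)


def passes_threshold(values, threshold=0.1):
    """True if enough cells in the cluster express the gene."""
    return fraction_expressing(values) >= threshold


# A cluster of 20 cells in which a single cell expresses the gene: 5% < 10%
rare = [0.0] * 19 + [2.0]
# A cluster of 20 cells in which three cells express the gene: 15% >= 10%
common = [0.0] * 17 + [1.0, 1.5, 0.7]

print("rare gene kept:  ", passes_threshold(rare))
print("common gene kept:", passes_threshold(common))
```

Raising the threshold removes interactions driven by a handful of cells; lowering it keeps rare signals at the cost of more noise.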

Worked example

import pandas as pd             # data handling
import numpy as np
import matplotlib.pyplot as plt
import ktplotspy as kt          # plotting package built for CellPhoneDB results

# ── Prepare the input files ──────────────────────────────────────
import scanpy as sc
import anndata as ad

# Load the annotated single-cell dataset
adata = sc.read_h5ad('annotated.h5ad')

# Extract the normalized expression matrix (rows = genes, columns = cells)
import scipy.sparse as sp
counts_df = pd.DataFrame(
    adata.X.toarray() if sp.issparse(adata.X) else adata.X,
    index=adata.obs_names,          # cell barcodes as the row index
    columns=adata.var_names         # gene names as the column index
).T                                 # transpose so that rows = genes, columns = cells

# Save the matrix (CellPhoneDB expects normalized, typically log-normalized, values rather than raw counts)
counts_df.to_csv('normalized_counts.tsv', sep='\t')

# Extract and save the metadata file (it must contain the columns Cell and cell_type)
meta_df = pd.DataFrame({
    'Cell': adata.obs_names,         # cell barcode
    'cell_type': adata.obs['cell_type'].values  # cell type annotation
})
meta_df.to_csv('meta.tsv', sep='\t', index=False)

print("Input files are ready!")
print(f"  Cells: {len(meta_df)}")
print(f"  Genes: {len(counts_df)}")
print(f"  Cell types: {meta_df['cell_type'].unique().tolist()}")
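Before starting a long run it can help to confirm that the two files describe the same cells. The sketch below is one way to do that with the standard library only. The helper `check_inputs` is a hypothetical name, and the small files written to a temporary directory are toy stand-ins for `meta.tsv` and `normalized_counts.tsv`; the only assumptions are the ones stated above, namely a `Cell` column in the metadata and cell barcodes in the header row of the matrix.

```python
import csv
import os
import tempfile


def check_inputs(meta_path, counts_path):
    """Return the barcodes found only in the metadata and only in the matrix header."""
    with open(meta_path, newline="") as fh:
        meta_cells = {row["Cell"] for row in csv.DictReader(fh, delimiter="\t")}
    with open(counts_path, newline="") as fh:
        header = next(csv.reader(fh, delimiter="\t"))
    count_cells = set(header[1:])  # the first header field belongs to the gene column
    return sorted(meta_cells - count_cells), sorted(count_cells - meta_cells)


# Toy files: c3 is annotated but has no expression column, c4 the other way round
with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "meta.tsv")
    counts_path = os.path.join(tmp, "normalized_counts.tsv")
    with open(meta_path, "w", newline="") as fh:
        fh.write("Cell\tcell_type\nc1\tT_cell\nc2\tTumor\nc3\tTumor\n")
    with open(counts_path, "w", newline="") as fh:
        fh.write("Gene\tc1\tc2\tc4\nCD274\t0.0\t1.2\t0.4\n")
    only_meta, only_counts = check_inputs(meta_path, counts_path)

print("In metadata only:", only_meta)
print("In matrix only:  ", only_counts)
```

Both lists should be empty for files produced by the export code above, since they come from the same `adata` object; a non-empty list usually points to barcodes that were renamed or filtered in only one of the files.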

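# Run the CellPhoneDB analysis from the command line
# (--output-format tsv gives the result files the .tsv extension that is read back below)
conda activate cellphonedb
cellphonedb method statistical_analysis \
  meta.tsv \
  normalized_counts.tsv \
  --output-path cpdb_results/ \
  --threads 8 \
  --iterations 1000 \
  --threshold 0.1 \
  --output-format tsv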

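# ── Load and visualize the results ──────────────────────────────────
# CellPhoneDB output files:
# pvalues.tsv           →  p-value of every interaction for every cell type pair
# means.tsv             →  mean expression of every interaction
# significant_means.tsv →  means of the significant interactions only (p < 0.05)

# Load the result files; keep id_cp_interaction and interacting_pair as ordinary columns,
# because ktplotspy looks them up by name
pvalues = pd.read_csv('cpdb_results/pvalues.tsv', sep='\t')
means = pd.read_csv('cpdb_results/means.tsv', sep='\t')
signif_means = pd.read_csv('cpdb_results/significant_means.tsv', sep='\t')

# Cell type pair columns are named "typeA|typeB"; all other columns are interaction metadata
pair_cols = [c for c in signif_means.columns if '|' in c]

# An interaction counts as significant if it has a value for at least one cell type pair
is_signif = signif_means[pair_cols].notna().any(axis=1)
print(f"Significant interactions: {is_signif.sum()}")
print(f"Cell type pairs analysed: {len(pair_cols)}")

# Visualize with ktplotspy (recommended)

# Dot plot of the significant ligand-receptor pairs between two groups of cell types.
# plot_cpdb takes the means and p-value tables as DataFrames and returns a plotnine figure.
dotplot = kt.plot_cpdb(
    adata=adata,
    cell_type1='T_cell',              # first cell type (regular expressions are supported)
    cell_type2='Tumor',               # second cell type
    means=means,
    pvals=pvalues,
    celltype_key='cell_type',         # column in adata.obs holding the cell type labels
    figsize=(10, 8),
    title='T_cell / Tumor communication'
)
dotplot.save('cpdb_dotplot.pdf', dpi=150)

# Heatmap of the number of significant interactions per cell type pair
kt.plot_cpdb_heatmap(
    pvals=pvalues,
    degs_analysis=False,              # set to True for results from the DEG-based analysis
    figsize=(8, 8),
    title='Interaction Count Heatmap'
)
plt.savefig('cpdb_heatmap.pdf', dpi=150, bbox_inches='tight')

# Inspect one family of interactions, e.g. MHC-related ones (human MHC genes are named HLA-*)
is_mhc = signif_means['interacting_pair'].str.contains('HLA', na=False)
mhc_interactions = signif_means[is_mhc & is_signif]
print(f"Significant MHC-related interactions: {len(mhc_interactions)}")
print(mhc_interactions[['interacting_pair'] + pair_cols[:4]].head())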
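The heatmap summarises how many interactions are significant for each pair of cell types. A dependency-free sketch of that count is shown here for readers who want to check the numbers by hand. The helper `count_significant` is a hypothetical name, the inline table is toy data rather than real output, and the sketch assumes that pair columns are named `typeA|typeB` and that "significant" means a p-value below 0.05.

```python
import csv
import io

# Toy p-value table: one metadata column followed by three cell type pair columns
TOY_PVALUES = (
    "interacting_pair\tT_cell|Tumor\tTumor|T_cell\tT_cell|T_cell\n"
    "PAIR_A\t1.0\t0.001\t1.0\n"
    "PAIR_B\t0.02\t0.3\t0.04\n"
    "PAIR_C\t0.0\t0.0\t0.8\n"
)


def count_significant(tsv_text, alpha=0.05, sep="|"):
    """Count, per cell type pair column, the interactions with p-value below alpha."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pair_cols = [c for c in reader.fieldnames if sep in c]
    counts = {col: 0 for col in pair_cols}
    for row in reader:
        for col in pair_cols:
            if float(row[col]) < alpha:
                counts[col] += 1
    return counts


counts = count_significant(TOY_PVALUES)
for pair, n in counts.items():
    print(f"{pair:>15}: {n} significant interactions")
```

Note that `T_cell|Tumor` and `Tumor|T_cell` are counted separately, because the order of the two cell types follows the order of the two interaction partners.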

Common errors and fixes

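Error	Cause	Fix
`Database not found`	The database has not been downloaded	`cellphonedb database download`
`Column 'Cell' not found in meta`	Metadata column names do not match	Make sure the metadata has the columns `Cell` and `cell_type`
Extremely slow run (>12 h)	Too many cells	Add `--subsampling --subsampling-num-cells 5000`
All p-values are 1.0	Too few iterations, or gene identifiers that do not match the database	Increase to `--iterations 2000`; check that the gene ID type of the matrix matches `--counts-data`
Out of memory	The matrix is too large	Filter out lowly expressed genes first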

Cheat sheet

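# Install
pip install cellphonedb ktplotspy

# Download the database
cellphonedb database download

# Run the analysis (a single command)
cellphonedb method statistical_analysis meta.tsv counts.tsv \
  --output-path out/ --threads 8 --iterations 1000

# Output files (with the .tsv extension when run with --output-format tsv)
# pvalues.tsv           ← p-value of every ligand-receptor pair
# means.tsv             ← mean expression of every ligand-receptor pair
# significant_means.tsv ← means of the significant interactions only (p < 0.05)

# Visualize in Python
# pip install ktplotspy
# kt.plot_cpdb(adata=adata, cell_type1='T_cell', cell_type2='B_cell', means=means, pvals=pvalues, celltype_key='cell_type')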