featureCounts 基因计数¶

featureCounts 是 Subread 套件中的读长计数工具，能把比对后的 BAM 文件转化为基因表达矩阵，速度极快（百万级读长秒级处理），是 RNA-seq 分析中从比对到差异分析的关键桥梁。

核心知识点¶

知识点	说明
工具定位	将比对后的 reads 分配（assign）到基因/外显子等基因组特征上并计数
所属套件	Subread（包含 subread-align、subjunc、featureCounts 等）
最新版本	Subread v2.1.1 / Rsubread v2.22.1（2025.04）
核心优势	极快（每分钟处理数百万读长）、支持多种注释格式
输入文件	BAM/SAM 文件 + 基因注释文件（GTF/GFF/SAF）
输出文件	每个基因对应的读长计数表（可直接用于 DESeq2/edgeR）
注释格式	GTF（标准格式，Ensembl/GENCODE 提供）或 SAF（简化格式）

安装配置¶

方法一：Conda 安装（推荐）¶

# 安装 subread 包（featureCounts 包含在内）
conda install -c bioconda subread  # 安装 subread 套件
featureCounts -v                    # 查看版本

方法二：R 包安装（Rsubread）¶

# 在 R 中安装 Rsubread
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")  # 安装 BiocManager
BiocManager::install("Rsubread")     # 安装 Rsubread

方法三：下载预编译包¶

# 从 SourceForge 下载
wget https://sourceforge.net/projects/subread/files/subread-2.1.1/subread-2.1.1-Linux-x86_64.tar.gz
tar -xzf subread-2.1.1-Linux-x86_64.tar.gz  # 解压
export PATH=$PWD/subread-2.1.1-Linux-x86_64/bin:$PATH  # 添加到 PATH

基本使用¶

1. 单个样本计数¶

# 基本用法：对单个 BAM 文件计数
featureCounts \
  -a genes.gtf \               # 基因注释文件（GTF 格式）
  -o counts.txt \               # 输出文件名
  aligned.bam                   # 输入的 BAM 文件

# 输出文件：
# counts.txt          基因计数表
# counts.txt.summary  计数统计摘要（多少 reads 被分配、多少失败等）

2. 双端数据计数（最常用）¶

# 双端 RNA-seq 数据按片段（fragment）计数
featureCounts \
  -p --countReadPairs \        # -p 表示双端数据，--countReadPairs 按片段计数
  -T 8 \                       # 使用 8 个线程
  -t exon \                    # 按外显子特征计数（默认值）
  -g gene_id \                 # 按 gene_id 汇总（默认值）
  -a genes.gtf \               # GTF 注释文件
  -o counts.txt \              # 输出文件
  sample.sorted.bam            # 输入 BAM

3. 批量多样本计数（一次搞定）¶

# featureCounts 可以同时处理多个 BAM 文件
# 输出一个包含所有样本的计数矩阵
featureCounts \
  -p --countReadPairs \        # 双端数据
  -T 16 \                      # 16 线程
  -t exon \                    # 特征类型
  -g gene_id \                 # 汇总属性
  -a genes.gtf \               # 注释文件
  -o all_counts.txt \          # 输出计数矩阵
  aligned/*.sorted.bam         # 所有 BAM 文件（通配符）

# 输出是一个大矩阵：行 = 基因，列 = 样本

4. 查看输出结果¶

# 查看计数统计摘要
cat all_counts.txt.summary    # 每个样本分配成功/失败的读长数

# 查看计数矩阵（跳过注释列，只看计数）
cut -f1,7- all_counts.txt | head  # 第1列是基因名，第7列开始是各样本计数

高级用法¶

1. 链特异性参数设置¶

# 根据你的文库建库方式设置链特异性
featureCounts \
  -p --countReadPairs \
  -s 0 \                       # 0=非链特异, 1=正链特异, 2=反链特异(dUTP法)
  -a genes.gtf \
  -o counts.txt \
  sample.bam

# 如何判断用 0、1 还是 2？
# 大多数 Illumina stranded 建库用 -s 2（dUTP 法，如 TruSeq Stranded）
# 如果不确定，可以三种都试，看哪个分配最多
# 或者用 infer_experiment.py（RSeQC 工具）自动判断

2. 处理多重比对读长¶

# 默认只计数唯一比对的读长
# 如果要包含多重比对读长：
featureCounts \
  -p --countReadPairs \
  -M \                         # 计数多重比对的读长
  --fraction \                 # 按权重分配（1/N，N 是比对位置数）
  -a genes.gtf \
  -o counts_multi.txt \
  sample.bam

3. 在 R 中使用 Rsubread¶

library(Rsubread)  # 加载 Rsubread 包

# 基本计数
fc <- featureCounts(
  files = c("sample1.bam", "sample2.bam", "sample3.bam"),  # BAM 文件
  annot.ext = "genes.gtf",           # GTF 注释文件
  isGTFAnnotationFile = TRUE,        # 是 GTF 格式
  GTF.featureType = "exon",          # 特征类型
  GTF.attrType = "gene_id",          # 汇总属性
  isPairedEnd = TRUE,                # 双端数据
  countReadPairs = TRUE,             # 按片段计数
  nthreads = 8                       # 线程数
)

# 提取计数矩阵
counts_matrix <- fc$counts           # 计数矩阵
annotation <- fc$annotation          # 基因注释信息
stat <- fc$stat                      # 分配统计

# 查看前几行
head(counts_matrix)

4. 使用 SAF 格式注释¶

# SAF 格式比 GTF 更简单，只需 5 列
# GeneID  Chr  Start  End  Strand
# 适合自定义区域（如启动子、增强子）

# 创建 SAF 文件示例
echo -e "GeneID\tChr\tStart\tEnd\tStrand" > custom_regions.saf
echo -e "region1\tchr1\t1000\t2000\t+" >> custom_regions.saf
echo -e "region2\tchr1\t5000\t6000\t-" >> custom_regions.saf

# 用 SAF 格式计数
featureCounts \
  -F SAF \                     # 指定 SAF 格式
  -a custom_regions.saf \      # SAF 注释文件
  -o counts_custom.txt \       # 输出
  sample.bam

5. 整理输出给 DESeq2¶

# 把 featureCounts 输出整理成 DESeq2 可用的格式
# 去掉前 6 列注释，只保留基因名和计数
cut -f1,7- all_counts.txt | tail -n +2 > counts_for_deseq2.txt
# cut -f1,7-  → 提取第1列和第7列以后的列
# tail -n +2  → 去掉第一行（列名注释行）

常见报错与解决¶

报错信息	原因	解决方法
`Unassigned_NoFeatures` 比例很高	注释文件和基因组版本不匹配	确保 GTF 和参考基因组版本一致
`Unassigned_Ambiguity` 比例高	读长跨越多个基因	正常现象，或用 `--fracOverlap 0.5` 过滤
`reads are not paired`	单端数据用了 `-p` 参数	去掉 `-p --countReadPairs`
计数全为 0	染色体命名不一致（chr1 vs 1）	统一 GTF 和 BAM 的染色体命名
`WARNING: No features found`	-t 参数设错了	GFF3 可能需要 `-t gene` 而不是 `-t exon`
`incompatible with the library`	链特异性参数 -s 设错	试试 `-s 0`, `-s 1`, `-s 2` 看哪个分配最多

速查表¶

# ===== featureCounts 速查表 =====

# 安装
conda install -c bioconda subread     # 安装 subread 套件

# 单端计数
featureCounts -a genes.gtf -o counts.txt sample.bam

# 双端计数（最常用）
featureCounts -p --countReadPairs \
  -T 8 -t exon -g gene_id \
  -a genes.gtf -o counts.txt sample.bam

# 链特异性计数（dUTP 建库）
featureCounts -p --countReadPairs \
  -s 2 -T 8 \
  -a genes.gtf -o counts.txt sample.bam

# 批量多样本
featureCounts -p --countReadPairs \
  -T 16 -a genes.gtf \
  -o all_counts.txt *.sorted.bam

# 多重比对按权重计数
featureCounts -p --countReadPairs \
  -M --fraction \
  -a genes.gtf -o counts.txt sample.bam

# 常用参数
# -p --countReadPairs   双端数据按片段计数
# -T N                  线程数
# -t exon               特征类型（GTF 用 exon，GFF3 可能用 gene）
# -g gene_id            汇总属性
# -s 0/1/2              链特异性（0=无, 1=正链, 2=反链）
# -M                    包含多重比对
# --fraction            多重比对按权重
# -C                    不计数嵌合片段
# --extraAttributes     输出额外属性列
# -F SAF/GTF            注释格式