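DRAM: Annotating and Distilling Metabolic Functions from Metagenomes

In One Sentence

DRAM (Distilled and Refined Annotation of Metabolism) systematically annotates the metabolic functions of metagenome-assembled genomes (MAGs), assesses how complete each metabolic pathway is, and produces a readable summary report of metabolic capabilities.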

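Installation and Setup

# Install DRAM (conda is recommended because the dependency tree is complex)
conda create -n dram python=3.8 -y  # DRAM is sensitive to the Python version
conda activate dram

# Install DRAM from bioconda (v1.5.0 at the time of writing)
conda install -c bioconda -c conda-forge dram -y

# Verify the installation
DRAM.py --help                    # show help for the main program

# Configure/download the databases (critical step; the databases are large, roughly 60 GB)
# Covers KOfam (the full KEGG database needs a licensed copy), UniRef, Pfam, dbCAN, VOG, MEROPS and others
DRAM-setup.py prepare_databases \
    --output_dir ~/databases/DRAM/ \
    --threads 16                  # use multiple threads to speed up setup

# Show the configured databases
DRAM-setup.py print_config       # print every database path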
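
Because the database download is large, it can help to confirm that the target filesystem has enough room before starting. The following is a minimal sketch, not part of DRAM: `free_gb` and `has_free_gb` are hypothetical helper names, and the threshold you pass is your own estimate (the size quoted above is approximate).

```shell
#!/usr/bin/env bash
# free_gb DIR: print the free space, in whole GiB, on the filesystem holding DIR.
# Hypothetical helper; relies only on POSIX `df -Pk` (1 KiB blocks, one line per filesystem).
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_free_gb DIR MIN_GB: succeed only if at least MIN_GB GiB are free.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Example (commented out so that sourcing this file has no side effects):
# mkdir -p ~/databases
# has_free_gb ~/databases 60 || echo "not enough free space for the DRAM databases" >&2
```

The directory passed to `free_gb` must already exist, otherwise `df` reports an error.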

Core Usage

Step 1: Annotate a MAG

# Annotate a single genome
# --input_fasta: input MAG FASTA
# --output_dir: output directory
# --min_contig_size: discard contigs shorter than this (bp)
DRAM.py annotate \
    --input_fasta bin_01.fa \
    --output_dir dram_annot/bin_01/ \
    --min_contig_size 2500 \
    --threads 16                  # number of threads

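Batch-Annotating Multiple MAGs

# Batch annotation: pass a wildcard pattern to --input_fasta
# (quote it so that DRAM, not the shell, expands the pattern);
# all MAGs end up in one merged annotations.tsv inside --output_dir
DRAM.py annotate \
    --input_fasta "das_tool_bins/*.fa" \
    --output_dir dram_annotations/ \
    --min_contig_size 2500 \
    --threads 32                  # use more cores to speed things up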
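
Before launching a long batch run, it can be worth checking how many contigs in each MAG actually pass the `--min_contig_size` filter, since a MAG made up only of short contigs yields no genes. The sketch below is a plain `awk` pass over a FASTA file; `count_long_contigs` is a hypothetical helper, not a DRAM command.

```shell
#!/usr/bin/env bash
# count_long_contigs FASTA [MIN_LEN]: print how many sequences are >= MIN_LEN bp.
# MIN_LEN defaults to 2500 to mirror the --min_contig_size value used in this guide.
count_long_contigs() {
    local fasta=$1 min_len=${2:-2500}
    awk -v min="$min_len" '
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }  # new record: score the previous one
             { gsub(/[ \t\r]/, ""); len += length($0) }                # sequence line: add its length
        END  { if (seen && len >= min) n++; print n + 0 }              # score the last record
    ' "$fasta"
}

# Example (commented out): one line per MAG with the number of contigs that survive the filter.
# for f in das_tool_bins/*.fa; do
#     printf '%s\t%s\n' "$f" "$(count_long_contigs "$f" 2500)"
# done
```

A MAG that reports 0 here is a candidate for a lower `--min_contig_size` or for exclusion.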

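Step 2: Distill the Metabolic Summary

# Build the metabolic summary from the annotation results (the most important output)
# --input_file: the annotations.tsv written by annotate (a batch run writes one merged
#               table directly in its output directory, not one per MAG)
# --output_dir: output directory for the summary
DRAM.py distill \
    --input_file dram_annotations/annotations.tsv \
    --output_dir dram_distilled/ \
    --rrna_path dram_annotations/rrnas.tsv \
    --trna_path dram_annotations/trnas.tsv

# The results include:
# metabolism_summary.xlsx  - metabolic summary as an Excel workbook (several sheets)
# product.html             - interactive HTML visualization (the easiest to read)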

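Parameter Reference

Parameter	Description	Default
`--input_fasta`	Input MAG FASTA (a quoted wildcard is accepted)	required
`--output_dir`	Output directory	required
`--min_contig_size`	Minimum contig length (bp)	2500
`--threads`	Number of threads	10
`--low_activity_threshold`	Low-activity threshold (confirm with `DRAM.py annotate --help` that your version has this flag)	0.75
`--high_activity_threshold`	High-activity threshold (confirm with `DRAM.py annotate --help` that your version has this flag)	0.9
`--skip_trnascan`	Skip tRNA prediction (faster)	off
`--gtdb_taxonomy`	GTDB-Tk summary file used to add GTDB taxonomy (an `annotate` option)	optional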

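Output Files

ls dram_annot/bin_01/
# annotations.tsv    - main gene annotation table (the most important file)
# genes.faa          - protein sequences
# genes.fna          - gene nucleotide sequences
# genes.gff          - annotation in GFF format
# scaffolds.fna      - contig sequences
# rrnas.tsv          - rRNA predictions
# trnas.tsv          - tRNA predictions

# Key columns in annotations.tsv
# gene_id            - gene ID (first column; its header cell may be blank)
# fasta              - name of the MAG the gene came from
# scaffold           - contig the gene sits on
# ko_id              - KEGG Orthology (KO) identifier
# kegg_hit           - description of the KEGG/KOfam hit
# pfam_hits          - Pfam domain hits
# cazy_hits          - CAZy carbohydrate-active enzyme hits
# peptidase_family   - peptidase family (MEROPS)

ls dram_distilled/
# metabolism_summary.xlsx  - metabolic summary workbook (gene counts per function, several sheets)
# product.html             - interactive HTML heatmap of pathway completeness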
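
The position of a column in `annotations.tsv` depends on which databases were used, so it is safer to look a column up by its header name than to hard-code a field number. The sketch below shows one way to do that with `awk`; `col_index` and `count_annotated` are hypothetical helper names, and the only assumption about the file is that it is tab-separated with a header row.

```shell
#!/usr/bin/env bash
# col_index FILE NAME: print the 1-based position of column NAME in the TSV header.
# Exits non-zero if the header has no such column.
col_index() {
    awk -F'\t' -v name="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++)
                if ($i == name) { print i; exit }
            exit 1
        }
    ' "$1"
}

# count_annotated FILE NAME: count data rows with a non-empty value in column NAME.
count_annotated() {
    local idx
    idx=$(col_index "$1" "$2") || { echo "column $2 not found" >&2; return 1; }
    awk -F'\t' -v c="$idx" 'NR > 1 && $c != "" { n++ } END { print n + 0 }' "$1"
}

# Example (commented out):
# count_annotated dram_annot/bin_01/annotations.tsv ko_id
# count_annotated dram_annot/bin_01/annotations.tsv cazy_hits
```

Looking columns up by name keeps downstream one-liners working when a run adds or drops a database.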

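Worked Example

# Scenario: assess the carbohydrate metabolism capacity of gut metagenome MAGs

# 1. Batch-annotate all MAGs of at least medium quality (completeness > 50%, contamination < 10%)
# Do not pre-create dram_out: DRAM creates the output directory itself and may abort if it already exists

DRAM.py annotate \
    --input_fasta "hq_mags/*.fa" \
    --output_dir dram_out/ \
    --min_contig_size 2500 \
    --threads 32

# 2. Build the metabolic summary (the batch run writes one merged annotations.tsv in dram_out/)
DRAM.py distill \
    --input_file dram_out/annotations.tsv \
    --output_dir dram_summary/ \
    --rrna_path dram_out/rrnas.tsv \
    --trna_path dram_out/trnas.tsv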

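# 3. Tally CAZy carbohydrate-active enzyme hits (a focus of gut microbiome analyses)
# The column is located by its header name because its position depends on the databases used
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="cazy_hits") c=i; next}
            c && $c!="" {print $c}' dram_out/annotations.tsv \
    | sort | uniq -c | sort -rn   # count per distinct CAZy hit, most frequent first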

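# 4. Count the genes with a KO annotation in each MAG
# (the fasta column of the merged table identifies the source MAG)
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) {if ($i=="fasta") f=i; if ($i=="ko_id") k=i}; next}
            f && k && $k!="" {n[$f]++}
            END {for (m in n) print m ": " n[m] " genes with a KO annotation"}' \
    dram_out/annotations.tsv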
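
A gene count alone says little about which functions a MAG carries. As a follow-up, the sketch below lists the distinct KO identifiers found in each MAG, which is a convenient input for comparing MAGs or for pathway-mapping tools. `mag_ko_pairs` is a hypothetical helper; it assumes the `fasta` and `ko_id` columns described earlier and, as a precaution, splits `ko_id` cells on commas in case a gene carries more than one KO.

```shell
#!/usr/bin/env bash
# mag_ko_pairs FILE: print unique "MAG<TAB>KO" pairs from a DRAM annotations.tsv, sorted.
mag_ko_pairs() {
    awk -F'\t' '
        NR == 1 {
            # Locate both columns by header name rather than by position.
            for (i = 1; i <= NF; i++) { if ($i == "fasta") f = i; if ($i == "ko_id") k = i }
            if (!f || !k) { print "fasta or ko_id column missing" > "/dev/stderr"; exit 1 }
            next
        }
        $k != "" {
            n = split($k, kos, ",")          # assumption: multiple KOs, if any, are comma-separated
            for (j = 1; j <= n; j++) print $f "\t" kos[j]
        }
    ' "$1" | sort -u
}

# Example (commented out): number of distinct KOs per MAG.
# mag_ko_pairs dram_out/annotations.tsv | cut -f1 | uniq -c
```

Because the pairs are de-duplicated, a KO present in many gene copies counts once per MAG, so the result reflects functional breadth rather than gene dosage.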

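Common Errors and Fixes

Error message	Cause	Fix
`Database not configured`	The databases have not been set up	Run `DRAM-setup.py prepare_databases`
`No genes found`	The contigs are too short	Lower `--min_contig_size`
`Out of memory`	Not enough RAM	Run fewer jobs in parallel and annotate one MAG at a time
`mmseqs error`	MMseqs2 version problem	`conda update mmseqs2`
Annotation is extremely slow	Searches against the KEGG database are slow	Expected behaviour; the KEGG search takes time, so let it finish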
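
The advice to annotate one MAG at a time can be scripted as a simple loop. The following is a rough sketch under stated assumptions: `annotate_one_by_one` and the `DRAM_BIN` override are hypothetical names introduced here (the override only exists so the loop can be dry-run with `echo`), and the annotate flags are the ones used elsewhere in this guide.

```shell
#!/usr/bin/env bash
# annotate_one_by_one BIN_DIR OUT_ROOT [THREADS]
# Runs one `DRAM.py annotate` per *.fa file so that only a single MAG is processed at a time.
annotate_one_by_one() {
    local bin_dir=$1 out_root=$2 threads=${3:-8} fa name
    mkdir -p "$out_root"                       # parent directory only; DRAM creates each per-MAG directory
    for fa in "$bin_dir"/*.fa; do
        [ -e "$fa" ] || { echo "no .fa files in $bin_dir" >&2; return 1; }
        name=$(basename "$fa" .fa)
        # Skip MAGs that already finished, so the loop can be resumed after an interruption.
        if [ -s "$out_root/$name/annotations.tsv" ]; then continue; fi
        "${DRAM_BIN:-DRAM.py}" annotate \
            --input_fasta "$fa" \
            --output_dir "$out_root/$name" \
            --threads "$threads" || return 1
    done
}

# Example (commented out):
# annotate_one_by_one hq_mags dram_single 16
# DRAM_BIN=echo annotate_one_by_one hq_mags dram_single 16   # dry run: print the commands only
```

A MAG whose run was interrupted leaves a directory without `annotations.tsv`; remove that directory before resuming. Per-MAG results then need to be combined before `distill`; DRAM v1 releases provide a `merge_annotations` subcommand for this, but check `DRAM.py merge_annotations --help` for the exact options in your version.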

Cheat Sheet

# Set up the databases (one-off, takes several hours)
DRAM-setup.py prepare_databases --output_dir ~/db/DRAM/ --threads 16

# Annotate a single MAG
DRAM.py annotate --input_fasta bin.fa --output_dir out/ --min_contig_size 2500 --threads 16

# Batch annotation
DRAM.py annotate --input_fasta "bins/*.fa" --output_dir out/ --threads 32

# Build the metabolic summary (the core step)
DRAM.py distill --input_file out/annotations.tsv --output_dir summary/