宏基因组抗性基因分型¶

一句话概述¶

利用CARD-RGI、StarAMR、ResFinder等工具从宏基因组数据中鉴定抗生素抗性基因（ARG），进行抗性基因分型、定量和流行病学追踪，为抗菌药物耐药性监测提供生信解决方案。

核心知识点总览¶

知识点	关键内容	重要程度
CARD数据库	抗性基因综合数据库/ARO本体	⭐⭐⭐⭐⭐
RGI工具	CARD官方抗性基因鉴定工具	⭐⭐⭐⭐⭐
StarAMR	整合ResFinder/PointFinder/PlasmidFinder	⭐⭐⭐⭐
ResFinder	获得性抗性基因检测	⭐⭐⭐⭐
宏基因组ARG定量	reads mapping/RPKM归一化	⭐⭐⭐⭐
移动元件分析	ARG与可移动遗传元件(MGE)共定位	⭐⭐⭐
抗性组(Resistome)	环境/临床样本抗性基因全景	⭐⭐⭐
流行病学应用	AMR监测/One Health框架	⭐⭐⭐

各步骤详解¶

第一步：理解抗生素耐药性与抗性基因组学¶

白话解释： 抗生素耐药性（AMR）是全球公共卫生最严重的威胁之一。细菌通过多种机制抵抗抗生素：产酶降解、外排泵排出、靶点修改等。这些能力编码在抗性基因（ARG）中。宏基因组ARG分析就是从环境或临床样本的总DNA中，找出所有的抗性基因，了解耐药的"全景"——这被称为"抗性组"（resistome）。

技术细节： AMR机制分类： 1. 抗生素灭活：产β-内酰胺酶（如NDM, KPC）降解抗生素 2. 靶点修饰：突变改变药物结合靶点（如rpoB突变致利福平耐药） 3. 外排泵：主动将药物泵出细胞（如AcrAB-TolC） 4. 靶点保护：产蛋白质保护靶点（如tet(M)保护核糖体） 5. 通透性降低：孔蛋白突变减少药物进入（如OprD缺失）

核心数据库CARD（The Comprehensive Antibiotic Resistance Database）使用ARO（Antibiotic Resistance Ontology）对抗性基因进行分类注释。

# CARD数据库结构
# ARO (Antibiotic Resistance Ontology) - 基因分类系统
# - Drug class: 抗药物类别
# - Resistance mechanism: 耐药机制
# - AMR gene family: 基因家族
# - Target: 药物靶点

# 下载最新CARD数据库（当前数据版本4.0.1，2025年2月更新）
wget https://card.mcmaster.ca/latest/data
tar -xvf data ./card.json

# 安装RGI工具（当前版本6.0.5，推荐conda安装）
# conda install -c bioconda rgi  # 推荐方式
pip install rgi  # 或pip安装
rgi auto_load --card_json card.json

第二步：宏基因组数据预处理¶

白话解释： 宏基因组数据中混合了大量宿主DNA（如人类样本）和各种微生物的DNA。在进行ARG分析之前，需要去除宿主序列、质控、以及可选的组装步骤。可以选择基于reads的方法（直接比对）或基于组装的方法（先组装再注释）。

技术细节：

# === 宏基因组预处理 ===

# 1. 质控
fastp -i sample_R1.fq.gz -I sample_R2.fq.gz \
    -o clean_R1.fq.gz -O clean_R2.fq.gz \
    --detect_adapter_for_pe --thread 16

# 2. 去宿主序列（人类样本）
bowtie2 -x human_genome -1 clean_R1.fq.gz -2 clean_R2.fq.gz \
    --threads 16 --very-sensitive \
    --un-conc-gz nonhost_R%.fq.gz \
    -S /dev/null

# 3. 宏基因组组装（可选，用于contig-level分析）
megahit -1 nonhost_R1.fq.gz -2 nonhost_R2.fq.gz \
    -o megahit_assembly --min-contig-len 500 \
    --num-cpu-threads 32 --memory 0.9

# 4. 基因预测（组装后）
prodigal -i megahit_assembly/final.contigs.fa \
    -o genes.gff -a proteins.faa -d genes.fna \
    -p meta -f gff

第三步：RGI抗性基因鉴定¶

白话解释： RGI（Resistance Gene Identifier）是CARD数据库的官方分析工具。它可以直接对组装好的contigs或预测的蛋白序列进行注释，找出其中的抗性基因。RGI使用三种检测模型：完美匹配（Perfect）、严格匹配（Strict）和松散匹配（Loose），覆盖从高置信度到探索性的不同需求。

技术细节： RGI检测范例（detection paradigm）： - Perfect：序列与参考100%一致——确定的已知ARG - Strict：基于bit-score阈值的高置信检测——推断的ARG - Loose：低于strict阈值但仍有同源性——潜在的ARG（用于发现新变体）

# === RGI 分析 ===

# 1. 基于Contig的RGI分析（核酸序列）
rgi main \
    --input_sequence megahit_assembly/final.contigs.fa \
    --output_file sample_rgi \
    --input_type contig \
    --alignment_tool BLAST \
    --num_threads 16 \
    --clean \
    --include_loose

# 2. 基于蛋白序列的RGI分析
rgi main \
    --input_sequence proteins.faa \
    --output_file sample_rgi_protein \
    --input_type protein \
    --alignment_tool DIAMOND \
    --num_threads 16

# 3. 基于Reads的RGI bwt分析（不需组装）
rgi bwt \
    --read_one nonhost_R1.fq.gz \
    --read_two nonhost_R2.fq.gz \
    --output_file sample_rgi_bwt \
    --aligner bowtie2 \
    --threads 16 \
    --include_wildcard

# 4. 结果解析
# 输出文件: sample_rgi.txt (tab-delimited)
# 关键列: Best_Hit_ARO, Drug Class, Resistance Mechanism,
#         AMR Gene Family, Cut_Off (Perfect/Strict/Loose)

# RGI结果解析与统计
import pandas as pd
import matplotlib.pyplot as plt

# 读取RGI结果
rgi_results = pd.read_csv("sample_rgi.txt", sep="\t")

# 过滤只保留 Perfect + Strict hits
confident_hits = rgi_results[rgi_results['Cut_Off'].isin(['Perfect', 'Strict'])]

# 按Drug Class统计
drug_class_counts = confident_hits['Drug Class'].str.split('; ').explode().value_counts()
print("Top 10 Drug Classes:")
print(drug_class_counts.head(10))

# 按Resistance Mechanism统计
mech_counts = confident_hits['Resistance Mechanism'].value_counts()
print("\nResistance Mechanisms:")
print(mech_counts)

# 按AMR Gene Family统计
family_counts = confident_hits['AMR Gene Family'].value_counts()
print("\nTop 10 AMR Gene Families:")
print(family_counts.head(10))

第四步：StarAMR整合分析¶

白话解释： StarAMR是加拿大公共卫生署开发的整合工具，一站式完成抗性基因检测（ResFinder）、点突变耐药检测（PointFinder）和质粒鉴定（PlasmidFinder）。特别适合临床分离株的快速AMR分型。

技术细节： StarAMR整合了三个来自DTU（丹麦技术大学）的数据库： - ResFinder：获得性抗性基因（通过水平基因转移获得） - PointFinder：染色体突变导致的耐药 - PlasmidFinder：质粒replicon鉴定

# === StarAMR 安装与运行 ===

# 安装
pip install staramr
staramr db build --dir staramr_db

# 运行（输入为组装的contigs或完整基因组）
staramr search \
    --database staramr_db \
    --output-dir staramr_results \
    --pid-threshold 90 \
    --percent-length-overlap-resfinder 60 \
    --nprocs 16 \
    assembly1.fasta assembly2.fasta assembly3.fasta

# 输出文件说明
# resfinder.tsv     - 获得性ARG检测结果
# pointfinder.tsv   - 点突变耐药
# plasmidfinder.tsv - 质粒replicon
# mlst.tsv          - MLST分型
# summary.tsv       - 综合汇总
# detailed_summary.tsv - 详细汇总

# 查看结果
cat staramr_results/summary.tsv

第五步：ResFinder独立使用¶

白话解释： ResFinder专注于检测获得性抗性基因——即通过质粒、转座子等水平基因转移方式获得的耐药基因。这些基因是耐药传播最令人担忧的部分，因为它们可以在不同细菌之间快速传递。

技术细节：

# === ResFinder 命令行使用（当前版本4.7.2，2025年4月更新）===
pip install resfinder

# 运行 ResFinder
python -m resfinder \
    --inputfasta assembly.fasta \
    --outputPath resfinder_output \
    --acquired \
    --species "Escherichia coli" \
    --min_cov 0.6 \
    --threshold 0.9 \
    --db_path_res resfinder_db

# 同时检测点突变（需指定物种）
python -m resfinder \
    --inputfasta assembly.fasta \
    --outputPath resfinder_output \
    --acquired --point \
    --species "Escherichia coli" \
    --db_path_res resfinder_db \
    --db_path_point pointfinder_db

# ResFinder 也支持reads输入
python -m resfinder \
    --inputfastq reads_R1.fq.gz reads_R2.fq.gz \
    --outputPath resfinder_reads_output \
    --acquired \
    --species "Klebsiella pneumoniae"

第六步：宏基因组ARG定量分析¶

白话解释： 仅仅知道有哪些ARG存在还不够，我们还想知道它们的丰度有多高——不同样本间ARG的丰度差异往往更有意义。这需要把reads比对到ARG数据库，然后做适当的归一化。

技术细节： ARG定量归一化策略： - RPKM：考虑基因长度和测序深度 - copies per 16S：用16S rRNA基因作为内参归一化 - TPM：更适合跨样本比较 - ARG copies per cell：结合基因组当量估算每个细胞的ARG拷贝数

# === ARG定量（reads-based）===

# 方法1：比对到CARD序列数据库
# 构建比对索引
bowtie2-build card_nucleotide_seqs.fasta card_index

# 比对
bowtie2 -x card_index \
    -1 nonhost_R1.fq.gz -2 nonhost_R2.fq.gz \
    --threads 16 --very-sensitive \
    -S sample_vs_card.sam

# 计算覆盖度和丰度
samtools view -bS -F 4 sample_vs_card.sam | samtools sort > sample_vs_card.sorted.bam
samtools index sample_vs_card.sorted.bam
samtools idxstats sample_vs_card.sorted.bam > sample_card_counts.txt

# 方法2：使用ARGs-OAP流程（推荐宏基因组）
# ARGs-OAP = Antibiotic Resistance Genes Online Analysis Pipeline
# https://github.com/xiaole99/ARGs-OAP-v3.0

# 方法3：ShortBRED（高特异性短标记序列）
shortbred_identify.py \
    --goi card_proteins.faa \
    --ref uniref90.fasta \
    --markers card_markers.faa \
    --tmp tmp_identify

shortbred_quantify.py \
    --markers card_markers.faa \
    --wgs nonhost_R1.fq.gz \
    --results sample_arg_shortbred.tsv \
    --tmp tmp_quantify \
    --threads 16

# ARG丰度归一化
import pandas as pd
import numpy as np

# 读取比对计数
counts = pd.read_csv("sample_card_counts.txt", sep="\t",
                      header=None, names=['gene', 'length', 'mapped', 'unmapped'])
counts = counts[counts['gene'] != '*']

# 总reads数
total_reads = counts['mapped'].sum() + counts['unmapped'].sum()

# 计算RPKM
counts['RPKM'] = (counts['mapped'] * 1e9) / (counts['length'] * total_reads)

# 计算copies per 16S
# 假设已知16S比对reads数
reads_16s = 50000  # 比对到16S的reads数
gene_16s_length = 1500  # 16S长度

copies_16s = (reads_16s * 1e3) / gene_16s_length
counts['copies_per_16S'] = (counts['mapped'] / counts['length'] * 1e3) / copies_16s

# 按药物类别汇总
# 需要ARO注释文件关联gene -> drug class
aro_mapping = pd.read_csv("aro_categories.tsv", sep="\t")
merged = counts.merge(aro_mapping, on='gene')
drug_class_abundance = merged.groupby('Drug Class')['RPKM'].sum().sort_values(ascending=False)

第七步：抗性组可视化与比较分析¶

白话解释： 当有多个样本（如不同环境、不同时间点、治疗前后）时，需要对抗性组进行比较分析：哪些样本的ARG总丰度更高？哪些药物类别的耐药基因富集？不同样本的ARG组成有何差异？

技术细节：

# === R语言ARG可视化 ===
library(ggplot2)
library(pheatmap)
library(vegan)

# 读取多样本ARG丰度矩阵
arg_matrix <- read.csv("arg_abundance_matrix.csv", row.names = 1)
# 行=ARG, 列=样本

# 热图展示ARG丰度
pheatmap(log2(arg_matrix + 1),
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean",
         show_rownames = FALSE,
         annotation_col = sample_meta)

# ARG多样性分析（与微生物多样性类似）
arg_diversity <- diversity(t(arg_matrix), index = "shannon")
arg_richness <- specnumber(t(arg_matrix))

# PCoA分析
arg_dist <- vegdist(t(arg_matrix), method = "bray")
pcoa_result <- cmdscale(arg_dist, k = 2, eig = TRUE)
pcoa_df <- data.frame(
  PC1 = pcoa_result$points[, 1],
  PC2 = pcoa_result$points[, 2],
  Group = sample_meta$group
)
ggplot(pcoa_df, aes(x = PC1, y = PC2, color = Group)) +
  geom_point(size = 3) +
  stat_ellipse() +
  theme_minimal()

# PERMANOVA检验组间差异
adonis2(arg_dist ~ Group, data = sample_meta)

实战命令速查¶

# === 快速ARG分型流水线 ===

# 单基因组/分离株
rgi main -i genome.fasta -o rgi_result -t contig -a DIAMOND -n 8
staramr search -o star_result genome.fasta

# 宏基因组（基于reads）
rgi bwt --read_one R1.fq.gz --read_two R2.fq.gz \
    --output_file meta_rgi --aligner bowtie2 --threads 16

# 宏基因组（基于组装）
megahit -1 R1.fq.gz -2 R2.fq.gz -o assembly
prodigal -i assembly/final.contigs.fa -a proteins.faa -p meta
rgi main -i proteins.faa -o assembly_rgi -t protein -a DIAMOND -n 16

# 数据库更新
rgi auto_load --card_json /path/to/card.json
staramr db build --dir /path/to/staramr_db

面试常问点¶

Q1: CARD数据库中Perfect/Strict/Loose hit的区别？¶

A: Perfect hit表示序列与参考完全一致（包含所有已知的耐药决定残基），是确认的已知ARG。Strict hit通过bit-score阈值检测，序列有一定变异但仍高置信为ARG，适用于检测已知基因的新变体。Loose hit低于strict阈值但仍有同源性，可能是远缘同源物或新的潜在ARG。实际分析中通常只报告Perfect+Strict，Loose用于探索性研究。

Q2: 获得性耐药基因和点突变耐药的区别？¶

A: 获得性ARG是通过水平基因转移（HGT）获得的完整基因（如质粒携带的blaNDM），可用ResFinder/RGI通过序列比对检测。点突变耐药是染色体基因的特定位点突变导致耐药（如gyrA S83L导致喹诺酮耐药），需要PointFinder通过突变比对检测。前者传播性强，后者遗传稳定。

Q3: 宏基因组ARG分析中reads-based和assembly-based方法的优劣？¶

A: Reads-based方法（如RGI bwt）：无需组装，可检测低丰度ARG，计算快，但可能有短序列匹配的假阳性。Assembly-based方法：可获得完整基因和基因组context（如邻近的MGE），检测更准确，但低丰度基因可能无法组装出来。推荐两种方法互补使用。

Q4: 如何评估ARG是否在可移动遗传元件上？¶

A: 关键方法：(1) 检查ARG所在contig是否包含MGE标记（转座酶、整合酶、IS元件）；(2) 使用PlasmidFinder/PlasFlow判断contig是否来自质粒；(3) 使用ICEfinder检测整合型接合元件；(4) 分析ARG基因组context（上下游基因）。ARG与MGE的共定位提示传播风险高。

Q5: 如何归一化宏基因组中的ARG丰度以便跨样本比较？¶

A: 常用方法：(1) RPKM/FPKM/TPM：校正基因长度和测序深度；(2) copies per 16S rRNA gene：以16S作为细菌细胞数的代理指标，得到"每个细胞的ARG拷贝数"；(3) 相对丰度：ARG reads占总metagenome reads的比例。推荐copies per 16S，因为它有明确的生物学意义。

Q6: One Health框架下AMR监测的意义？¶

A: One Health强调人-动物-环境的AMR互联。宏基因组ARG分析使我们能追踪：(1) 临床耐药基因是否来源于养殖环境；(2) 污水处理是否能有效去除ARG；(3) 同一ARG在不同生态位间的传播链路。这需要标准化的采样、分析流程和数据共享框架。

易错点¶

1. Loose hit当作确认的ARG报告¶

RGI默认包含Loose hit，但这些是低置信度预测，可能是同源但无耐药功能的基因。发表文章应只报告Perfect+Strict级别，或明确说明包含了Loose hit及其生物学意义。

2. 忽略基因长度覆盖度阈值¶

一个ARG只有30%被覆盖也被报告为阳性，但实际上可能只是碎片匹配。应设置最小覆盖度阈值（如>60%长度覆盖），确保检测的ARG是较完整的。

3. 物种信息缺失导致点突变分析失败¶

PointFinder需要指定物种才能确定哪些突变是耐药相关的（如大肠杆菌gyrA S83L）。宏基因组数据中物种混合，直接用PointFinder效果有限，更适合单分离株分析。

4. 数据库版本不一致¶

不同版本的CARD/ResFinder数据库基因ID和分类可能变化。跨研究比较必须使用相同版本的数据库，或提供版本号以便复现。

5. 将检测到ARG等同于表型耐药¶

基因存在不等于表达，不等于耐药表型。基因可能被沉默、截断或处于不适合的宿主中。ARG检测结果需结合药敏试验（AST）验证。

补充知识¶

其他ARG数据库与工具¶

AMRFinderPlus（NCBI）：NCBI官方AMR注释工具，整合HMM搜索与BLAST比对，可同时检测ARG、点突变、毒力因子和应激抗性基因（数据库版本4.0，2025年4月）
MEGARES：宏基因组ARG定量优化的数据库
ARDB：已停止更新，被CARD取代
SARG：Structured ARG database，按层级分类
deepARG：基于深度学习的ARG预测

前沿方向¶

长读段ARG分析：Nanopore/PacBio能直接读取ARG基因组context和所在质粒
抗性基因表达分析：宏转录组（metatranscriptomics）确认ARG是否活跃表达
耐药突变实时监测：基于测序的医院耐药实时监测系统
机器学习预测：CARD 2023已引入机器学习支持，用于从序列预测新ARG和耐药表型
真菌耐药数据库：FungAMR（Bedard et al., Nature Microbiology, 2025）扩展了CARD到真菌抗菌素耐药领域

引用推荐¶

CARD/RGI: Alcock et al., Nucleic Acids Research, 2023（CARD 2023，当前数据版本4.0.1，RGI版本6.0.5）
ResFinder 4.0: Bortolaia et al., Journal of Antimicrobial Chemotherapy, 2020（当前软件版本4.7.2，2025年4月更新）
StarAMR: GitHub - phac-nml/staramr
ARGs-OAP: Yin et al., Bioinformatics, 2018
AMRFinderPlus: Feldgarden et al., Scientific Reports, 2021（NCBI官方，数据库版本4.0，2025年4月更新）

宏基因组抗性基因分型¶

一句话概述¶

核心知识点总览¶

各步骤详解¶

第一步：理解抗生素耐药性与抗性基因组学¶

第二步：宏基因组数据预处理¶

第三步：RGI抗性基因鉴定¶

第四步：StarAMR整合分析¶

第五步：ResFinder独立使用¶

第六步：宏基因组ARG定量分析¶

第七步：抗性组可视化与比较分析¶

实战命令速查¶

面试常问点¶

Q1: CARD数据库中Perfect/Strict/Loose hit的区别？¶

Q2: 获得性耐药基因和点突变耐药的区别？¶

Q3: 宏基因组ARG分析中reads-based和assembly-based方法的优劣？¶

Q4: 如何评估ARG是否在可移动遗传元件上？¶

Q5: 如何归一化宏基因组中的ARG丰度以便跨样本比较？¶

Q6: One Health框架下AMR监测的意义？¶

易错点¶

1. Loose hit当作确认的ARG报告¶

2. 忽略基因长度覆盖度阈值¶

3. 物种信息缺失导致点突变分析失败¶

4. 数据库版本不一致¶

5. 将检测到ARG等同于表型耐药¶

补充知识¶

其他ARG数据库与工具¶

前沿方向¶

引用推荐¶

📚 相关文章推荐