631_微生物组机器学习pipeline¶

一句话概述： 微生物组机器学习通过Random Forest、XGBoost等算法从微生物丰度数据中寻找疾病生物标志物或预测疾病状态，关键在于正确的数据预处理（CLR转换）、严格的交叉验证（防止数据泄露）和合理的特征选择策略。

核心知识点速查表¶

概念	白话解释
Classification	分类——预测样本属于哪个组（如疾病vs对照）
Random Forest	随机森林——用很多决策树投票决定分类结果
XGBoost	极端梯度提升——迭代建树纠正前一棵树的错误
Cross-validation	交叉验证——数据分多份，轮流做训练和测试
Data leakage	数据泄露——测试数据的信息"泄露"到训练中，导致结果虚高
AUROC/AUC	受试者工作特征曲线下面积——分类性能指标，0.5=随机猜，1.0=完美
Feature importance	特征重要性——每个微生物对分类的贡献程度
CLR transformation	中心化对数比转换——组成数据的推荐转换方法
LOSO	Leave-One-Study-Out——留一研究交叉验证（防止批次效应泄露）

一、安装¶

# === Python环境配置 ===
# pip install scikit-learn xgboost pandas numpy matplotlib seaborn
# pip install scikit-bio  # 微生物组专用

import numpy as np           # 数值计算
import pandas as pd          # 数据处理
from sklearn.ensemble import RandomForestClassifier  # 随机森林
from sklearn.model_selection import (
    StratifiedKFold,          # 分层交叉验证
    cross_val_score,          # 交叉验证评分
    GridSearchCV              # 网格搜索
)
from sklearn.metrics import (
    roc_auc_score,            # AUC
    classification_report,    # 分类报告
    confusion_matrix,         # 混淆矩阵
    roc_curve                 # ROC曲线
)
from sklearn.preprocessing import StandardScaler  # 标准化
import xgboost as xgb        # XGBoost
import matplotlib.pyplot as plt  # 画图
import seaborn as sns         # 高级可视化

二、数据预处理¶

2.1 读取和清洗数据¶

# === 读取微生物组丰度数据 ===
# 假设：行=样本，列=微生物特征
abundance = pd.read_csv("genus_abundance.csv", index_col=0)  # 属级丰度表
metadata = pd.read_csv("metadata.csv", index_col=0)          # 元数据

# 确认样本对齐
common_samples = abundance.index.intersection(metadata.index)  # 共同样本
abundance = abundance.loc[common_samples]  # 对齐丰度表
metadata = metadata.loc[common_samples]    # 对齐元数据

print(f"样本数: {len(common_samples)}")
print(f"特征数: {abundance.shape[1]}")
print(f"类别分布:\n{metadata['disease_status'].value_counts()}")

# === 基本过滤 ===
# 去除在<10%样本中出现的低流行度特征
prevalence = (abundance > 0).sum(axis=0) / len(abundance)  # 流行度
abundance = abundance.loc[:, prevalence >= 0.1]  # 保留≥10%流行度的特征
print(f"过滤后特征数: {abundance.shape[1]}")

2.2 数据转换（关键步骤）¶

# === CLR转换（组成数据推荐方法）===
# 微生物组丰度是"组成数据"——相对丰度加起来=100%
# 不能直接用——一个菌上升必然导致其他菌的相对丰度下降（假阴性/假阳性）
# CLR转换打破这个约束

def clr_transform(data):
    """中心化对数比（CLR）转换"""
    # 加伪计数避免log(0)
    data_pseudo = data + 0.5  # 伪计数0.5（也可以用样本最小非零值的一半）

    # 几何均值
    log_data = np.log(data_pseudo)  # 取对数
    geometric_mean = log_data.mean(axis=1)  # 每个样本的几何均值（取对数后的均值）

    # CLR = log(x) - 几何均值
    clr_data = log_data.subtract(geometric_mean, axis=0)  # 减去几何均值

    return clr_data  # 返回CLR转换后的数据

# 执行CLR转换
X_clr = clr_transform(abundance)  # CLR转换
y = metadata['disease_status'].values  # 标签

print(f"CLR转换后数据形状: {X_clr.shape}")
print(f"示例值范围: [{X_clr.values.min():.2f}, {X_clr.values.max():.2f}]")

# === 其他可选转换 ===
# 1. 相对丰度（不推荐直接用于ML）
# relative = abundance.div(abundance.sum(axis=1), axis=0)

# 2. Presence/Absence（二值化，简单但丢失丰度信息）
# binary = (abundance > 0).astype(int)

# 3. 对数转换
# log_data = np.log10(abundance + 1)

三、Random Forest分类¶

3.1 基本分类流程¶

# === Random Forest分类 ===
# 第1步：定义交叉验证
cv = StratifiedKFold(
    n_splits=5,          # 5折交叉验证
    shuffle=True,        # 随机打乱
    random_state=42      # 随机种子（可重复）
)

# 第2步：训练Random Forest
rf = RandomForestClassifier(
    n_estimators=500,    # 500棵树（一般200-1000）
    max_depth=None,      # 不限制深度
    min_samples_leaf=5,  # 叶节点最少5个样本
    max_features='sqrt', # 每次分裂考虑sqrt(n)个特征
    class_weight='balanced',  # 自动平衡类别权重（应对不平衡数据）
    random_state=42,     # 随机种子
    n_jobs=-1            # 使用所有CPU核
)

# 第3步：交叉验证评估
cv_scores = cross_val_score(
    rf, X_clr, y,
    cv=cv,
    scoring='roc_auc'    # 用AUC评估
)

print(f"=== Random Forest 5-fold CV ===")
print(f"AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print(f"各折AUC: {[f'{s:.3f}' for s in cv_scores]}")

3.2 完整评估流程¶

# === 详细评估（手动交叉验证以获取更多信息）===
from sklearn.metrics import roc_curve, auc  # ROC曲线

all_y_true = []   # 存储真实标签
all_y_prob = []   # 存储预测概率
all_y_pred = []   # 存储预测标签
importances = []  # 存储特征重要性

for fold, (train_idx, test_idx) in enumerate(cv.split(X_clr, y)):
    # 分割数据
    X_train = X_clr.iloc[train_idx]  # 训练集特征
    X_test = X_clr.iloc[test_idx]    # 测试集特征
    y_train = y[train_idx]           # 训练集标签
    y_test = y[test_idx]             # 测试集标签

    # 训练模型
    rf.fit(X_train, y_train)         # 拟合

    # 预测
    y_prob = rf.predict_proba(X_test)[:, 1]  # 预测概率
    y_pred = rf.predict(X_test)              # 预测标签

    # 收集结果
    all_y_true.extend(y_test)
    all_y_prob.extend(y_prob)
    all_y_pred.extend(y_pred)
    importances.append(rf.feature_importances_)  # 特征重要性

    fold_auc = roc_auc_score(y_test, y_prob)
    print(f"Fold {fold+1}: AUC={fold_auc:.3f}")

# === 画ROC曲线 ===
fpr, tpr, _ = roc_curve(all_y_true, all_y_prob)  # 计算ROC
roc_auc = auc(fpr, tpr)  # 计算AUC

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2,
         label=f'RF (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1,
         label='Random (AUC = 0.500)')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve - Microbiome Classification', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.pdf', dpi=300)
plt.close()

# === 打印分类报告 ===
print("\n=== Classification Report ===")
print(classification_report(all_y_true, all_y_pred))

3.3 特征重要性分析¶

# === 特征重要性（哪些微生物最重要？）===
mean_importance = np.mean(importances, axis=0)  # 平均重要性
std_importance = np.std(importances, axis=0)     # 重要性标准差

# 排序
feature_imp = pd.DataFrame({
    'feature': X_clr.columns,          # 特征名
    'importance': mean_importance,      # 重要性
    'std': std_importance               # 标准差
}).sort_values('importance', ascending=False)

# Top 20重要特征
print("=== Top 20 Important Features ===")
print(feature_imp.head(20).to_string(index=False))

# 画条形图
top20 = feature_imp.head(20)
plt.figure(figsize=(10, 8))
plt.barh(range(len(top20)), top20['importance'],
         xerr=top20['std'], color='steelblue', alpha=0.8)
plt.yticks(range(len(top20)), top20['feature'])
plt.xlabel('Feature Importance (Mean Decrease Impurity)')
plt.title('Top 20 Important Microbial Features')
plt.gca().invert_yaxis()  # 最重要的在上面
plt.tight_layout()
plt.savefig('feature_importance.pdf', dpi=300)
plt.close()

四、XGBoost分类¶

# === XGBoost分类 ===
xgb_clf = xgb.XGBClassifier(
    n_estimators=300,           # 300棵树
    max_depth=4,                # 最大深度4（防止过拟合）
    learning_rate=0.1,          # 学习率
    subsample=0.8,              # 每棵树使用80%的样本
    colsample_bytree=0.8,       # 每棵树使用80%的特征
    min_child_weight=5,         # 叶节点最小权重
    scale_pos_weight=len(y[y==0])/len(y[y==1]),  # 类别权重平衡
    random_state=42,
    eval_metric='auc',          # 评估指标
    use_label_encoder=False
)

# 交叉验证
xgb_scores = cross_val_score(
    xgb_clf, X_clr, y,
    cv=cv, scoring='roc_auc'
)

print(f"=== XGBoost 5-fold CV ===")
print(f"AUC: {xgb_scores.mean():.3f} ± {xgb_scores.std():.3f}")

# === 超参数调优 ===
param_grid = {
    'n_estimators': [100, 300, 500],       # 树数量
    'max_depth': [3, 4, 6],                # 最大深度
    'learning_rate': [0.01, 0.05, 0.1],    # 学习率
    'min_child_weight': [3, 5, 7]          # 最小子权重
}

grid_search = GridSearchCV(
    xgb_clf, param_grid,
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=0
)
grid_search.fit(X_clr, y)

print(f"\n最优参数: {grid_search.best_params_}")
print(f"最优AUC: {grid_search.best_score_:.3f}")

五、防止数据泄露（关键！）¶

5.1 常见数据泄露陷阱¶

# === 数据泄露的常见陷阱 ===

# ❌ 错误1：在交叉验证之前做特征选择
# 用全部数据选特征 → 测试集信息泄露到训练中
# 错误示例：
# from sklearn.feature_selection import SelectKBest
# X_selected = SelectKBest(k=50).fit_transform(X_clr, y)  # 错误！
# scores = cross_val_score(rf, X_selected, y, cv=cv)       # 虚高的AUC

# ✅ 正确做法：在交叉验证的每一折内部做特征选择
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

pipeline = Pipeline([
    ('feature_selection', SelectKBest(k=50, score_func=f_classif)),  # 选特征
    ('classifier', RandomForestClassifier(n_estimators=500,
                                          random_state=42,
                                          n_jobs=-1))
])

correct_scores = cross_val_score(pipeline, X_clr, y, cv=cv, scoring='roc_auc')
print(f"正确的AUC: {correct_scores.mean():.3f} ± {correct_scores.std():.3f}")

# ❌ 错误2：在交叉验证之前做数据标准化/归一化
# 错误示例：
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X_clr)  # 错误！用了全部数据fit

# ✅ 正确做法：放在Pipeline里
pipeline_scaled = Pipeline([
    ('scaler', StandardScaler()),          # 标准化
    ('classifier', RandomForestClassifier(n_estimators=500,
                                          random_state=42))
])

# ❌ 错误3：不做分层抽样
# 错误示例：KFold(n_splits=5)  # 可能某一折全是对照或全是病例
# ✅ 正确：StratifiedKFold(n_splits=5)  # 保持每折的类别比例一致

5.2 嵌套交叉验证（金标准）¶

# === 嵌套交叉验证：超参数调优+性能评估 ===
from sklearn.model_selection import cross_val_score

# 外层CV：评估泛化性能
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 内层CV（在GridSearchCV中）：选择超参数
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# GridSearchCV作为估计器
grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid={
        'n_estimators': [200, 500],
        'max_depth': [None, 10],
        'min_samples_leaf': [3, 5]
    },
    cv=inner_cv,         # 内层CV
    scoring='roc_auc',
    n_jobs=-1
)

# 外层CV评估
nested_scores = cross_val_score(
    grid, X_clr, y,
    cv=outer_cv,          # 外层CV
    scoring='roc_auc'
)

print(f"=== 嵌套CV结果（无偏估计）===")
print(f"AUC: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

六、模型比较框架¶

# === 多模型比较 ===
from sklearn.linear_model import LogisticRegression  # 逻辑回归
from sklearn.svm import SVC                          # 支持向量机
from sklearn.neural_network import MLPClassifier     # 神经网络

models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=500, random_state=42, n_jobs=-1,
        class_weight='balanced'),
    'XGBoost': xgb.XGBClassifier(
        n_estimators=300, max_depth=4, learning_rate=0.1,
        random_state=42, use_label_encoder=False, eval_metric='auc'),
    'Logistic Regression': LogisticRegression(
        max_iter=1000, class_weight='balanced', random_state=42),
    'SVM': SVC(
        kernel='rbf', probability=True,
        class_weight='balanced', random_state=42),
}

results = {}  # 存储结果
for name, model in models.items():
    scores = cross_val_score(model, X_clr, y, cv=cv, scoring='roc_auc')
    results[name] = scores
    print(f"{name}: AUC = {scores.mean():.3f} ± {scores.std():.3f}")

# 画箱线图比较
results_df = pd.DataFrame(results)
plt.figure(figsize=(10, 6))
results_df.boxplot()
plt.ylabel('AUC')
plt.title('Model Comparison (5-fold CV)')
plt.xticks(rotation=15)
plt.tight_layout()
plt.savefig('model_comparison.pdf', dpi=300)
plt.close()

七、常见报错与解决¶

报错信息	原因	解决方案
`AUC=1.0 (perfect)`	几乎肯定存在数据泄露	检查Pipeline和CV流程
`Class imbalance warning`	类别不平衡	用`class_weight='balanced'`或SMOTE
`Too few samples`	样本量太小（<50）	用留一法CV或增加样本
`NaN in features`	数据中有缺失值	用中位数填充或删除
`Overfitting`	训练AUC>>测试AUC	减少特征数/增加正则化
`Negative AUC`	标签编码反了	检查y的编码方向

八、面试高频题¶

Q1：微生物组数据做机器学习有哪些挑战？¶

答：五大挑战：(1) 组成性——相对丰度加起来=100%，一个菌上升必然导致其他菌下降，直接用相对丰度会引入假关联。解决：用CLR转换；(2) 高维低样本——通常几百个微生物特征但只有几十到几百个样本（"维度灾难"）。解决：特征选择+正则化；(3) 稀疏性——大量零值（很多菌在很多样本中检测不到）。解决：合理的伪计数策略；(4) 批次效应——不同研究的数据分布不同。解决：跨研究验证或留一研究CV；(5) 数据泄露风险——微生物组研究中常犯的错误（在CV外做特征选择）。解决：严格使用Pipeline。

Q2：为什么推荐CLR转换而不是直接用相对丰度？¶

答：微生物组数据是组成数据（compositional data），相对丰度受总和为1的约束。这导致：(1) 变量之间存在人为的负相关——一个菌的丰度升高，其他菌的相对丰度必然降低，但这不反映真实的生态关系；(2) 标准统计方法（如t检验、相关性、ML算法）假设变量是独立的，组成约束违反了这个假设。CLR转换（Central Log-Ratio）通过取对数再减去几何均值打破这个约束，使数据可以在欧氏空间中正常使用。实际项目中，CLR转换后的Random Forest通常比直接用相对丰度AUC高0.05-0.10。

Q3：如何防止微生物组机器学习中的数据泄露？¶

答：三个关键原则：(1) 一切数据依赖操作都放在交叉验证内部——特征选择、数据标准化、过采样（SMOTE）都必须只在训练集上fit，不能看到测试集。用sklearn的Pipeline可以自动保证这一点；(2) 使用分层交叉验证——StratifiedKFold保证每折的类别比例一致；(3) 跨研究验证——如果数据来自多个研究，用Leave-One-Study-Out CV是最严格的验证方式，能测试模型的泛化能力。常见泄露信号：AUC>0.95通常可疑（微生物组分类AUC一般在0.70-0.90之间是合理的）。

Q4：Random Forest和XGBoost在微生物组分析中怎么选？¶

答： Random Forest是微生物组ML的"默认选择"——优点是：(1) 对高维稀疏数据鲁棒；(2) 不容易过拟合（bagging机制）；(3) 内置特征重要性评估；(4) 不需要太多超参数调优。XGBoost通常能获得略高的AUC（提升0.01-0.03），但：(1) 对超参数更敏感，需要仔细调优；(2) 更容易过拟合小样本数据；(3) 训练时间更长。实际建议：先用Random Forest作为基线，如果需要挤压最后一点性能再试XGBoost。Logistic Regression作为可解释性基线也值得尝试。最终应该报告多个模型的结果比较。

参考资料：Microbiome ML综述: Topçuoğlu et al., mBio 2020 | OMA ML chapter: bioconductor.org | sklearn: scikit-learn.org | XGBoost: Chen & Guestrin, KDD 2016 | Wirbel et al., Nat Med 2019 | 数据泄露: Pasolli et al., PLoS Comp Biol 2016