659 微生物组时间序列预测¶

一句话概述：利用ARIMA、LSTM、GNN等模型预测微生物组随时间的动态变化，为疾病预警和干预时机选择提供依据。

核心知识点速查表¶

知识点	关键内容
ARIMA/ARIMAX	经典时序模型，捕捉线性趋势和季节性
LSTM	深度学习，捕捉长期非线性依赖
GNN（2025新）	图神经网络建模物种互作+时序预测
MicroSTNet	时空图卷积+LSTM混合架构
扩散模型	用于缺失时间点的插补
关键挑战	时间点稀疏、个体差异大、非稳态数据

一、为什么要做微生物组时间序列预测？（白话解释）¶

打个比方：就像天气预报预测未来几天的天气一样，微生物组时间序列预测是"菌群预报"——根据过去的菌群变化趋势，预测未来的菌群状态。

应用场景： - 预测疾病发作前的菌群信号（早期预警） - 评估干预（如益生菌）的最佳时机 - 理解菌群恢复动力学（如抗生素后）

二、经典方法：ARIMA/ARIMAX¶

# ARIMA模型预测微生物丰度时间序列
import pandas as pd  # 数据处理
import numpy as np  # 数值计算
from statsmodels.tsa.arima.model import ARIMA  # ARIMA模型
from statsmodels.tsa.stattools import adfuller  # ADF平稳性检验
import matplotlib.pyplot as plt  # 绑图

# 读取某物种的纵向丰度数据
ts_data = pd.read_csv("bacteroides_timeseries.csv",
                       parse_dates=["date"],  # 解析日期列
                       index_col="date")  # 设置日期为索引

# 1. 平稳性检验（ARIMA要求数据平稳）
result = adfuller(ts_data["abundance"])  # ADF检验
print(f"ADF统计量: {result[0]:.4f}")  # ADF值
print(f"p值: {result[1]:.4f}")  # p<0.05则平稳
if result[1] > 0.05:
    print("数据不平稳，需要差分")
    ts_data["abundance_diff"] = ts_data["abundance"].diff().dropna()  # 一阶差分

# 2. 拟合ARIMA模型
model = ARIMA(ts_data["abundance"],
              order=(2, 1, 1))  # (p=2, d=1, q=1)：2阶自回归，1阶差分，1阶移动平均
fitted = model.fit()  # 拟合模型
print(fitted.summary())  # 输出模型摘要

# 3. 预测未来时间点
forecast = fitted.forecast(steps=10)  # 预测未来10个时间点
print("未来10个时间点预测值:")
print(forecast)

# 4. 可视化
fig, ax = plt.subplots(figsize=(12, 4))
ts_data["abundance"].plot(ax=ax, label="观测值")  # 实际值
forecast.plot(ax=ax, label="预测值", color="red")  # 预测值
ax.set_xlabel("时间")
ax.set_ylabel("丰度")
ax.legend()
plt.tight_layout()
plt.savefig("arima_forecast.png", dpi=150)  # 保存图片

带外部变量的ARIMAX¶

# ARIMAX：考虑外部因素（如饮食、用药）对菌群的影响
from statsmodels.tsa.statespace.sarimax import SARIMAX  # SARIMAX模型

# 外部变量：饮食纤维摄入量和抗生素使用
exog_vars = pd.DataFrame({
    "fiber_intake": fiber_data,  # 膳食纤维摄入（连续）
    "antibiotics": antibiotic_data  # 抗生素使用（0/1）
}, index=ts_data.index)

# 拟合ARIMAX
model_x = SARIMAX(ts_data["abundance"],
                   exog=exog_vars,  # 外部变量
                   order=(2, 1, 1),  # ARIMA参数
                   seasonal_order=(1, 0, 0, 7))  # 周季节性（如每周周期）
fitted_x = model_x.fit(disp=False)  # 静默拟合
print(f"AIC: {fitted_x.aic:.2f}")  # 模型选择指标

三、深度学习方法：LSTM¶

# LSTM预测微生物组时间序列
import torch  # PyTorch框架
import torch.nn as nn  # 神经网络模块
from torch.utils.data import Dataset, DataLoader  # 数据加载

class TimeSeriesDataset(Dataset):
    """微生物组时间序列数据集"""
    def __init__(self, data, window_size=10):
        self.data = torch.FloatTensor(data)  # 转为张量
        self.window_size = window_size  # 滑动窗口大小

    def __len__(self):
        return len(self.data) - self.window_size  # 可生成的样本数

    def __getitem__(self, idx):
        x = self.data[idx:idx+self.window_size]  # 输入：前window_size个时间点
        y = self.data[idx+self.window_size]  # 目标：下一个时间点
        return x, y

class MicrobiomeLSTM(nn.Module):
    """LSTM微生物组预测模型"""
    def __init__(self, n_taxa, hidden_size=64, n_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_taxa, hidden_size, n_layers,
                           batch_first=True, dropout=0.2)  # LSTM层
        self.fc = nn.Linear(hidden_size, n_taxa)  # 输出层

    def forward(self, x):
        out, _ = self.lstm(x)  # LSTM处理序列
        return self.fc(out[:, -1, :])  # 取最后时间步预测

# 训练流程
n_taxa = 50  # 物种数
model = MicrobiomeLSTM(n_taxa)  # 初始化模型
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam优化器
criterion = nn.MSELoss()  # 均方误差损失

# 训练循环
for epoch in range(100):  # 训练100轮
    model.train()
    total_loss = 0
    for X_batch, y_batch in train_loader:  # 遍历批次
        optimizer.zero_grad()  # 梯度清零
        pred = model(X_batch)  # 前向传播
        loss = criterion(pred, y_batch)  # 计算损失
        loss.backward()  # 反向传播
        optimizer.step()  # 更新参数
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.6f}")

四、2025前沿：GNN时序预测¶

2025年Nature Communications发表的研究表明，GNN可以准确预测未来2-4个月的物种动态：

# GNN微生物组时序预测（概念代码）
import torch_geometric  # PyTorch Geometric图学习库

class MicrobiomeGNN(nn.Module):
    """基于GNN的微生物组时序预测"""
    def __init__(self, n_taxa, hidden_dim=64):
        super().__init__()
        from torch_geometric.nn import GCNConv  # 图卷积层
        self.conv1 = GCNConv(1, hidden_dim)  # 第一层图卷积
        self.conv2 = GCNConv(hidden_dim, hidden_dim)  # 第二层图卷积
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # 时序编码
        self.predictor = nn.Linear(hidden_dim, 1)  # 预测层

    def forward(self, x_seq, edge_index):
        """
        x_seq: [batch, time_steps, n_taxa] 历史丰度
        edge_index: [2, n_edges] 物种互作边
        """
        temporal_features = []
        for t in range(x_seq.shape[1]):  # 遍历每个时间步
            x_t = x_seq[:, t, :].unsqueeze(-1)  # 当前时间步的丰度
            # 图卷积聚合邻居信息
            h = torch.relu(self.conv1(x_t.squeeze(0), edge_index))
            h = self.conv2(h, edge_index)
            temporal_features.append(h.unsqueeze(1))

        # 时序编码
        h_seq = torch.cat(temporal_features, dim=1)  # 拼接所有时间步
        out, _ = self.temporal(h_seq)  # GRU编码时序信息
        return self.predictor(out[:, -1, :])  # 预测下一时间步

五、MicroSTNet（2025年最新）¶

2025年PMC发表的MicroSTNet结合时空图卷积和LSTM：

模型	短期预测	长期预测	物种互作
ARIMA	中等	差	不考虑
LSTM	差	中等	不考虑
GNN	好	好	考虑
MicroSTNet	最好	好	考虑

常见报错与解决¶

报错	原因	解决方案
ARIMA拟合失败"非平稳"	数据需要差分	增加d参数或先做对数转换
LSTM loss不收敛	学习率太大或数据未标准化	降低lr到1e-4，对数据做MinMax缩放
预测值全部为均值	模型学到了"偷懒解"	检查数据是否有足够变异，增加模型复杂度
时间点不够（<20）	微生物组纵向数据常稀疏	考虑扩散模型插补或换用更简单的模型
组成性数据直接建模	相对丰度sum-to-one约束	先做CLR转换再建模

速查表¶

# 方法选择决策树
时间点 < 20 → ARIMA (简单稳健)
时间点 20-50 → LSTM / ARIMAX
时间点 > 50 + 有互作信息 → GNN / MicroSTNet
有缺失时间点 → 扩散模型插补后再预测

# 关键参数
ARIMA: (p,d,q) 用AIC/BIC选择
LSTM: window_size=5-20, hidden=64-256
GNN: 图构建阈值（相关系数>0.3/0.5）

# Python工具
statsmodels: ARIMA/SARIMAX
PyTorch: LSTM/GNN自定义模型
MIMIC (2025): 集成gLV/GP/VAR多种模型的Python包

面试高频问题¶

Q1：微生物组时间序列预测有什么特殊挑战？ A：(1) 时间点稀疏（通常<20个采样点）；(2) 数据组成性（相对丰度sum-to-one）；(3) 高维度（数百物种）；(4) 零膨胀；(5) 个体间差异大。

Q2：ARIMA和LSTM分别适合什么场景？ A：ARIMA适合线性趋势和季节性模式，对时间点要求少（>10即可）；LSTM适合非线性长期依赖，但需要更多时间点（>30-50）。2024年研究发现标准SARIMA对菌群数据不够，需要用Dynamic ARIMAX加FFT季节分解。

Q3：如何用GNN做微生物组时序预测？ A：(1) 构建物种互作图（基于相关性/共现）；(2) 每个时间步用GCN聚合邻居信息；(3) 时序维度用GRU/LSTM编码；(4) 预测未来时间步的丰度。2025年Nature Communications研究表明可预测2-4个月ahead。

Q4：2025年有什么新的时序预测工具？ A：(1) MIMIC（Python包，集成gLV/GP/VAR/ML方法）；(2) MicroSTNet（时空图卷积+LSTM）；(3) 扩散模型（条件分数扩散模型+系统发育卷积层用于缺失数据插补）。