Gradient Boosting Trees in Bioinformatics

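In one sentence: gradient boosting trees (GBT) add weak learners one at a time, each fitted to correct the errors of the ensemble built so far. XGBoost and LightGBM are the best-known implementations. They perform well in classification, prognosis prediction, and feature selection on bioinformatics data, and with careful tuning they are often more accurate than random forests.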

Key Concepts at a Glance

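Concept	Description
Boosting	Sequential ensemble learning; each new tree corrects the errors of the trees before it (in plain terms: repeated tutoring to raise the grade)
XGBoost	eXtreme Gradient Boosting, one of the most widely used GBT implementations
LightGBM	Microsoft's lightweight GBT implementation; typically faster on large datasets
Learning rate	Weight of each tree's contribution; smaller values are more conservative and usually need more trees
Overfitting	GBT overfits more easily than random forests (RF), so tuning is required
SHAP values	Interpretability tool that attributes each prediction to per-feature contributions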
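
To make the "each new tree corrects the previous errors" idea and the role of the learning rate concrete, here is a minimal from-scratch sketch. It is not how XGBoost is implemented: it assumes squared-error loss, one-split decision stumps as the weak learners, and a toy one-feature dataset (think of `x` as the expression level of a single hypothetical gene). All function names are illustrative.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the one-threshold split on x that minimizes squared error on the residuals."""
    best = None
    for t in np.unique(x)[:-1]:                 # skip the max so the right side is never empty
        left = x <= t
        left_mean = residual[left].mean()
        right_mean = residual[~left].mean()
        sse = ((residual[left] - left_mean) ** 2).sum() + ((residual[~left] - right_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]                             # (threshold, left value, right value)

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def boost(x, y, n_rounds, learning_rate):
    """Gradient boosting for squared loss; returns the training MSE after each round."""
    pred = np.full_like(y, y.mean(), dtype=float)   # start from a constant prediction
    mse_history = []
    for _ in range(n_rounds):
        residual = y - pred                     # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(x, residual)          # the new weak learner fits the current errors
        pred += learning_rate * predict_stump(stump, x)
        mse_history.append(float(np.mean((y - pred) ** 2)))
    return mse_history

# Toy data: a noisy nonlinear signal
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

mse_fast = boost(x, y, n_rounds=50, learning_rate=0.5)
mse_slow = boost(x, y, n_rounds=50, learning_rate=0.05)
print(f"training MSE after 50 rounds: lr=0.5 -> {mse_fast[-1]:.4f}, lr=0.05 -> {mse_slow[-1]:.4f}")
```

With the same number of rounds, the smaller learning rate leaves a larger training error, which is why a low learning rate is usually paired with more trees. Training error alone says nothing about generalization, so in practice the number of rounds is chosen on held-out data.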

1. XGBoost in Practice

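# === XGBoost classification (Python) ===
# Assumes X, y, X_train, X_test, y_train, y_test (labels coded 0/1) and gene_names already exist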
import xgboost as xgb
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score
import numpy as np

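# Baseline model
model = xgb.XGBClassifier(
    n_estimators=200,              # number of trees
    max_depth=6,                   # maximum tree depth
    learning_rate=0.1,             # shrinkage applied to each tree
    subsample=0.8,                 # fraction of rows sampled per tree
    colsample_bytree=0.8,          # fraction of columns sampled per tree
    random_state=42,               # random seed
    eval_metric='auc'              # evaluation metric
)
model.fit(X_train, y_train)        # train

# AUC on the held-out test set
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")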

# 10-fold cross-validation (the model is cloned and refit on each fold)
scores = cross_val_score(model, X, y, cv=10, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# Feature importance: print the top 20 features
importance = model.feature_importances_
top_idx = np.argsort(importance)[::-1][:20]
for i, idx in enumerate(top_idx):
    print(f"{i+1}. {gene_names[idx]}: {importance[idx]:.4f}")

# SHAP interpretability analysis
import shap
explainer = shap.TreeExplainer(model)        # build the explainer
shap_values = explainer.shap_values(X_test)  # compute SHAP values
shap.summary_plot(shap_values, X_test, feature_names=gene_names)  # summary plot

# === XGBoost classification (R) ===
library(xgboost)

# Build DMatrix objects (XGBoost's own data format); labels must be numeric 0/1
dtrain <- xgb.DMatrix(data=as.matrix(X_train), label=y_train)
dtest <- xgb.DMatrix(data=as.matrix(X_test), label=y_test)

# Set parameters
params <- list(
  objective = "binary:logistic",     # binary classification
  eval_metric = "auc",               # evaluate with AUC
  max_depth = 6,                     # maximum tree depth
  eta = 0.1,                         # learning rate
  subsample = 0.8,                   # row sampling ratio
  colsample_bytree = 0.8             # column sampling ratio
)

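# Cross-validation to choose the number of boosting rounds
cv_result <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 500,                     # maximum number of rounds
  nfold = 10,                        # 10-fold CV
  early_stopping_rounds = 20,        # stop if AUC has not improved for 20 rounds
  verbose = 1
)
# Best round found by early stopping (where this field lives can vary across xgboost R versions)
best_nrounds <- cv_result$best_iteration

# Train the final model
model <- xgb.train(params, dtrain, nrounds=best_nrounds)

# Feature importance
importance <- xgb.importance(model=model)
xgb.plot.importance(importance, top_n=20)    # plot the top 20 features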

2. Frequently Asked Interview Questions

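Q1: Random forest vs. gradient boosting trees?

	Random forest (Bagging)	Gradient boosting trees (Boosting)
Training	Parallel (trees built independently)	Sequential (each tree corrects the previous ones)
Overfitting	Less prone	More prone (needs early stopping and tuning)
Accuracy	Often slightly lower	Often higher when well tuned
Training speed	Faster (trees can be built in parallel)	Slower (trees are built one after another, though split finding within a tree can be parallelized)
Tuning difficulty	Simple	More involved
In plain terms	Each tree votes independently	Each one teaches the next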

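Q2: Why is XGBoost so popular in Kaggle competitions?

Regularization against overfitting (L1 + L2 penalties on leaf weights)
More precise optimization using second-order derivatives (gradient and Hessian of the loss)
Built-in handling of missing values (a default split direction is learned for them)
Parallelized computation (split finding within each tree, since the trees themselves are built sequentially)
Built-in cross-validation and early stopping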
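
The second-order and regularization points can be illustrated with the closed-form leaf weight from the XGBoost objective, w* = -G / (H + λ), where G and H are the sums of the first and second derivatives of the loss over the samples in a leaf. The sketch below is a simplification: it assumes log loss, a single leaf, and L2 regularization only, and it ignores the L1 term, the split penalty, and the learning rate. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaf_weight(y, margin, reg_lambda):
    """Optimal leaf weight for log loss under a second-order approximation."""
    p = sigmoid(margin)          # current predicted probability
    g = p - y                    # first derivative of log loss w.r.t. the margin
    h = p * (1.0 - p)            # second derivative (Hessian)
    return float(-g.sum() / (h.sum() + reg_lambda))

# Four samples falling in one leaf: three positives, one negative, current margin 0
y = np.array([1.0, 1.0, 1.0, 0.0])
margin = np.zeros(4)

w_no_reg = leaf_weight(y, margin, reg_lambda=0.0)
w_reg = leaf_weight(y, margin, reg_lambda=1.0)
print(w_no_reg, w_reg)  # 1.0 0.5
```

The L2 penalty shrinks the leaf weight toward zero, which is one way regularization limits how strongly any single tree can move the prediction.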

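Common Problems and Fixes

Problem	Cause	Fix
Overfitting (train >> test)	Trees too deep or too many	Reduce max_depth, strengthen regularization, use early stopping
Training too slow	Large dataset, many parameters	Use LightGBM or reduce the number of features
Class imbalance	Positive and negative samples heavily skewed	Set scale_pos_weight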
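
For the class-imbalance row, a common starting value for `scale_pos_weight` is the ratio of negative to positive training samples. A minimal sketch, assuming a hypothetical label vector with 90 controls and 10 cases:

```python
import numpy as np

# Hypothetical training labels: 90 negatives (0) and 10 positives (1)
y_train = np.array([0] * 90 + [1] * 10)

n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())
scale_pos_weight = n_neg / n_pos      # up-weights the positive class in the loss
print(scale_pos_weight)  # 9.0
```

The value can then be passed as `scale_pos_weight=scale_pos_weight` when constructing `xgb.XGBClassifier`. It is best treated as a starting point and tuned together with the other parameters.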

Cheat Sheet

# === XGBoost quick reference ===
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

# SHAP explanation
import shap
shap.summary_plot(shap.TreeExplainer(model).shap_values(X_test), X_test)

# Key parameters to tune: n_estimators (number of trees) | max_depth (depth) | learning_rate (learning rate)
# Against overfitting: subsample<1 | colsample_bytree<1 | reg_alpha/reg_lambda>0