Scikit-learn Machine Learning

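One-line summary: Scikit-learn is one of the most mature machine learning libraries for Python. It provides a full toolkit for classification, regression, clustering, dimensionality reduction, feature engineering, and model evaluation behind a consistent API with thorough documentation, which makes it a common first choice for learning machine learning. Recent releases, including 1.8.0, offer experimental Array API support, which lets selected estimators and metrics run on GPU arrays.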

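Core concepts

Concept	Plain-language explanation
Estimator	The uniform interface shared by all models (fit/predict)
fit()	Training: the model learns its parameters from the data
predict()	Prediction: a fitted model produces outputs for new data
Pipeline	Chains preprocessing steps and a model into a single estimator
Cross-validation	Splits the data into K folds and validates on each fold in turn (gives a more reliable estimate of generalization and helps detect overfitting)
Grid Search	Tries every combination in a parameter grid and keeps the best-scoring one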
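
The uniform interface in the table can be illustrated with a minimal sketch. It assumes the bundled iris dataset and two arbitrarily chosen classifiers (neither appears in the original text); the point is only that the same `fit` / `score` calls work for both.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: the iris dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Two unrelated algorithms, driven through the same estimator interface
estimators = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

test_accuracy = {}
for name, est in estimators.items():
    est.fit(X_train, y_train)                        # identical call for every model
    test_accuracy[name] = est.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Swapping in another classifier should only require changing the dictionary entry, not the loop.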

Installation

pip install scikit-learn                              # install
python -c "import sklearn; print(sklearn.__version__)" # verify (prints the installed version, e.g. 1.8.0)

Basic usage

from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.ensemble import RandomForestClassifier   # random forest
from sklearn.metrics import accuracy_score, classification_report  # evaluation

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv("data.csv")                          # read the dataset
X = df.drop('target', axis=1)                         # feature matrix
y = df['target']                                      # label vector

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # 80% train, 20% test, class ratios preserved
)

# Train the model
model = RandomForestClassifier(
    n_estimators=100,                                 # 100 trees
    max_depth=10,                                     # maximum tree depth of 10
    random_state=42                                   # random seed for reproducibility
)
model.fit(X_train, y_train)                           # train

# Predict and evaluate
y_pred = model.predict(X_test)                        # predict on the test set
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")  # accuracy
print(classification_report(y_test, y_pred))          # per-class precision, recall, F1

Data preprocessing

from sklearn.preprocessing import StandardScaler, LabelEncoder  # preprocessing
from sklearn.impute import SimpleImputer              # missing-value imputation

# Standardization
scaler = StandardScaler()                             # scales each feature to zero mean, unit variance
X_train_scaled = scaler.fit_transform(X_train)        # training set: fit + transform
X_test_scaled = scaler.transform(X_test)              # test set: transform only

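# Missing-value imputation (fit on the training set only to avoid data leakage)
imputer = SimpleImputer(strategy='median')            # fill with the per-column median
X_train_filled = imputer.fit_transform(X_train)       # learn medians from the training data
X_test_filled = imputer.transform(X_test)             # reuse the same medians on the test data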

# Label encoding
le = LabelEncoder()                                   # label encoder (for the target y)
y_encoded = le.fit_transform(y)                       # text labels -> integers
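
The rule behind "fit on the training set, only transform the test set" can be checked on a tiny array. The sketch below uses made-up numbers (not from the original) and shows that the statistics learned from the training rows are the ones applied to the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values (np.nan)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[np.nan, 20.0], [5.0, np.nan]])

# Medians are learned from the training rows only
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)        # reuses the training medians

# Mean and standard deviation are also learned from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_filled)
X_test_scaled = scaler.transform(X_test_filled)  # reuses the training mean/std

print(imputer.statistics_)  # per-column medians of the training data
```

Here the training medians are 2.5 and 30.0, so the missing test values are filled with those numbers rather than with anything computed from the test set.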

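Pipeline

from sklearn.pipeline import Pipeline                 # pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Build a pipeline: preprocessing -> model
pipe = Pipeline([
    ('scaler', StandardScaler()),                     # step 1: standardize
    ('clf', RandomForestClassifier(n_estimators=100)) # step 2: classify
])

pipe.fit(X_train, y_train)                            # fits every step in one call
y_pred = pipe.predict(X_test)                         # transforms and predicts in one call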
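
A self-contained variant may make it clearer what `pipe.fit` does internally. This sketch assumes the bundled breast cancer dataset and swaps in `LogisticRegression` for speed; the step names (`imputer`, `scaler`, `clf`) are arbitrary labels chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # no-op here: this dataset has no NaN
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)           # each transformer is fitted on X_train only
test_acc = pipe.score(X_test, y_test)

# Fitted steps remain accessible by name
scaler_mean = pipe.named_steps["scaler"].mean_
print(f"test accuracy: {test_acc:.3f}")
```

Because the scaler inside the pipeline only ever sees `X_train` during `fit`, its stored mean matches the training-set column means.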

Cross-validation

from sklearn.model_selection import cross_val_score   # cross-validation

scores = cross_val_score(
    model, X, y, cv=5, scoring='accuracy'             # 5-fold cross-validation
)
print(f"CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
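
To show what `cross_val_score` does under the hood, the following sketch writes the K-fold loop by hand and compares it with the library call. It assumes the iris dataset and a `LogisticRegression` model for illustration; for classifiers, an integer `cv` uses stratified folds, so `StratifiedKFold` is passed explicitly here to make both paths use the same splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)  # deterministic splits (no shuffling)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on K-1 folds
    fold_pred = fold_model.predict(X[val_idx])  # validate on the held-out fold
    manual_scores.append(accuracy_score(y[val_idx], fold_pred))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(manual_scores, library_scores)
```

Each sample is used for validation exactly once, which is why the mean of the fold scores is a steadier estimate than a single train/test split.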

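Advanced usage

Hyperparameter tuning with grid search

from sklearn.model_selection import GridSearchCV      # grid search

param_grid = {
    'n_estimators': [50, 100, 200],                   # number of trees
    'max_depth': [5, 10, 20, None],                   # maximum depth (None = unlimited)
    'min_samples_split': [2, 5, 10]                   # minimum samples required to split a node
}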

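grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1   # search in parallel on all cores
)
grid_search.fit(X_train, y_train)                     # 3 x 4 x 3 = 36 combinations, 5 folds each

print(f"Best parameters: {grid_search.best_params_}")  # best parameter combination
print(f"Best score: {grid_search.best_score_:.4f}")    # mean CV score of the best combination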
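
Grid search also works on a whole `Pipeline`, so preprocessing is refitted inside every fold. The sketch below is an illustration under assumptions: it uses the breast cancer dataset, replaces the random forest with `LogisticRegression` to keep the run short, and tunes a small hypothetical grid over its `C` parameter. Pipeline parameters are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "clf__C" targets the C parameter of the step named "clf"
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)          # refits the best pipeline on all of X_train

best_C = search.best_params_["clf__C"]
test_acc = search.score(X_test, y_test)
print(f"best C: {best_C}, test accuracy: {test_acc:.3f}")
```

The test set is touched only once, after the search, so `test_acc` is not biased by the tuning itself.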

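Feature importance

import matplotlib.pyplot as plt

# Random forest feature importances
importances = model.feature_importances_              # importance scores (sum to 1)
top_k = min(20, len(importances))                     # at most 20; fewer if X has fewer columns
indices = np.argsort(importances)[::-1][:top_k]       # indices of the top features, descending

plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices])
plt.xticks(range(top_k), X.columns[indices], rotation=45, ha='right')
plt.title('Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)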
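
For a quick look without plotting, the same ranking can be printed as text. This is a minimal sketch assuming the iris dataset, whose feature names come from the dataset object rather than from a DataFrame.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
feature_names = data.feature_names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

importances = forest.feature_importances_  # impurity-based scores, normalized to sum to 1
order = np.argsort(importances)[::-1]      # most important first
ranked = [(feature_names[i], float(importances[i])) for i in order]

for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

These scores are computed from the training data, so they describe what the fitted forest relied on rather than a guaranteed property of the features.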

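Common errors

Error message	Cause	Fix
`ValueError: Input X contains NaN.`	The data contains missing values	Impute first with `SimpleImputer`
`NotFittedError`	The model has not been trained	Call `.fit()` first
`ValueError: X has N features, but <Estimator> is expecting M features as input.`	Feature count mismatch	Use the same features for training and prediction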
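
The three errors in the table can be reproduced on a toy array. The sketch below uses made-up data and `LogisticRegression` as an example of an estimator that rejects NaN; it records which errors were raised instead of letting them stop the script.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 4 samples, 2 features
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])

caught = []
model = LogisticRegression()

# 1. Predicting before fit() raises NotFittedError
try:
    model.predict(X)
except NotFittedError:
    caught.append("not fitted")

# 2. Predicting with a different number of features raises ValueError
model.fit(X, y)
try:
    model.predict(X[:, :1])  # only 1 of the 2 training features
except ValueError:
    caught.append("feature mismatch")

# 3. Fitting on data that contains NaN raises ValueError for this estimator
X_nan = X.copy()
X_nan[0, 0] = np.nan
try:
    LogisticRegression().fit(X_nan, y)
except ValueError:
    caught.append("nan input")

print(caught)
```

Some estimators handle NaN natively, so the third error depends on the model in use.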

Cheat sheet

# === Common models ===
# Classification: RandomForestClassifier, SVC, LogisticRegression, KNeighborsClassifier
# Regression: RandomForestRegressor, SVR, LinearRegression, Ridge, Lasso
# Clustering: KMeans, DBSCAN, AgglomerativeClustering
# Dimensionality reduction: PCA, TSNE, UMAP (separate install: umap-learn)

# === Core workflow ===
# 1. train_test_split()            → split the data
# 2. model.fit(X_train, y_train)   → train
# 3. model.predict(X_test)         → predict
# 4. accuracy_score()              → evaluate
# 5. cross_val_score()             → cross-validate

# === Evaluation metrics ===
# accuracy_score()   - accuracy
# precision_score()  - precision
# recall_score()     - recall
# f1_score()         - F1 score
# roc_auc_score()    - area under the ROC curve (needs scores or probabilities)
# confusion_matrix() - confusion matrix
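
The metrics listed above can be worked through by hand on six made-up predictions. In this sketch the labels and scores are illustrative values only; predictions are obtained by thresholding the scores at 0.5.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth and predicted scores for a binary problem
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6]
y_pred = [int(s >= 0.5) for s in y_score]  # -> [1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)      # rows = true class, columns = predicted class
acc = accuracy_score(y_true, y_pred)       # (TP + TN) / total = 4 / 6
prec = precision_score(y_true, y_pred)     # TP / (TP + FP) = 2 / 3
rec = recall_score(y_true, y_pred)         # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall = 2 / 3
auc = roc_auc_score(y_true, y_score)       # uses the scores, not the thresholded labels: 8 / 9

print(cm)
print(f"acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f} auc={auc:.3f}")
```

The AUC of 8/9 follows from counting pairs: of the 9 (positive, negative) pairs, 8 have the positive sample scored higher.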

Reference: Scikit-learn documentation | Latest version 1.8.0 (2025.12) | Updated 2026