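HuggingFace Transformers: Getting Started

In One Sentence

HuggingFace Transformers is arguably the AI world's largest "model store + toolbox": with a few lines of Python you can call state-of-the-art pretrained models for NLP, computer vision, protein analysis, and more, covering tasks such as text classification, translation, generation, and image recognition.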

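Why Learn It

HuggingFace's Place in the AI Ecosystem

GitHub stars: 160,000+, with 33,000+ forks, making it one of the most popular open-source AI projects in the world
In plain terms: if GitHub is the "Taobao of code", HuggingFace is the "Taobao of AI models". It hosts over a million models that other people have already trained, and you can download and use them directly
Many major AI releases upload their models to HuggingFace at publication time (for example Meta's LLaMA, Google's BERT/Gemma, and DeepSeek)
Interview bonus: being able to say "I loaded an ESM protein model with HuggingFace for feature extraction" is far more convincing than "I know about deep learning"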

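What You Can Do with It in Bioinformatics

Application	Concrete example
Protein analysis	Extract protein sequence features with an ESM model and predict function
Literature mining	Summarize papers and extract information with NLP models
Genome annotation	Predict gene function with DNA language models (such as DNABERT)
Report generation	Generate analysis reports automatically with an LLM
Data analysis support	Retrieve similar genes/proteins using text embeddings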

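Core Concepts in Detail

1. What Is HuggingFace

In plain terms: the GitHub of AI

HuggingFace (🤗, named after the hugging face emoji) is an AI company, but more importantly it runs an open platform
Just as GitHub lets programmers around the world share code, HuggingFace lets AI researchers around the world share trained models
You don't need to spend millions training a model yourself; you can download one that someone else has trained and use it directly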

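2. What Is the Transformers Library

A Python library from HuggingFace that acts as the "bridge" between you and the models on the platform
In plain terms: the HuggingFace website is the "store shelf", and the transformers library is the "shopping cart + instruction manual". It downloads models, loads them, and runs predictions with them
It has supported the three major deep learning frameworks (PyTorch, TensorFlow, and JAX), although recent releases concentrate on PyTorch, which is what we mainly use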

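3. What Is the Model Hub

In plain terms: the model store

URL: huggingface.co/models
It hosts 1,000,000+ pretrained models covering text, images, audio, multimodal data, biological sequences, and more
Every model has a unique ID, such as bert-base-chinese or facebook/esm2_t6_8M_UR50D
Pass that ID in your code to download and use the model directly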

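4. What Is a Tokenizer

In plain terms: it cuts text into small pieces the model can digest

A model does not understand Chinese characters or English words; it only understands numbers
The tokenizer's job is: text → split into small pieces (tokens) → map each piece to a numeric ID
For example: "我爱生信" ("I love bioinformatics") → ["我", "爱", "生", "信"] → [2769, 4263, 4495, 928] (the exact IDs depend on the model's vocabulary)
Every model ships with its own matching tokenizer (the vocabularies differ), so you must load the tokenizer that belongs to the model you are using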
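
The text → tokens → IDs round trip can be sketched without downloading anything. The block below is a toy character-level tokenizer, not the HuggingFace implementation: `TOY_VOCAB` and the helper names are hypothetical, and real tokenizers use much larger vocabularies and subword splitting.

```python
# Toy character-level tokenizer. The vocabulary below is hypothetical and
# exists only to illustrate the text -> tokens -> IDs round trip.
TOY_VOCAB = {"[UNK]": 0, "我": 1, "爱": 2, "生": 3, "信": 4}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def tokenize(text: str) -> list[str]:
    """Split text into one token per character."""
    return list(text)


def encode(text: str) -> list[int]:
    """Map each token to its ID, falling back to [UNK] for unknown tokens."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokenize(text)]


def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and join them into text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


tokens = tokenize("我爱生信")
ids = encode("我爱生信")
print(tokens)            # ['我', '爱', '生', '信']
print(ids)               # [1, 2, 3, 4]
print(decode(ids))       # 我爱生信
print(encode("我爱AI"))  # [1, 2, 0, 0]
```

The last line shows why the tokenizer has to match the model: characters missing from the vocabulary collapse into `[UNK]`, and the model loses that information.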

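5. What Is a Pipeline

In plain terms: an assembly line that finishes a task in one step

A pipeline packages the whole sequence "load model → preprocess data → run inference → postprocess results" into a single line of code
You only say "I want sentiment analysis", and the pipeline picks a default model, loads it, runs it, and returns the result
It is well suited to getting started and rapid prototyping; once you are comfortable, you can take the steps apart and control them by hand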
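
To make the "preprocess → infer → postprocess" chain concrete, here is a deliberately trivial stand-in written in plain Python. It is not how the library is implemented: the word lists play the role of model weights, and every function name is hypothetical.

```python
# Toy stand-in for a pipeline: every stage is a deliberately trivial placeholder.
POSITIVE_WORDS = {"love", "amazing", "great"}   # hypothetical "model weights"
NEGATIVE_WORDS = {"failed", "bad", "terrible"}


def preprocess(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


def infer(words: list[str]) -> int:
    """Return a raw score: positive hits minus negative hits."""
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)


def postprocess(raw_score: int) -> dict:
    """Turn the raw score into a labeled dict, similar in spirit to pipeline output."""
    return {"label": "POSITIVE" if raw_score >= 0 else "NEGATIVE", "raw_score": raw_score}


def toy_pipeline(text: str) -> dict:
    """Chain the three stages behind a single call."""
    return postprocess(infer(preprocess(text)))


print(toy_pipeline("I love bioinformatics!"))
print(toy_pipeline("The experiment failed again."))
```

A real pipeline replaces `preprocess` with a tokenizer, `infer` with a neural network forward pass, and `postprocess` with softmax plus label lookup, but the caller still sees one function call.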

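6. Pretrained Models vs. Fine-Tuning

Aspect	Pretrained model	Fine-tuning
In plain terms	A generalist with factory settings	A specialist retrained for your task
Analogy	A general practitioner fresh out of medical school	A cardiologist who completed further training in cardiology
Data volume	Massive general-purpose data (hundreds of GB of text)	Your own small dataset (hundreds to thousands of examples)
Usage	Load directly with `from_pretrained()`	Continue training on your own data with `Trainer`
When to use	General tasks, quick validation	When you need higher accuracy in a specific domain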

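7. What Is the Datasets Library

A dataset management library from HuggingFace, a sibling of transformers
The Hub hosts 200,000+ datasets; one line of code downloads a dataset, and many come already split into training and test sets
In plain terms: if the model is the "chef", Datasets is the "grocery store for ingredients"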

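Installation and Configuration

Installation

# Create a dedicated conda environment (recommended)
conda create -n hf python=3.10 -y
conda activate hf

# Install the core libraries
pip install transformers    # Core: model loading and inference
pip install torch           # PyTorch deep learning framework (the backend for transformers)
pip install datasets        # Dataset management
pip install accelerate      # Faster inference and training
pip install sentencepiece   # Tokenizer backend required by some models

# Optional: if you want sentence-transformers for text embeddings
pip install sentence-transformers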

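Configuring a Mirror in China (Fixing Slow Downloads)

HuggingFace's servers are hosted outside China, so direct downloads can be slow or drop mid-transfer. Set a mirror:

# Option 1: temporary (applies to the current terminal only)
export HF_ENDPOINT=https://hf-mirror.com

# Option 2: permanent (append to ~/.bashrc)
echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
source ~/.bashrc

# Check that it took effect
echo $HF_ENDPOINT
# Expected output: https://hf-mirror.com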
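
If you work in a notebook or cannot edit the shell profile, the same variable can be set from Python. This is a minimal sketch that assumes the HuggingFace libraries read `HF_ENDPOINT` when they are first imported, so the assignment has to come before any such import.

```python
import os

# Set the endpoint BEFORE importing transformers or huggingface_hub;
# the value is assumed to be read when those packages are first imported.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

print(os.environ["HF_ENDPOINT"])

# Import the library only after the variable is in place, for example:
# from transformers import pipeline
```

Using `setdefault` rather than plain assignment means an endpoint exported in the terminal still wins, which keeps the script and the shell configuration from silently disagreeing.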

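Configuring the Cache Path

Model files are large (hundreds of MB to tens of GB). They are cached under ~/.cache/huggingface/ by default, and you can move the cache to a bigger disk:

# Point the cache directory at a large disk
export HF_HOME=/data/huggingface_cache

# Make it permanent
echo 'export HF_HOME=/data/huggingface_cache' >> ~/.bashrc
source ~/.bashrc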
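
To see how much disk the cache is actually using, a small standard-library helper is enough. The function names below are hypothetical; the sketch only assumes that the cache lives in `HF_HOME` when that variable is set and in `~/.cache/huggingface` otherwise.

```python
import os
from pathlib import Path


def resolve_cache_dir() -> Path:
    """Return HF_HOME if set, otherwise the default ~/.cache/huggingface."""
    return Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))


def dir_size_gb(root: Path) -> float:
    """Sum the sizes of all regular files under root, in GB (0.0 if root is missing).

    Symlinks are skipped so that files the cache links to are not counted twice.
    """
    if not root.exists():
        return 0.0
    total_bytes = sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total_bytes / 1024**3


cache_dir = resolve_cache_dir()
print(f"Cache directory: {cache_dir}")
print(f"Approximate size: {dir_size_gb(cache_dir):.2f} GB")
```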

Configuring a Proxy

If you have a proxy (for example v2rayN), you can route Python's downloads through it:

# Set an HTTP/HTTPS proxy
export http_proxy=http://127.0.0.1:10809
export https_proxy=http://127.0.0.1:10809

# Or set a SOCKS5 proxy
export all_proxy=socks5://127.0.0.1:10808

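Hands-On Tutorial

1. Getting Started: NLP Tasks in One Line with Pipeline

Pipeline is the simplest entry point and a good way to get a quick feel for the library.

Sentiment Analysis

from transformers import pipeline

# Create a sentiment-analysis pipeline
# With no model specified, pipeline downloads the default one (distilbert-base-uncased-finetuned-sst-2-english)
classifier = pipeline("sentiment-analysis")

# Analyze the sentiment of one sentence
result = classifier("I love bioinformatics!")
print(result)
# Example output: [{'label': 'POSITIVE', 'score': 0.9998}]
# label = sentiment label (positive/negative)
# score = confidence (0 to 1; closer to 1 means more certain)

# You can analyze several sentences at once
results = classifier([
    "This tool is amazing!",        # positive
    "The experiment failed again."   # negative
])
print(results)
# Example output:
# [{'label': 'POSITIVE', 'score': 0.9998},
#  {'label': 'NEGATIVE', 'score': 0.9994}]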

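Text Classification

from transformers import pipeline

# Zero-shot classification: no training needed, just tell the model which categories exist
classifier = pipeline("zero-shot-classification")

# Give it a piece of text and let the model decide which category it belongs to
result = classifier(
    "Gut microbiota composition is altered in type 2 diabetes patients",
    candidate_labels=["biology", "physics", "computer science"]  # candidate categories
)
# Labels come back sorted by score, so index 0 is the best match
print(f"Category: {result['labels'][0]}, confidence: {result['scores'][0]:.4f}")
# Example output: Category: biology, confidence: 0.9812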

Summarization

from transformers import pipeline

# Create a summarization pipeline
summarizer = pipeline("summarization")

# A longer passage of text
long_text = """
Metagenomics is the study of genetic material recovered directly from
environmental samples. The broad field may also be referred to as
environmental genomics, ecogenomics, or community genomics. It allows
the genomic analysis of uncultured microorganisms, providing insight
into the diversity and functional potential of microbial communities
in various environments including the human gut.
"""

# Generate the summary (max_length and min_length are counted in tokens)
summary = summarizer(long_text, max_length=50, min_length=20)
print(summary[0]['summary_text'])
# Prints a short summary

Translation

from transformers import pipeline

# English to Chinese (with an explicitly chosen translation model)
translator = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-zh"  # English → Chinese translation model
)

result = translator("Random forest is an ensemble learning method.")
print(result[0]['translation_text'])
# Example output: 随机森林是一种集合学习方法。

Question Answering

from transformers import pipeline

# Create a question-answering pipeline (extractive QA: the answer is a span copied from the given text)
qa = pipeline("question-answering")

# Provide a context passage plus a question
result = qa(
    question="What is metagenomics?",
    context="Metagenomics is the study of genetic material recovered "
            "directly from environmental samples. It enables genomic "
            "analysis of uncultured microorganisms."
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
# Example output: Answer: the study of genetic material recovered directly from environmental samples

2. Loading a Pretrained Model and Tokenizer

Pipeline is convenient, but sometimes you need finer-grained control. Here is the complete manual loading workflow:

from transformers import AutoTokenizer, AutoModel
import torch

# ===== Step 1: load the tokenizer and model =====
# AutoTokenizer/AutoModel pick the right class automatically for the given model ID
model_name = "bert-base-uncased"  # the model's HuggingFace ID

tokenizer = AutoTokenizer.from_pretrained(model_name)  # load the matching tokenizer
model = AutoModel.from_pretrained(model_name)           # load the model weights

# ===== Step 2: process the input text with the tokenizer =====
text = "Gut microbiota plays important roles in human health."

# tokenizer() does three things: split into tokens → map to IDs → add special tokens
inputs = tokenizer(
    text,
    return_tensors="pt",   # return PyTorch tensors, the format the model expects
    padding=True,          # pad shorter texts in a batch to the longest one (no effect on a single text)
    truncation=True,       # truncate anything beyond the maximum length
    max_length=128         # maximum number of tokens
)

print(f"Token IDs: {inputs['input_ids']}")
print(f"Token count: {inputs['input_ids'].shape[1]}")

# ===== Step 3: run the model =====
with torch.no_grad():               # no gradients needed for inference, which saves memory
    outputs = model(**inputs)        # **inputs unpacks the dict into keyword arguments

# outputs.last_hidden_state holds one vector per token
# Shape: (batch_size, sequence_length, hidden_size)
# For bert-base-uncased, hidden_size = 768
print(f"Output shape: {outputs.last_hidden_state.shape}")
# Example output: torch.Size([1, 14, 768])
# Meaning: 1 text, 14 tokens (the exact count depends on how the tokenizer splits the text), each a 768-dimensional vector

# ===== Step 4: extract a sentence-level vector (a common approach: take the [CLS] token) =====
cls_embedding = outputs.last_hidden_state[:, 0, :]  # token 0 is [CLS]
print(f"Sentence vector shape: {cls_embedding.shape}")
# Output: torch.Size([1, 768])

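3. Text Generation (with GPT-Style Models)

from transformers import pipeline

# Create a text-generation pipeline
# Uses GPT-2 (a small model that runs locally)
generator = pipeline("text-generation", model="gpt2")

# Give it an opening and let the model continue
result = generator(
    "Metagenomics is a powerful tool for",
    max_length=80,          # maximum total length in tokens, prompt included
    num_return_sequences=1, # how many results to generate
    temperature=0.7,        # controls randomness: lower is more deterministic, higher is more random
    do_sample=True          # enable sampling (otherwise decoding is greedy and every run gives the same text)
)

print(result[0]['generated_text'])

The manual approach (more flexible):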

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

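# Encode the input
input_text = "The human gut microbiome"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,   # generate at most 50 new tokens (the prompt is not counted)
        temperature=0.7,
        do_sample=True,
        top_p=0.9,           # nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches 0.9
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid a warning
    )

# Decode back to text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)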
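
What `temperature` and `top_p` do to the next-token distribution can be sketched in plain Python. This is an illustrative sketch under simplifying assumptions, not the library's sampling code: `toy_logits` is a hypothetical four-token vocabulary, and the function names are made up for this example.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    kept_total = sum(prob for _, prob in kept)
    return {index: prob / kept_total for index, prob in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index: temperature scaling first, then top-p filtering."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]   # hypothetical scores for a 4-token vocabulary
probs = softmax_with_temperature(toy_logits, temperature=0.7)
print([round(p, 3) for p in probs])
print(sorted(top_p_filter(probs, top_p=0.9)))   # [0, 1]: only the two likeliest tokens survive
print(sample_token(toy_logits, rng=random.Random(0)))
```

With these numbers the two most likely tokens already cover more than 90% of the probability mass, so the low-probability tail is never sampled. That is the practical effect of `top_p=0.9`.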

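4. Text Embeddings (Sentence Transformers)

Text embedding turns text into fixed-length vectors that can be used to compute text similarity. It is the foundation of RAG (retrieval-augmented generation) and vector databases.

from sentence_transformers import SentenceTransformer
import numpy as np
from numpy.linalg import norm

# Load the embedding model (downloaded automatically, about 90 MB)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Sentences to encode
sentences = [
    "Gut microbiota in type 2 diabetes",            # Sentence A: gut microbiota in T2D
    "Intestinal bacteria and metabolic disease",      # Sentence B: gut bacteria and metabolic disease (similar)
    "Python is a programming language"                # Sentence C: Python (unrelated)
]

# Encode into vectors
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
# Output: (3, 384) → 3 sentences, 384 dimensions each

# Compute cosine similarity between sentences
def cosine_sim(a, b):
    """Cosine similarity: closer to 1 means more similar, closer to 0 means unrelated."""
    return np.dot(a, b) / (norm(a) * norm(b))

print(f"A vs B (both about gut microbiota): {cosine_sim(embeddings[0], embeddings[1]):.4f}")
print(f"A vs C (completely different topic): {cosine_sim(embeddings[0], embeddings[2]):.4f}")
# Expected: the A vs B similarity is much higher than A vs C

How this relates to RAG: in a RAG pipeline, you convert each passage of your knowledge base into a vector with the embedding model and store it in a vector database (such as ChromaDB). At query time you convert the user's question into a vector as well, find the most similar passages, and pass them to the LLM to generate the answer.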
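
The retrieval step of that workflow can be sketched with the standard library alone. Everything here is illustrative: the three-dimensional vectors are hypothetical stand-ins for real embeddings, a plain dict stands in for the vector database, and no LLM is called.

```python
import math

# Hypothetical 3-dimensional "embeddings"; a real embedding model would produce
# vectors with hundreds of dimensions (384 for all-MiniLM-L6-v2).
knowledge_base = {
    "Gut microbiota composition shifts in type 2 diabetes.": [0.9, 0.1, 0.0],
    "Random forest is an ensemble of decision trees.": [0.1, 0.9, 0.1],
    "Python is a programming language.": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vector, store, top_k=1):
    """Return the top_k passages ranked by cosine similarity to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:top_k]


question = "How does T2D affect gut bacteria?"
query_vector = [0.8, 0.2, 0.1]   # hypothetical embedding of the question
best_passages = retrieve(query_vector, knowledge_base, top_k=1)

# The retrieved passage is pasted into the prompt that would go to the LLM.
prompt = f"Answer using this context:\n{best_passages[0]}\n\nQuestion: {question}"
print(prompt)
```

A vector database does the same ranking, only with indexing structures that keep it fast when the store holds millions of passages.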

5. Image Classification (with a ViT Model)

Transformers is not limited to text; it can handle images too. ViT (Vision Transformer) applies the Transformer architecture to images.

from transformers import pipeline
from PIL import Image
import requests

# Create an image-classification pipeline
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Option 1: load the image from a URL
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat.png"
image = Image.open(requests.get(url, stream=True).raw)

# Classify
results = classifier(image)
for r in results[:3]:  # show the top 3 results
    print(f"  {r['label']}: {r['score']:.4f}")

# Option 2: load from a local file
# image = Image.open("my_image.jpg")
# results = classifier(image)

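6. Bioinformatics Application: Protein Language Models (ESM)

ESM (Evolutionary Scale Modeling) is a protein language model developed by Meta AI that extracts deep features from amino acid sequences. Connection to your T2D project: proteins produced by the gut microbiota (such as enzymes and toxins) can be passed through ESM to extract features for function prediction.

from transformers import AutoTokenizer, AutoModel
import torch

# ===== Load the ESM2 model (8M-parameter version, fits easily in 8 GB of VRAM) =====
model_name = "facebook/esm2_t6_8M_UR50D"  # smallest version, good for getting started

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# ===== Prepare the protein sequence =====
# An example protein sequence (one-letter amino acid codes)
# In practice this would be a protein sequence from your microbiome data
protein_sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

# ===== Tokenize: the ESM tokenizer treats each amino acid as one token =====
inputs = tokenizer(
    protein_sequence,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024  # cap at 1024 tokens; this count includes the special tokens added at both ends
)

print(f"Sequence length: {len(protein_sequence)} amino acids")
print(f"Token count: {inputs['input_ids'].shape[1]} (including special tokens)")

# ===== Inference: extract protein features =====
with torch.no_grad():
    outputs = model(**inputs)

# One vector per position (residues plus the special tokens)
per_residue = outputs.last_hidden_state  # (1, seq_len, hidden_dim)
print(f"Per-residue vectors: {per_residue.shape}")

# One vector for the whole protein (simple mean over all positions)
# Note: this average includes the special tokens; for padded batches, mask out padding before averaging
protein_embedding = per_residue.mean(dim=1)  # (1, hidden_dim)
print(f"Whole-protein vector: {protein_embedding.shape}")

# ===== Practical uses =====
# 1. Protein function prediction: feed protein_embedding as features into a random forest or SVM
# 2. Protein similarity search: compute the cosine similarity between two protein embeddings
# 3. Protein clustering: visualize the embeddings of many proteins with PCA or t-SNE

How to put it in an interview:

"I used HuggingFace's ESM2 model to extract feature vectors from gut microbiota protein sequences, then combined them with a random forest for functional classification. ESM2 is a protein language model that plays a role for proteins similar to what BERT plays for text: it is pretrained on amino acid sequences and captures deep features such as evolutionary conservation."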

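Recommended Models That Fit in 8 GB of VRAM

Model	Parameters	Use case	HuggingFace ID
DistilBERT	66M	Text classification, sentiment analysis	`distilbert-base-uncased`
BERT-base	110M	General NLP (classification, NER, question answering)	`bert-base-uncased`
BERT-base-chinese	110M	Chinese NLP tasks	`bert-base-chinese`
GPT-2	124M	English text generation	`gpt2`
all-MiniLM-L6-v2	23M	Text embeddings / semantic search	`sentence-transformers/all-MiniLM-L6-v2`
ViT-base	86M	Image classification	`google/vit-base-patch16-224`
ESM2-8M	8M	Protein feature extraction (getting started)	`facebook/esm2_t6_8M_UR50D`
ESM2-150M	150M	Protein feature extraction (production use)	`facebook/esm2_t30_150M_UR50D`
ESM2-650M	650M	Protein feature extraction (high accuracy)	`facebook/esm2_t33_650M_UR50D`
Qwen2.5-0.5B	500M	Chinese text generation / chat	`Qwen/Qwen2.5-0.5B`

Tip: FP16 weights take about 2 bytes per parameter, so an 8 GB GPU comfortably runs models of up to roughly 1B parameters once activations and framework overhead are included; considerably larger models generally need quantization.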
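
The arithmetic behind that rule of thumb is simple enough to script. The sketch below estimates memory for the weights only; activations, the KV cache during generation, and framework overhead come on top, so treat the numbers as a lower bound. The bytes-per-parameter table is a common approximation, and the 1B and 7B rows are hypothetical model sizes.

```python
# Approximate storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1024**3 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


for name, params in [("ESM2-650M", 650e6), ("1B model", 1e9), ("7B model", 7e9)]:
    row = ", ".join(f"{p}: {weight_memory_gb(params, p):.2f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:>10} -> {row}")
```

A 7B model needs about 13 GB for FP16 weights alone, which is why it does not fit in 8 GB without quantization, while at 4-bit precision the same weights shrink to a little over 3 GB.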

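Common Errors and Fixes

Error message	Cause	Fix
`ConnectionError: couldn't reach huggingface.co`	Network unreachable or blocked	Set `HF_ENDPOINT=https://hf-mirror.com`, or configure a proxy
`OSError: Can't load tokenizer for 'xxx'`	Misspelled model name, or an incomplete download after a network interruption	Check that the model ID is correct; delete the leftover files under `~/.cache/huggingface/` and download again
`torch.cuda.OutOfMemoryError: CUDA out of memory`	Not enough GPU memory	Switch to a smaller model; call `model.half()` to use FP16; reduce `batch_size`; pass `device_map="auto"` when loading
`Token indices sequence length is longer than the specified maximum sequence length (512)`	Input text is too long and exceeds the model's maximum length	Add `truncation=True, max_length=512`, or use a model that supports long inputs
`ImportError: No module named 'sentencepiece'`	Missing dependency	`pip install sentencepiece protobuf`
`ValueError: text input must be of type str`	The input was None or some other non-string type	Check the input data and make sure it is a string; filter out NaN values
`RuntimeError: Expected all tensors to be on the same device`	The model is on the GPU but the data is on the CPU	Put both on the same device: `inputs = inputs.to("cuda")` or `model = model.to("cuda")`
`requests.exceptions.ProxyError`	Incorrect proxy configuration	Check that the proxy port is correct; confirm that the proxy software is running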
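
For the "text input must be of type str" case, the usual culprit is a column that contains None or NaN mixed in with real text. A minimal cleaning step, with a hypothetical helper name, might look like this:

```python
def clean_texts(values):
    """Keep only non-empty strings; drop None, NaN, numbers, and blank strings."""
    cleaned = []
    for value in values:
        # NaN is a float, so the isinstance check removes it along with None and numbers
        if isinstance(value, str) and value.strip():
            cleaned.append(value.strip())
    return cleaned


raw_inputs = ["Gut microbiota and T2D", None, float("nan"), "", 42, "  ESM2 embeddings  "]
texts = clean_texts(raw_inputs)
print(texts)   # ['Gut microbiota and T2D', 'ESM2 embeddings']
```

Run the cleaning before calling the tokenizer or pipeline, and keep track of which rows were dropped if the results need to be joined back to the original table.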

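The HuggingFace Ecosystem at a Glance

Component	Purpose	URL
Model Hub	Model store with 1,000,000+ pretrained models	huggingface.co/models
Datasets	Dataset repository with 200,000+ datasets	huggingface.co/datasets
Spaces	Online demos; try models that other people have deployed	huggingface.co/spaces
Gradio	Library for building web demos quickly with a few lines of Python	gradio.app
Evaluate	Standardized evaluation metrics (accuracy, F1, BLEU, etc.)	huggingface.co/evaluate
Tokenizers	High-performance tokenization library (implemented in Rust)	huggingface.co/docs/tokenizers
Accelerate	Distributed training and mixed-precision acceleration	huggingface.co/docs/accelerate
PEFT	Parameter-efficient fine-tuning (LoRA and others)	huggingface.co/docs/peft
TRL	Reinforcement learning from human feedback (RLHF)	huggingface.co/docs/trl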

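Cheat Sheets

Pipeline Task Types

Task	pipeline argument	Input	Output
Sentiment analysis	`"sentiment-analysis"`	Text	Positive/negative + confidence
Text classification	`"text-classification"`	Text	Category + confidence
Zero-shot classification	`"zero-shot-classification"`	Text + candidate labels	Probability for each label
Named entity recognition	`"ner"`	Text	Entity + type + position
Question answering	`"question-answering"`	Question + context	Answer + confidence
Summarization	`"summarization"`	Long text	Short summary
Translation	`"translation"`	Text	Translated text
Text generation	`"text-generation"`	Opening text	Continuation
Fill-mask	`"fill-mask"`	Text containing a mask token such as [MASK]	Filled-in word + probability
Image classification	`"image-classification"`	Image	Category + confidence
Object detection	`"object-detection"`	Image	Bounding boxes + categories
Speech recognition	`"automatic-speech-recognition"`	Audio	Text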

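Common Model IDs

Use case	Recommended model ID	Notes
English classification	`distilbert-base-uncased`	Small, fast, and cheap to run
Chinese classification	`bert-base-chinese`	First choice for Chinese tasks
English generation	`gpt2`	Small model that runs locally
Chinese generation	`Qwen/Qwen2.5-0.5B`	Small version of Alibaba's Qwen
Text embeddings	`sentence-transformers/all-MiniLM-L6-v2`	Lightweight and efficient
English-to-Chinese translation	`Helsinki-NLP/opus-mt-en-zh`	Dedicated translation model
Protein	`facebook/esm2_t6_8M_UR50D`	Smallest ESM2, for getting started
Protein (production use)	`facebook/esm2_t33_650M_UR50D`	Larger ESM2 with better results
Image classification	`google/vit-base-patch16-224`	Base ViT
Zero-shot classification	`facebook/bart-large-mnli`	Works directly without training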

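Further Learning Resources

Resource	Link / description
Official HuggingFace course	huggingface.co/learn/nlp-course: free NLP course with a Chinese translation
Transformers documentation	huggingface.co/docs/transformers: API reference
HuggingFace forum	discuss.huggingface.co: community Q&A
ESM paper	"Language models of protein sequences at the scale of evolution"
Related in this knowledge base	`12_机器学习基础.md`: how classifiers such as random forests work
Related in this knowledge base	`20_AI与生信交叉应用.md`: more applications of AI in bioinformatics
Related in this knowledge base	`21_大语言模型入门.md`: how LLMs work and how to use them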

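Summary

Where you are now (learning roadmap):

Machine learning basics (done ✓)
    ↓
HuggingFace Transformers: Getting Started (this document ✓)
    ├── Pipeline quick start → run an NLP task with one line of code
    ├── Manual model loading → understand how the tokenizer and model work together
    ├── ESM protein models → tie in with your T2D project
    ↓
What to learn next:
    ├── LangChain + RAG (Knowledge Base 2: 02_LangChain入门与RAG实战.md)
    ├── Model fine-tuning (training a model on your own data)
    └── Running large models locally with Ollama (Knowledge Base 2: 01_Ollama本地大模型部署与使用.md)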