467_Named Entity Recognition (NER)

One-Sentence Summary

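NER (Named Entity Recognition) is the task of identifying specific entities in text, such as person names, place names, organization names, and gene names. It amounts to attaching type labels to the key spans of a text.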

Core Concepts

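Tagging scheme: BIO (Begin-Inside-Outside) is the most widely used tagging format
B-PER: start of a person name; I-PER: continuation of a person name; O: not part of any entity
Sequence labeling view: NER is essentially a per-token classification problem
Nested entities: an entity can contain other entities (e.g. "北京大学附属医院", Peking University Affiliated Hospital, is an organization that contains the organization "北京大学" and the location "北京")
Domain generalization: general-domain NER (CoNLL) and biomedical NER (BioNER) differ substantially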
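
To make the BIO scheme concrete, here is a minimal sketch that turns a BIO tag sequence back into entity spans. The helper name bio_to_spans and its lenient handling of stray I- tags are assumptions made for illustration, not part of any library.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (entity_type, start, end) spans (end exclusive)."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        # The current entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if not continues:
            if ent_type is not None:
                spans.append((ent_type, start, i))  # close the open entity
                start, ent_type = None, None
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]        # open a new entity
            # A stray I- with no matching B- is ignored here (a lenient choice;
            # other conventions treat it as the start of a new entity).
    if ent_type is not None:
        spans.append((ent_type, start, len(tags)))
    return spans


tokens = list("张三在北京工作")
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
for ent_type, start, end in bio_to_spans(tags):
    print(ent_type, "".join(tokens[start:end]))
```

Because each token carries exactly one tag, this representation cannot express nested entities, which is why span-based methods are used for them.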

Classic Models and Methods

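Method	Strengths	Weaknesses	Notes
CRF	Globally optimal tag sequence	Requires heavy feature engineering	Traditional baseline
BiLSTM-CRF	Learned features + global decoding	Relatively slow	Classic deep learning approach
BERT-CRF	Strong representations + global decoding	High compute and memory cost	Current mainstream
BERT-softmax	Simple and fast	Ignores label dependencies	Simplified variant
Span-based	Handles nested entities well	O(n²) candidate spans	Nested scenarios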
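
The "global decoding" credited to the CRF variants above can be illustrated with a small Viterbi sketch in plain Python. The scores below are made up for illustration, and the sketch omits the start and end scores that full CRF implementations usually add; it is a simplified illustration, not the code of any particular library.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label id sequence.

    emissions[t][j]:   score of label j at position t
    transitions[i][j]: score of moving from label i to label j
    """
    num_labels = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            # Best previous label for arriving at label j.
            best_i = max(range(num_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final label.
    best = max(range(num_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]


# Label ids: 0 = O, 1 = B-PER, 2 = I-PER (illustrative values only)
emissions = [[1.0, 0.8, 0.0],
             [0.0, 0.2, 1.0]]
transitions = [[0.0, 0.0, -1e4],   # O -> I-PER is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
print("per-token argmax:", greedy)
print("viterbi:", viterbi(emissions, transitions))
```

Per-token argmax picks O followed by I-PER, an illegal pair, while Viterbi trades a slightly lower first emission for a legal B-PER, I-PER path.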

Code Example

# BERT-CRF NER with HuggingFace Transformers
# pip install transformers torch pytorch-crf   (the PyPI package is pytorch-crf; it is imported as torchcrf)

from transformers import BertTokenizerFast, BertModel
from torchcrf import CRF
import torch
import torch.nn as nn

# ---- 1. Define the BIO labels ----
label2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4}
id2label = {v: k for k, v in label2id.items()}
num_labels = len(label2id)

# ---- 2. Model definition: BERT + linear layer + CRF ----
class BertCRF(nn.Module):
    def __init__(self, bert_model_name, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        hidden_size = self.bert.config.hidden_size  # 768 for BERT-base
        self.classifier = nn.Linear(hidden_size, num_labels)  # per-position label scores (emissions)
        self.crf = CRF(num_labels, batch_first=True)   # CRF layer learns label-transition scores

    def forward(self, input_ids, attention_mask, labels=None):
        # Encode every token with BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state  # (B, L, hidden_size)
        emissions = self.classifier(sequence_output)  # (B, L, num_labels)

        if labels is not None:
            # Training: the CRF returns the log-likelihood, so negate it to get the loss
            loss = -self.crf(emissions, labels, mask=attention_mask.bool())
            return loss
        else:
            # Inference: Viterbi decoding returns the best label sequence per sentence
            preds = self.crf.decode(emissions, mask=attention_mask.bool())
            return preds

# ---- 3. Initialize the model ----
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = BertCRF('bert-base-chinese', num_labels)

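# ---- 4. Inference example ----
# Note: the classifier and CRF layers are randomly initialized here,
# so the predictions are only meaningful after fine-tuning on labeled data.
text = "张三在北京工作"
# A fast tokenizer exposes the word-to-token mapping via enc.word_ids()
enc = tokenizer(list(text), is_split_into_words=True,
                return_tensors='pt', padding=True)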

model.eval()
with torch.no_grad():
    preds = model(enc['input_ids'], enc['attention_mask'])

# preds is a nested list: [[label_id, label_id, ...]]
labels_pred = [id2label[i] for i in preds[0]]
# Drop [CLS] and [SEP]; this assumes each character maps to exactly one token
print(list(zip(text, labels_pred[1:-1])))

Common Interview Questions

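Why use a CRF instead of a plain softmax?
A CRF models transition scores between labels, which helps avoid illegal sequences such as I-PER directly following O
What is the difference between BIO and BIOES?
BIOES adds S (single-token entity) and E (end of entity); it marks single-token entities more precisely, at the cost of a larger label set
How do you handle the alignment problem caused by subword tokenization?
Use the fast tokenizer's word_ids() mapping and keep only the prediction for the first subtoken of each word
What are the particular challenges of biomedical NER?
Highly specialized entity names (e.g. gene names), frequent nesting, and scarce annotated data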
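
The first-subtoken rule from the alignment question can be sketched as follows. The word_ids list has the shape returned by a fast tokenizer's word_ids() (None for special tokens, otherwise the index of the source word), but the tokenization, the GENE label, and the predictions below are written by hand for illustration and are not real model output.

```python
def first_subtoken_predictions(word_ids, token_preds):
    """Keep one prediction per word: the one made on its first subtoken."""
    word_preds = []
    previous = None
    for word_id, pred in zip(word_ids, token_preds):
        # Skip special tokens (None) and non-initial subtokens of a word.
        if word_id is not None and word_id != previous:
            word_preds.append(pred)
        previous = word_id
    return word_preds


words = ["BRCA1", "mutations", "in", "London"]
# Hypothetical tokenization: [CLS] BR ##CA ##1 mutations in London [SEP]
word_ids = [None, 0, 0, 0, 1, 2, 3, None]
token_preds = ["O", "B-GENE", "I-GENE", "O", "O", "O", "B-LOC", "O"]

word_preds = first_subtoken_predictions(word_ids, token_preds)
print(list(zip(words, word_preds)))
```

During training the same mapping is typically used in the other direction, so that only the first subtoken of each word carries the word's gold label.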

Cheat Sheet

Task	Recommended tools
Chinese general-domain NER	bert-base-chinese + CRF
English general-domain NER	bert-base-cased + CRF
Biomedical NER	BioBERT / PubMedBERT
Fast data annotation	Label Studio / doccano
Evaluation metric	seqeval (entity-level F1 over spans)
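
Entity-level F1 counts a prediction as correct only when both the span boundaries and the entity type match exactly. The sketch below shows that idea in plain Python on hand-written spans; the function span_f1 is a hypothetical helper for illustration and is not seqeval's implementation.

```python
def span_f1(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over (entity_type, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [("PER", 0, 2), ("LOC", 3, 5)]
pred = [("PER", 0, 2), ("LOC", 3, 4)]   # the LOC boundary is off by one token
print(span_f1(gold, pred))
```

The near-miss LOC span earns no partial credit, so one boundary error costs both precision and recall. When scoring a whole corpus, each span would also need a sentence index so that spans from different sentences do not collide.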