
478_Biomedical NLP_BioNER


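One-sentence summary

Biomedical NLP (BioNLP) applies natural language processing to biomedical literature. BioNER is the subtask of recognizing biomedical entities such as genes, proteins, diseases, and chemicals in text.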


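Key points

  • Biomedical entity types: genes/proteins, diseases, chemicals/drugs, species, cell lines, and sequence variants such as SNPs
  • Main datasets:
      • NCBI Disease: disease NER
      • BC5CDR: chemical and disease NER
      • JNLPBA: protein, DNA, RNA, cell line, and cell type NER
      • BioCreative VI (ChemProt): chemical-protein relations (a relation extraction task rather than NER)
  • Domain gap: biomedical text uses highly specialized vocabulary, so general-domain NLP models tend to perform poorly on it and domain-specific pretraining is usually needed
  • Challenges: ambiguous entity names (BRCA1 names both a gene and the protein it encodes), abbreviations, and nested entities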
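
The domain gap noted above shows up as early as tokenization. The sketch below is a toy greedy longest-match WordPiece segmenter, not the tokenizer of any real model, and both vocabularies are invented for illustration. It only aims to show how a drug name that is missing from a general-domain vocabulary gets fragmented, while an in-domain vocabulary can keep it whole.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (hypothetical, for illustration only)
general_vocab = {"tam", "##ox", "##if", "##en", "breast", "cancer"}
biomedical_vocab = general_vocab | {"tamoxifen"}

print(wordpiece("tamoxifen", general_vocab))     # ['tam', '##ox', '##if', '##en']
print(wordpiece("tamoxifen", biomedical_vocab))  # ['tamoxifen']
```

Fragmented entity names push more of the recognition work onto the model, which is one plausible reason why in-domain vocabularies help on biomedical tasks.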

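Biomedical pretrained model comparison

Model | Pretraining corpus | Notes
BioBERT | PubMed abstracts + PMC full text | One of the earliest biomedical BERT models
PubMedBERT | PubMed only (no general-domain corpus) | Domain-specific pretraining; generally reported to perform better
SciBERT | Scientific papers from Semantic Scholar | Covers multiple scientific fields
BioLinkBERT | PubMed + links between documents | Captures the citation structure of the literature
BioGPT | PubMed | Generative model for biomedical text generation and QA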

Code examples

# ---- 1. Biomedical NER with a fine-tuned PubMedBERT model ----
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load a PubMedBERT-based model fine-tuned for gene NER.
# It tags gene mentions only; diseases and chemicals need their own models.
ner_model_name = "pruas/BENT-PubMedBERT-NER-Gene"

tokenizer = AutoTokenizer.from_pretrained(ner_model_name)
model = AutoModelForTokenClassification.from_pretrained(ner_model_name)

# Build the NER pipeline
bio_ner = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # merge subtokens that belong to the same entity
)

# Sample biomedical text
bio_text = """
The BRCA1 gene is associated with increased risk of breast cancer.
Tamoxifen is commonly used to treat estrogen receptor-positive breast cancer.
TP53 mutations are found in approximately 50% of human cancers.
"""

entities = bio_ner(bio_text)
print("Recognized biomedical entities:")
for ent in entities:
    # entity_group comes from the label names in the model's config
    print(f"  [{ent['entity_group']}] {ent['word']} (confidence: {ent['score']:.3f})")
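# ---- 2. Multi-type BioNER with the BERN2 API ----
import requests

def bern2_ner(text):
    """Call the public BERN2 web API (entity types: gene, disease, chemical, species, mutation).

    Assumes the response format shown in the BERN2 README: each annotation has
    an 'obj' type string and a 'span' dict with 'begin' and 'end' offsets.
    """
    url = "http://bern2.korea.ac.kr/plain"
    response = requests.post(url, json={"text": text}, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    result = response.json()
    for ann in result.get('annotations', []):
        entity_text = text[ann['span']['begin']:ann['span']['end']]
        entity_type = ann.get('obj')  # a plain string such as "gene" or "disease"
        print(f"  {entity_text}: {entity_type}")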

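# ---- 3. Fine-tuning PubMedBERT for BioNER (custom dataset) ----
import numpy as np
from datasets import load_dataset
from seqeval.metrics import f1_score
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Load BC5CDR, a standard benchmark for chemical and disease NER.
# Dataset ID as given in the source; confirm that it is available on the Hub
# and exposes "tokens" and "ner_tags" columns before running.
dataset = load_dataset("ncats/bc5cdr")

# BC5CDR label names (requires ner_tags to be a sequence of ClassLabel)
label_list = dataset["train"].features["ner_tags"].feature.names
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}

# Load PubMedBERT (pretrained on biomedical text only, no general-domain corpus)
model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

def tokenize_and_align(examples):
    """Tokenize and align NER labels with the WordPiece subtokens."""
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,   # input is already split into words
        truncation=True,
        max_length=512
    )
    labels = []
    for i, label_ids in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_idx = None
        label_list_aligned = []
        for word_idx in word_ids:
            if word_idx is None:
                label_list_aligned.append(-100)  # special tokens are ignored by the loss
            elif word_idx != previous_word_idx:
                label_list_aligned.append(label_ids[word_idx])  # first subtoken of a word
            else:
                label_list_aligned.append(-100)  # later subtokens of the same word are ignored
            previous_word_idx = word_idx
        labels.append(label_list_aligned)
    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align, batched=True)

def compute_metrics(eval_pred):
    """Entity-level F1 with seqeval, skipping positions labelled -100."""
    logits, label_ids = eval_pred
    pred_ids = np.argmax(logits, axis=-1)
    true_tags = [[label_list[l] for l in row if l != -100] for row in label_ids]
    pred_tags = [
        [label_list[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(pred_ids, label_ids)
    ]
    return {"f1": f1_score(true_tags, pred_tags)}

# Hyperparameters are illustrative starting points, not tuned values
training_args = TrainingArguments(
    output_dir="pubmedbert-bc5cdr-ner",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],  # assumes a validation split exists
    data_collator=DataCollatorForTokenClassification(tokenizer),  # pads inputs and labels
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())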

# ---- 4. Fast dictionary-based BioNER (rule-based) ----
def dict_based_ner(text, gene_dict, disease_dict):
    """
    Dictionary-based biomedical entity recognition (simple but fast).
    gene_dict: set of gene names
    disease_dict: set of disease names
    Only single whitespace-separated tokens are matched, so "breast cancer"
    is reported as "cancer".
    """
    results = []
    words = text.split()
    for i, word in enumerate(words):
        word_clean = word.strip('.,;')  # drop trailing punctuation
        if word_clean in gene_dict:
            results.append({'word': word_clean, 'type': 'Gene', 'pos': i})
        elif word_clean in disease_dict:
            results.append({'word': word_clean, 'type': 'Disease', 'pos': i})
    return results

# Sample dictionaries (in practice, build them from curated resources such as NCBI Gene or MeSH)
genes = {"BRCA1", "TP53", "EGFR", "KRAS", "HER2"}
diseases = {"cancer", "diabetes", "Alzheimer", "hypertension"}
text = "BRCA1 mutations are linked to breast cancer risk."
entities = dict_based_ner(text, genes, diseases)
for ent in entities:
    print(f"  [{ent['type']}] {ent['word']}")

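Common interview questions

  1. What is the main difference between PubMedBERT and BioBERT?
     • BioBERT: initialized from general-domain BERT and then further pretrained on biomedical text, so it mixes general and biomedical corpora
     • PubMedBERT: pretrained from scratch on PubMed text only, which avoids general-domain noise; it is generally reported to perform better

  2. Why is biomedical NER harder than general-domain NER?
     • Entity names are highly variable (a single protein can have dozens of aliases)
     • Nested entities are common ("human lung cancer cell line" contains a species, an organ, a disease, and a cell line)
     • Abbreviations are ambiguous (IL-6 may refer to the gene or to the protein)

  3. How is BioNER evaluated?
     • Use seqeval to compute F1 over entity spans rather than over individual tokens; mainstream benchmarks include BC5CDR and NCBI Disease

  4. What are the downstream applications of BioNER?
     • Literature mining, drug target discovery, gene-disease association mining, and clinical decision support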
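
To make the evaluation answer above concrete, here is a minimal pure-Python sketch of span-level F1 next to token-level accuracy. It is a simplified stand-in for what seqeval computes, not seqeval's own code: the helper names are hypothetical, and an `I-` tag with no preceding `B-` is simply ignored, which may differ from seqeval's handling.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end_exclusive) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 where a prediction counts only if type and boundaries match exactly."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def token_accuracy(gold_seqs, pred_seqs):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gold, pred in zip(gold_seqs, pred_seqs) for g, p in zip(gold, pred)]
    return sum(g == p for g, p in pairs) / len(pairs)


# Tokens: breast cancer and tamoxifen
gold = [["B-Disease", "I-Disease", "O", "B-Chemical"]]
pred = [["B-Disease", "O", "O", "B-Chemical"]]  # "breast cancer" cut short to "breast"

print(token_accuracy(gold, pred))  # 0.75
print(entity_f1(gold, pred))       # 0.5
```

A single boundary error costs one token in accuracy but invalidates the whole disease entity under span-level scoring, which is why entity-level F1 is the stricter and more informative metric for NER.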

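Cheat sheet

Entity type | Recommended model/tool
Genes/proteins | BioBERT-NER / BERN2
Diseases | PubMedBERT + NCBI Disease dataset
Chemicals/drugs | SciBERT + BC5CDR
Multiple types | BERN2 API
Fast dictionary matching | FlashText / Aho-Corasick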
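
For the fast dictionary matching row, the sketch below shows the underlying idea with a token-level trie and greedy longest match. It is neither FlashText nor a full Aho-Corasick automaton (there are no failure links), and the function names and the sample lexicon are hypothetical. Unlike a single-token lookup, it can match multi-word names such as "breast cancer".

```python
END = None  # trie key marking the end of a complete phrase; tokens are always strings


def build_trie(lexicon):
    """Build a token-level trie from {entity_type: [phrases]} (case-insensitive)."""
    trie = {}
    for etype, phrases in lexicon.items():
        for phrase in phrases:
            node = trie
            for tok in phrase.lower().split():
                node = node.setdefault(tok, {})
            node[END] = etype
    return trie


def longest_match_ner(text, trie):
    """Scan left to right, keeping the longest dictionary phrase at each position."""
    tokens = [t.strip(".,;:()") for t in text.split()]
    results, i = [], 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                best = (j, node[END])  # remember the longest complete phrase so far
        if best:
            end, etype = best
            results.append((" ".join(tokens[i:end]), etype, i))
            i = end  # skip past the matched phrase
        else:
            i += 1
    return results


# Sample lexicon (illustrative; a real one would come from a curated resource)
lexicon = {"Gene": ["BRCA1", "TP53"], "Disease": ["cancer", "breast cancer"]}
trie = build_trie(lexicon)
print(longest_match_ner("BRCA1 mutations are linked to breast cancer risk.", trie))
# [('BRCA1', 'Gene', 0), ('breast cancer', 'Disease', 5)]
```

Each lookup walks at most the length of the longest phrase from any start position, so the cost depends on the text and phrase lengths rather than on the number of dictionary entries.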