479_Clinical NLP and Electronic Health Record Mining


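One-Sentence Summary

Clinical NLP automatically extracts structured clinical information from the unstructured text in electronic health records (EHRs), such as physician notes, discharge summaries, and pathology reports. It is a core building block of healthcare AI.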


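Core Concepts

  • Unstructured clinical text: nursing notes, radiology reports, operative notes, and discharge summaries; often cited as roughly 80% or more of healthcare data
  • Clinical entity types: symptoms, diagnoses, medications, dosages, test results, body sites, temporal expressions
  • Negation detection: clinical text is full of negated findings ("no fever", "no mass seen") that must be handled explicitly
  • HIPAA compliance: patient data must be de-identified (protected health information, PHI, removed) before it is used for research
  • Main challenges: abbreviations (HTN = hypertension), non-standard spelling, and heavy domain-specific terminology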
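
To make the negation bullet concrete, here is a minimal NegEx-style sketch in plain Python. The trigger list, the scope terminators, the six-token window, and the function name `find_negated` are all illustrative assumptions; real rule sets such as NegEx or medspaCy's ConText are far larger and also handle triggers that follow the finding.

```python
import re

# Assumed, deliberately tiny trigger list (real NegEx rule sets are much larger)
NEGATION_TRIGGERS = ["no", "not", "without", "denies", "denied", "negative for"]
# Assumed words that close a negation scope early
SCOPE_TERMINATORS = {"but", "however", "except"}
WINDOW = 6  # assumed number of tokens after a trigger that count as negated


def find_negated(sentence, targets):
    """Return {target: is_negated} for every target phrase found in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())

    # Mark token positions that fall inside the forward scope of a trigger
    negated_positions = set()
    for i in range(len(tokens)):
        for trigger in NEGATION_TRIGGERS:
            trig = trigger.split()
            if tokens[i:i + len(trig)] == trig:
                start = i + len(trig)
                for j in range(start, min(start + WINDOW, len(tokens))):
                    if tokens[j] in SCOPE_TERMINATORS:
                        break
                    negated_positions.add(j)
                break

    # A target is negated if its first token lies inside a negation scope
    results = {}
    for target in targets:
        words = target.lower().split()
        for k in range(len(tokens) - len(words) + 1):
            if tokens[k:k + len(words)] == words:
                results[target] = k in negated_positions
                break
    return results


print(find_negated("No chest pain or shortness of breath.",
                   ["chest pain", "shortness of breath"]))
print(find_negated("No fever but has cough.", ["fever", "cough"]))
```

The fixed window is the main simplification: it keeps "shortness of breath" inside the scope of "No" in the first sentence, while the terminator "but" keeps "cough" out of scope in the second.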

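Clinical NLP Tools and Models Compared

| Tool/Model | Characteristics | Typical use |
| --- | --- | --- |
| cTAKES | Apache open source; complete clinical NLP pipeline | Enterprise deployment |
| MetaMap | Official NLM tool; maps text to UMLS concepts | Standardizing clinical coding |
| ClinicalBERT | BERT pretrained on clinical notes | Clinical text classification |
| BioBERT | Pretrained on PubMed biomedical literature | Biomedical and clinical NER |
| medspaCy | Clinical extension of spaCy | Rapid prototyping |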

Code Examples

# ---- 1. medspaCy clinical NLP pipeline ----
# pip install medspacy spacy
# python -m spacy download en_core_web_sm

import medspacy
from medspacy.ner import TargetRule

# Build the clinical NLP pipeline
nlp = medspacy.load()  # includes sentence splitting, negation detection, and other components

# Fetch the rule-based matcher so custom clinical entity rules can be added
target_matcher = nlp.get_pipe("medspacy_target_matcher")

# Define the clinical entities to recognize (rule-based approach)
target_rules = [
    TargetRule("fever", "SYMPTOM"),
    TargetRule("cough", "SYMPTOM"),
    TargetRule("chest pain", "SYMPTOM"),           # appears negated in the note below
    TargetRule("shortness of breath", "SYMPTOM"),  # appears negated in the note below
    TargetRule("hypertension", "CONDITION"),
    TargetRule("HTN", "CONDITION"),                # abbreviation for hypertension
    TargetRule("diabetes", "CONDITION"),
    TargetRule("metformin", "MEDICATION"),
    TargetRule("aspirin", "MEDICATION"),
]
target_matcher.add(target_rules)

# Process a clinical note
clinical_note = """
Patient presents with fever and cough.
No chest pain or shortness of breath.
History of hypertension and diabetes.
Currently on metformin 500mg daily.
Denied any aspirin use.
"""

doc = nlp(clinical_note)

# Print each entity together with its negation status
print("Extracted clinical entities:")
for ent in doc.ents:
    is_negated = ent._.is_negated  # set by medspaCy's ConText component
    print(f"  [{ent.label_}] '{ent.text}' | negated: {is_negated}")

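# ---- 2. Clinical text classification with ClinicalBERT ----
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Bio_ClinicalBERT: initialized from BioBERT, then further pretrained on MIMIC-III clinical notes
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# NOTE: the classification head is newly (randomly) initialized here, so the
# model must be fine-tuned on labeled data before its output is meaningful.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # example task: 30-day readmission (0 = no, 1 = yes)
)
model.eval()  # disable dropout for inference

def predict_readmission(discharge_summary):
    """Return the predicted probability of 30-day readmission (class 1)."""
    enc = tokenizer(
        discharge_summary,
        return_tensors='pt',
        max_length=512,   # BERT's input limit; longer text is truncated
        truncation=True,
    )
    with torch.no_grad():
        logits = model(**enc).logits
    prob = torch.softmax(logits, dim=-1)[0, 1].item()  # probability of readmission
    return prob

summary = "Patient discharged with improved diabetes control. Follow up in 2 weeks."
risk = predict_readmission(summary)
print(f"\n30-day readmission risk: {risk:.3f}")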

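# ---- 3. PHI de-identification (privacy protection) ----
# The HIPAA Safe Harbor method lists 18 types of identifiers to remove
# (PHI = Protected Health Information)
import re

def deidentify_phi(text):
    """
    Simple rule-based de-identification (production systems should use
    dedicated tools such as Philter or DeID).
    """
    # Names (simplified; real systems need NER support): mask a capitalized
    # name only when it follows a title or the word "Patient". Matching any two
    # capitalized words would mask "Patient John" and leave "Smith" exposed.
    text = re.sub(r'\b((?:Dr|Mr|Mrs|Ms)\.|Patient)\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?',
                  r'\1 [NAME]', text)
    # Dates
    text = re.sub(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', '[DATE]', text)
    # Phone numbers
    text = re.sub(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', '[PHONE]', text)
    # SSN (US Social Security number)
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Email addresses (a "|" inside a character class is a literal, so it is not used here)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', text)
    return text

phi_text = "Patient John Smith, DOB 03/15/1980, SSN 123-45-6789, called 555-123-4567."
clean_text = deidentify_phi(phi_text)
print(f"\nDe-identification result:\nOriginal: {phi_text}\nProcessed: {clean_text}")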

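# ---- 4. Automated ICD coding (clinical coding assistance) ----
import re

# Map a clinical description to an ICD-10 diagnosis code.
# Real systems typically build on a pretrained language model with retrieval or generation.
def simple_icd_lookup(description, icd_dict):
    """Pick the ICD entry whose name shares the most words with the description.

    Returns (code, name), or (None, None) when no word overlaps. Returning a
    real code as a fallback would silently miscode unmatched text.
    """
    desc_words = set(re.findall(r"[a-z0-9]+", description.lower()))
    best, best_score = (None, None), 0.0
    for code, name in icd_dict.items():
        name_words = set(re.findall(r"[a-z0-9]+", name.lower()))
        # Fraction of the code name's words that appear in the description
        score = len(desc_words & name_words) / len(name_words)
        if score > best_score:
            best, best_score = (code, name), score
    return best

icd10_sample = {
    "E11": "Type 2 diabetes mellitus",
    "I10": "Essential (primary) hypertension",
    "J18.9": "Pneumonia, unspecified organism",
    "K21.0": "Gastro-esophageal reflux disease with esophagitis",
}

diagnosis = "Patient diagnosed with type 2 diabetes mellitus"
code, name = simple_icd_lookup(diagnosis, icd10_sample)
print(f"\nICD code: {code} - {name}")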
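
The ClinicalBERT example in section 2 truncates its input at 512 tokens, yet discharge summaries are often much longer. One common workaround is to split the token sequence into overlapping windows, score each window, and combine the scores. The sketch below shows only the windowing and aggregation logic in plain Python. The names `chunk_token_ids` and `aggregate_scores`, the default window of 510 tokens (512 minus the two special tokens BERT adds), the 128-token overlap, and the choice of max or mean pooling are all assumptions made for illustration.

```python
def chunk_token_ids(token_ids, max_len=510, overlap=128):
    """Split a list of token ids into overlapping windows of at most max_len."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if not token_ids:
        return []
    step = max_len - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return chunks


def aggregate_scores(chunk_scores, how="max"):
    """Combine per-window probabilities into one document-level score."""
    if not chunk_scores:
        raise ValueError("no scores to aggregate")
    if how == "max":
        return max(chunk_scores)
    if how == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError(f"unknown aggregation: {how}")


# Toy demonstration with ten fake token ids and a window of four
print(chunk_token_ids(list(range(10)), max_len=4, overlap=1))
print(aggregate_scores([0.2, 0.7, 0.4], how="max"))
```

Max pooling flags a document when any window looks high-risk, while mean pooling smooths across windows; which one suits a given task is an empirical question.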

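Common Interview Questions

  1. What is the difference between clinical NLP and biomedical NLP?
     • Biomedical NLP: mainly processes scientific literature (PubMed)
     • Clinical NLP: processes real clinical records (EHRs), which contain many abbreviations, informal expressions, and negations

  2. Why is negation detection important in clinical NLP?
     • "No fever" and "fever" mean opposite things; ignoring negation causes errors such as counting "denies chest pain" as a patient with chest pain

  3. What is UMLS?
     • Unified Medical Language System: a metathesaurus maintained by the US National Library of Medicine that integrates many medical terminology standards, including SNOMED CT and ICD

  4. What is the MIMIC-III dataset?
     • A large, de-identified critical-care EHR dataset released by MIT; it is one of the most widely used datasets in clinical NLP research and is available to credentialed researchers through PhysioNet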
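
Abbreviation handling and UMLS-style concept mapping, both mentioned above, can be illustrated with a small dictionary-based sketch. The abbreviation table, the synonym table, and the concept IDs (`CONCEPT_0001`, `CONCEPT_0002`) are hypothetical placeholders, not real UMLS CUIs; tools such as MetaMap or QuickUMLS do this against the full Metathesaurus and must also resolve ambiguous abbreviations from context.

```python
import re

# Hypothetical abbreviation table; real clinical abbreviations are often ambiguous
ABBREVIATIONS = {
    "HTN": "hypertension",
    "DM": "diabetes mellitus",
    "SOB": "shortness of breath",
}

# Hypothetical synonym -> concept ID table (placeholder IDs, NOT real UMLS CUIs)
CONCEPTS = {
    "hypertension": "CONCEPT_0001",
    "high blood pressure": "CONCEPT_0001",
    "diabetes mellitus": "CONCEPT_0002",
    "diabetes": "CONCEPT_0002",
}


def expand_abbreviations(text):
    """Replace known all-caps abbreviations with their long form."""
    def replace(match):
        return ABBREVIATIONS.get(match.group(0), match.group(0))
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)


def normalize_concepts(text):
    """Return (matched term, concept ID) pairs, one per distinct concept."""
    expanded = expand_abbreviations(text).lower()
    found = []
    seen_ids = set()
    # Try longer terms first so "diabetes mellitus" wins over "diabetes"
    for term in sorted(CONCEPTS, key=len, reverse=True):
        concept_id = CONCEPTS[term]
        if concept_id in seen_ids:
            continue
        if re.search(r"\b" + re.escape(term) + r"\b", expanded):
            found.append((term, concept_id))
            seen_ids.add(concept_id)
    return found


print(expand_abbreviations("History of HTN and DM"))
print(normalize_concepts("History of HTN and high blood pressure"))
```

Mapping synonyms and abbreviations to one identifier is what lets "HTN", "hypertension", and "high blood pressure" be counted as the same condition downstream.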

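Cheat Sheet

| Task | Tools |
| --- | --- |
| Clinical NER | medspaCy / cTAKES |
| Negation detection | NegEx / medspaCy |
| Clinical pretrained models | ClinicalBERT / BioClinicalBERT |
| PHI de-identification | Philter / Microsoft Presidio |
| Concept normalization | MetaMap / QuickUMLS |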