480_Gene Ontology Text Mining


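One-Sentence Summary

Gene Ontology (GO) text mining automatically extracts gene function annotations from the biomedical literature to support the manual curation of GO annotations. It is an application at the intersection of bioinformatics and NLP.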


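Key Concepts

  • Gene Ontology (GO): a structured vocabulary describing the functions of gene products, split into three sub-ontologies:
      • MF (Molecular Function): activities at the molecular level, e.g. "enzyme activity"
      • BP (Biological Process): biological processes, e.g. "apoptosis"
      • CC (Cellular Component): cellular locations, e.g. "mitochondrion"
  • GO annotation: every gene-GO term pair carries an evidence code (IDA = Inferred from Direct Assay, i.e. direct experimental evidence; IEA = Inferred from Electronic Annotation)
  • Text mining tasks: gene NER + GO term NER + gene-GO association extraction
  • Challenges: the GO term hierarchy is complex (a DAG, so a term can have several parents), and the wording used in the literature rarely matches GO term names exactly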
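
The three text mining tasks listed above can be illustrated with a deliberately naive baseline: dictionary lookup for genes and GO terms, followed by sentence-level co-occurrence as the association signal. This is only a sketch; the lexicons, the function names, and the treatment of "apoptosis" as a synonym are assumptions made for illustration, and real systems use trained NER models and full term and synonym lists.

```python
import re

# Hypothetical toy lexicons; a real system would load gene symbols from a gene
# database and term names/synonyms from the GO ontology file.
GENE_LEXICON = {"BRCA1", "TP53", "CASP3"}
GO_LEXICON = {
    "dna repair": "GO:0006281",
    "apoptotic process": "GO:0006915",
    "apoptosis": "GO:0006915",  # treated here as a synonym (assumption)
    "cell cycle": "GO:0007049",
}

def find_genes(sentence):
    """Task 1, gene NER: exact, case-sensitive match of known gene symbols."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return sorted({tok for tok in tokens if tok in GENE_LEXICON})

def find_go_terms(sentence):
    """Task 2, GO term NER: case-insensitive phrase match against the lexicon."""
    lowered = sentence.lower()
    hits = set()
    for phrase, go_id in GO_LEXICON.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            hits.add(go_id)
    return sorted(hits)

def extract_gene_go_pairs(text):
    """Task 3, association extraction: pair genes and GO terms sharing a sentence."""
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for gene in find_genes(sentence):
            for go_id in find_go_terms(sentence):
                pairs.add((gene, go_id))
    return sorted(pairs)

abstract = (
    "BRCA1 is required for DNA repair after double-strand breaks. "
    "Loss of TP53 impairs apoptosis in tumor cells. "
    "The cell cycle was not examined."
)
pairs = extract_gene_go_pairs(abstract)
for gene, go_id in pairs:
    print(gene, go_id)
```

Co-occurrence cannot tell an asserted association from a negated or speculative one, which is why supervised relation extraction is usually layered on top of a baseline like this.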

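Tools and Methods for GO Text Mining

| Tool / Method | Function | Notes |
| --- | --- | --- |
| EXTRACT | Entity recognition + normalization | Supports genes, diseases, and GO terms |
| EVEX | Event extraction (gene regulation) | Focuses on signaling events |
| STRING text mining | Protein co-occurrence mining | Precomputed co-occurrence network |
| BioBERT fine-tune | Supervised classification of GO type | High accuracy, requires labeled data |
| GO-BP prediction | Predict GO terms from sequence / literature | Combines sequence and text features |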

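Code Examples

# ---- 1. Parse the GO ontology OBO file ----
# pip install goatools
from goatools.obo_parser import GODag

# Download the GO ontology file
# wget http://geneontology.org/ontology/go-basic.obo

# Parse the GO directed acyclic graph. Definitions are an optional attribute,
# so "defn" is requested explicitly; without it term.defn is not populated.
godag = GODag("go-basic.obo", optional_attrs={"defn"})

# Inspect the details of a single GO term
go_id = "GO:0006915"  # apoptotic process
if go_id in godag:
    term = godag[go_id]
    print(f"GO ID: {term.item_id}")
    print(f"Name: {term.name}")
    print(f"Namespace: {term.namespace}")  # biological_process
    print(f"Definition: {term.defn[:100]}...")
    print(f"Parents: {[p.item_id for p in term.parents]}")
    print(f"Number of children: {len(term.children)}")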

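# ---- 2. GO annotation enrichment analysis (loading the annotations) ----
from goatools.go_enrichment import GOEnrichmentStudy  # used for the enrichment test itself
import gzip

# Load a GO annotation file (GAF format)
# wget http://current.geneontology.org/annotations/goa_human.gaf.gz
def load_gaf_annotations(gaf_file):
    """Parse a GAF annotation file and return a gene -> set of GO IDs mapping."""
    gene2go = {}
    opener = gzip.open if gaf_file.endswith('.gz') else open
    with opener(gaf_file, 'rt') as f:
        for line in f:
            if line.startswith('!'):
                continue  # skip header/comment lines
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 7:
                continue
            gene_symbol = fields[2]  # gene symbol (DB Object Symbol)
            qualifier = fields[3]    # may contain "NOT" for negated annotations
            go_term = fields[4]      # GO term ID
            evidence = fields[6]     # evidence code
            if 'NOT' in qualifier.split('|'):
                continue  # a NOT annotation states the gene lacks this function
            if evidence != 'IEA':    # exclude purely electronic annotations
                gene2go.setdefault(gene_symbol, set()).add(go_term)
    return gene2go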

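# ---- 3. Identify GO-related descriptions in literature abstracts ----
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Classify an abstract into one of the three GO sub-ontologies
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumes a GO sub-ontology classifier already exists (you have to fine-tune it yourself,
# e.g. starting from AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3))
def classify_go_namespace(text, model, tokenizer):
    """Predict which GO sub-ontology a piece of text describes."""
    enc = tokenizer(text, return_tensors='pt',
                    max_length=512, truncation=True, padding='max_length')
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        logits = model(**enc).logits
        pred = logits.argmax(dim=-1).item()
    # The label order must match the one used during fine-tuning
    go_namespaces = {0: 'Molecular Function', 1: 'Biological Process', 2: 'Cellular Component'}
    return go_namespaces[pred]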

# ---- 4. Match text to GO terms by semantic similarity ----
import torch
from sentence_transformers import SentenceTransformer, util

# Use semantic similarity to match a text description to the closest GO terms
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# GO term database (use the full GO ontology in practice)
go_terms = {
    "GO:0006915": "apoptotic process - programmed cell death",
    "GO:0006260": "DNA replication - copying of DNA",
    "GO:0007049": "cell cycle - progression through mitosis and meiosis",
    "GO:0030154": "cell differentiation - specialization of cells",
    "GO:0016310": "phosphorylation - addition of phosphate group",
}

go_ids = list(go_terms.keys())
go_descriptions = list(go_terms.values())
go_embeddings = model.encode(go_descriptions, convert_to_tensor=True)

def find_matching_go_terms(text, go_embeddings, go_ids, go_descriptions, top_k=3):
    """Match a text description to the most similar GO terms."""
    text_emb = model.encode(text, convert_to_tensor=True)
    cos_scores = util.cos_sim(text_emb, go_embeddings)[0]
    # Never ask for more results than there are candidate terms
    top_results = torch.topk(cos_scores, k=min(top_k, len(go_ids)))
    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        idx = int(idx)  # convert the 0-dim tensor to a plain list index
        results.append({
            'go_id': go_ids[idx],
            'description': go_descriptions[idx],
            'similarity': score.item()
        })
    return results

# Test: match a sentence from an abstract to GO terms
query_text = "The protein promotes programmed cell death in cancer cells"
matches = find_matching_go_terms(query_text, go_embeddings, go_ids, go_descriptions)
print("\nGO term matches:")
for m in matches:
    print(f"  {m['go_id']}: {m['description'][:50]} (similarity: {m['similarity']:.4f})")

# ---- 5. Retrieve related literature from PubMed via Entrez ----
from Bio import Entrez, Medline

Entrez.email = "your_email@example.com"  # NCBI asks for a contact email

def search_pubmed_for_gene_go(gene_name, go_term, max_results=5):
    """Search PubMed for articles mentioning a given gene and GO term."""
    # GO term names are not MeSH headings and PubMed has no [Gene Name] field,
    # so both are searched in the title and abstract instead
    query = f'{gene_name}[Title/Abstract] AND "{go_term}"[Title/Abstract]'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    pmids = record['IdList']

    if not pmids:
        return []

    # Fetch the records (including abstracts) in MEDLINE format
    handle = Entrez.efetch(db="pubmed", id=pmids, rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()
    return records

# results = search_pubmed_for_gene_go("BRCA1", "DNA repair")
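
Step 2 in the code above prepares a gene-to-GO mapping but does not run the enrichment test itself. As a rough illustration of what such a test computes, the sketch below implements a one-sided hypergeometric test with the standard library only. The function names and the toy data are assumptions made for this example; it applies no multiple-testing correction, which enrichment libraries such as goatools handle for you.

```python
from math import comb

def hypergeom_tail(study_hits, study_size, pop_hits, pop_size):
    """P(X >= study_hits) when study_size genes are drawn without replacement
    from pop_size genes, pop_hits of which carry the GO term."""
    upper = min(study_size, pop_hits)
    total = comb(pop_size, study_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total

def go_enrichment(study_genes, population_genes, gene2go):
    """Return {go_id: p_value} for every GO term annotated to a study gene."""
    population = set(population_genes)
    study = set(study_genes) & population
    pop_counts, study_counts = {}, {}
    for gene in population:
        for go_id in gene2go.get(gene, ()):
            pop_counts[go_id] = pop_counts.get(go_id, 0) + 1
            if gene in study:
                study_counts[go_id] = study_counts.get(go_id, 0) + 1
    return {
        go_id: hypergeom_tail(hits, len(study), pop_counts[go_id], len(population))
        for go_id, hits in study_counts.items()
    }

# Hypothetical toy data: 10 genes, 3 annotated to apoptosis, 8 to cell cycle
population = [f"g{i}" for i in range(10)]
gene2go = {g: set() for g in population}
for g in ("g0", "g1", "g2"):
    gene2go[g].add("GO:0006915")
for g in population[:8]:
    gene2go[g].add("GO:0007049")

p_values = go_enrichment(["g0", "g1", "g2"], population, gene2go)
for go_id, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(f"{go_id}: p = {p:.4f}")
```

In this toy setup all three study genes carry both terms, yet only the rare term stands out: its p-value is 1/120, against 56/120 for the term that most of the population shares.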

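Common Interview Questions

  1. What types of GO annotation evidence codes are there?
     • IDA (Inferred from Direct Assay): direct experimental evidence, among the most reliable
     • IPI (Inferred from Physical Interaction): inferred from a physical interaction
     • ISS (Inferred from Sequence or structural Similarity): inferred from sequence similarity
     • IEA (Inferred from Electronic Annotation): automatic annotation without curator review, generally treated as the least reliable

  2. What is the difference between GO enrichment analysis and GO text mining?
     • Enrichment analysis: given a gene list, determine which GO terms are significantly over-represented (a statistical method)
     • Text mining: automatically extract gene-GO associations from the literature (an NLP method)

  3. Why does the DAG structure of the GO ontology matter?
     • It allows annotation propagation: a gene annotated to "apoptotic process" is implicitly also annotated to its parent term "programmed cell death" and to every ancestor above it
     • It supports annotation consistency checks: every child-parent relationship must make sense

  4. What are the main challenges in using text mining to assist GO annotation?
     • The vocabulary is large (on the order of 40,000 terms, not counting obsolete ones); the wording in papers differs from the standard GO names and definitions; and positive and negative instances have to be told apart, since a sentence that mentions a gene and a function does not always assert the association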
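
Annotation propagation over the DAG, mentioned in the answer on DAG structure, amounts to adding every ancestor of each directly annotated term (often called the true path rule). The sketch below uses a simplified, illustrative fragment of the hierarchy with term names instead of GO IDs; the edges, gene annotations, and function names are assumptions for this example rather than a copy of the real ontology, which would be loaded from go-basic.obo.

```python
# Simplified, illustrative DAG fragment: child -> set of direct parents.
# Note that a term may have more than one parent, which is what makes it a DAG.
PARENTS = {
    "apoptotic process": {"programmed cell death"},
    "programmed cell death": {"cell death", "cellular process"},
    "cell death": {"cellular process"},
    "mitotic cell cycle": {"cell cycle"},
    "cell cycle": {"cellular process"},
    "cellular process": {"biological_process"},
    "biological_process": set(),
}

def ancestors(term, parents):
    """Collect every term reachable by following parent edges (iterative DFS)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def propagate(gene2go, parents):
    """Add all ancestors of each direct annotation to the gene's term set."""
    return {
        gene: set(terms).union(*(ancestors(t, parents) for t in terms))
        for gene, terms in gene2go.items()
    }

# Hypothetical direct annotations
direct = {
    "CASP3": {"apoptotic process"},
    "CDK1": {"mitotic cell cycle"},
}
propagated = propagate(direct, PARENTS)
for gene in sorted(propagated):
    print(gene, sorted(propagated[gene]))
```

Propagated annotations are what enrichment tools normally count, so a broad term collects the genes annotated to any of its descendants.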

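Cheat Sheet

| Task | Tool |
| --- | --- |
| Parse the GO ontology | goatools |
| Enrichment analysis | goatools / clusterProfiler (R) |
| PubMed retrieval | BioPython Entrez |
| Entity recognition | EXTRACT / TaggerOne |
| Pretrained models | PubMedBERT / BioBERT |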