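480_Gene Ontology Text Mining
One-Sentence Summary
Gene Ontology (GO) text mining automatically extracts gene function annotations from the biomedical literature to support the manual curation of GO annotation databases; it sits at the intersection of bioinformatics and NLP.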
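Key Concepts
- Gene Ontology (GO): a structured vocabulary describing the functions of gene products, divided into three sub-ontologies:
  - MF (Molecular Function): molecular-level activities, e.g. "enzyme (catalytic) activity"
  - BP (Biological Process): biological processes, e.g. "apoptotic process"
  - CC (Cellular Component): cellular locations, e.g. "mitochondrion"
- GO annotation: every gene-GO term pair carries an evidence code (IDA = direct experimental evidence, IEA = electronic inference)
- Text-mining tasks: gene NER + GO term NER + gene-GO association extraction
- Challenges: the GO term hierarchy is complex (a DAG, so a term can have several parents), and the wording used in the literature rarely matches GO term names exactly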
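The three sub-tasks listed above (gene NER, GO term NER, association extraction) can be illustrated with a deliberately simple baseline: dictionary lookup for both entity types, plus sentence-level co-occurrence as the association signal. This is only a sketch under stated assumptions. The lexicons `GENE_LEXICON` and `GO_LEXICON`, the helper names, and the sample abstract are all hypothetical; a real system would use a trained tagger and a proper normalizer rather than exact string matching.

```python
import re

# Toy lexicons (hypothetical): surface form -> normalized identifier.
GENE_LEXICON = {"BRCA1": "BRCA1", "TP53": "TP53", "p53": "TP53"}
GO_LEXICON = {
    "DNA repair": "GO:0006281",
    "apoptosis": "GO:0006915",
    "programmed cell death": "GO:0012501",
}

def find_mentions(sentence, lexicon):
    """Return the identifiers whose surface form occurs in the sentence."""
    found = set()
    for surface, identifier in lexicon.items():
        # Case-insensitive match on whole words only.
        if re.search(rf"\b{re.escape(surface)}\b", sentence, flags=re.IGNORECASE):
            found.add(identifier)
    return found

def extract_gene_go_pairs(text):
    """Pair every gene with every GO term mentioned in the same sentence."""
    pairs = set()
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_mentions(sentence, GENE_LEXICON)
        terms = find_mentions(sentence, GO_LEXICON)
        pairs.update((gene, term) for gene in genes for term in terms)
    return pairs

abstract = (
    "BRCA1 is required for DNA repair after double-strand breaks. "
    "Loss of p53 impairs apoptosis in tumor cells. "
    "The cell cycle was unaffected."
)
pairs = extract_gene_go_pairs(abstract)
print(sorted(pairs))
```

Co-occurrence gives high recall but cannot tell a positive statement from a negated or speculative one, which is one reason supervised relation extraction is usually layered on top of it.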
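Tools and Methods for GO Text Mining
| Tool / Method | Function | Notes |
|---|---|---|
| EXTRACT | Entity recognition + normalization | Supports genes, diseases, and GO terms |
| EVEX | Event extraction (gene regulation) | Focuses on signaling events |
| STRING text mining | Protein co-occurrence mining | Precomputed co-occurrence network |
| BioBERT fine-tuning | Supervised classification of GO categories | High accuracy; requires labeled data |
| GO-BP prediction | Predicts GO terms from sequence/literature | Combines sequence and text features |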
Code Examples
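# ---- 1. Parse the GO ontology OBO file ----
# pip install goatools
from goatools.obo_parser import GODag

# Download the GO ontology file first:
# wget http://geneontology.org/ontology/go-basic.obo
# goatools only loads term definitions when they are requested explicitly.
godag = GODag("go-basic.obo", optional_attrs={"defn"})  # parse the GO DAG

# Inspect the details of one GO term
go_id = "GO:0006915"  # apoptotic process
if go_id in godag:
    term = godag[go_id]
    print(f"GO ID: {term.item_id}")
    print(f"Name: {term.name}")
    print(f"Namespace: {term.namespace}")  # biological_process
    # Fall back to an empty string if the definition was not loaded
    print(f"Definition: {getattr(term, 'defn', '')[:100]}...")
    print(f"Parents: {[p.item_id for p in term.parents]}")
    print(f"Number of children: {len(term.children)}")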
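# ---- 2. GO annotation enrichment analysis ----
from goatools.go_enrichment import GOEnrichmentStudy
import gzip

# Load a GO annotation file (GAF format)
# wget http://current.geneontology.org/annotations/goa_human.gaf.gz
def load_gaf_annotations(gaf_file):
    """Parse a GAF annotation file and return a gene -> set of GO IDs mapping."""
    gene2go = {}
    opener = gzip.open if gaf_file.endswith('.gz') else open
    with opener(gaf_file, 'rt') as f:
        for line in f:
            if line.startswith('!'):
                continue  # skip header/comment lines
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 7:
                continue  # skip malformed lines
            gene_symbol = fields[2]  # gene symbol (DB Object Symbol)
            qualifier = fields[3]    # may contain "NOT"
            go_term = fields[4]      # GO term ID
            evidence = fields[6]     # evidence code
            if 'NOT' in qualifier.split('|'):
                continue  # negated annotation: the gene does NOT have this function
            if evidence != 'IEA':  # exclude purely electronic annotations
                gene2go.setdefault(gene_symbol, set()).add(go_term)
    return gene2go

# Sketch of running the study. `population_genes` and `study_genes` are
# placeholder collections of gene symbols that you must supply; `godag`
# comes from step 1.
# gene2go = load_gaf_annotations("goa_human.gaf.gz")
# goea = GOEnrichmentStudy(population_genes, gene2go, godag, methods=["fdr_bh"])
# results = goea.run_study(study_genes)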
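The gene-to-GO mapping built above is the input to enrichment analysis. To make the underlying statistic concrete, here is a minimal pure-Python sketch of a one-sided hypergeometric test, which is one common choice for over-representation analysis. It is not goatools' implementation: the function names and the toy labels `TERM_A` and `TERM_B` are hypothetical, and the multiple-testing correction that a real analysis needs is omitted.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes in a study set of size n
    drawn from N population genes, K of which carry the GO term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def simple_enrichment(study_genes, population_genes, gene2go):
    """Return {go_id: (k, K, p_value)} for each GO term found in the study set."""
    population = set(population_genes)
    study = set(study_genes) & population
    pop_counts, study_counts = {}, {}
    # Count how many population genes and study genes carry each term.
    for gene in population:
        for go_id in gene2go.get(gene, ()):
            pop_counts[go_id] = pop_counts.get(go_id, 0) + 1
            if gene in study:
                study_counts[go_id] = study_counts.get(go_id, 0) + 1
    n, N = len(study), len(population)
    return {
        go_id: (k, pop_counts[go_id], hypergeom_pvalue(k, n, pop_counts[go_id], N))
        for go_id, k in study_counts.items()
    }

# Toy data: 10 genes, "TERM_A" on three of them, "TERM_B" on all of them.
toy_population = [f"g{i}" for i in range(10)]
toy_gene2go = {g: {"TERM_B"} for g in toy_population}
for g in ("g0", "g1", "g2"):
    toy_gene2go[g].add("TERM_A")

toy_results = simple_enrichment(["g0", "g1", "g2"], toy_population, toy_gene2go)
for go_id, (k, K, p) in sorted(toy_results.items()):
    print(f"{go_id}: {k}/{K} annotated genes in study, p = {p:.4f}")
```

In the toy data, `TERM_A` is concentrated in the study set and gets a small p-value, while `TERM_B` annotates every gene and is therefore uninformative.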
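# ---- 3. Identify GO-related descriptions in abstracts ----
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Classify an abstract into one of the three GO sub-ontologies
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The 3-way classification head created here is randomly initialized;
# fine-tune it on labeled data before trusting any prediction.
clf_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
clf_model.eval()

# Label order is an assumption; it must match the order used during fine-tuning.
GO_NAMESPACES = {0: 'Molecular Function', 1: 'Biological Process', 2: 'Cellular Component'}

def classify_go_namespace(text, model, tokenizer):
    """Predict which GO sub-ontology a piece of text describes."""
    enc = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    pred = logits.argmax(dim=-1).item()
    return GO_NAMESPACES[pred]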
# ---- 4. Match text to GO terms by semantic similarity ----
import torch
from sentence_transformers import SentenceTransformer, util

# Use embedding similarity to map a text description to the closest GO terms
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# GO term database (use the full GO ontology in practice)
go_terms = {
    "GO:0006915": "apoptotic process - programmed cell death",
    "GO:0006260": "DNA replication - copying of DNA",
    "GO:0007049": "cell cycle - progression through mitosis and meiosis",
    "GO:0030154": "cell differentiation - specialization of cells",
    "GO:0016310": "phosphorylation - addition of phosphate group",
}
go_ids = list(go_terms.keys())
go_descriptions = list(go_terms.values())
go_embeddings = model.encode(go_descriptions, convert_to_tensor=True)

def find_matching_go_terms(text, go_embeddings, go_ids, go_descriptions, top_k=3):
    """Match a text description to its most similar GO terms."""
    text_emb = model.encode(text, convert_to_tensor=True)
    cos_scores = util.cos_sim(text_emb, go_embeddings)[0]
    # Never ask for more results than there are GO terms
    top_results = torch.topk(cos_scores, k=min(top_k, len(go_ids)))
    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        i = int(idx)  # convert the tensor index to a plain int
        results.append({
            'go_id': go_ids[i],
            'description': go_descriptions[i],
            'similarity': score.item()
        })
    return results

# Test: match a sentence from an abstract to GO terms
query_text = "The protein promotes programmed cell death in cancer cells"
matches = find_matching_go_terms(query_text, go_embeddings, go_ids, go_descriptions)
print("\nGO term matches:")
for m in matches:
    print(f"  {m['go_id']}: {m['description'][:50]} (similarity: {m['similarity']:.4f})")
# ---- 5. Retrieve related literature from PubMed via Entrez ----
from Bio import Entrez, Medline

Entrez.email = "your_email@example.com"  # NCBI requires a contact email

def search_pubmed_for_gene_go(gene_name, go_term, max_results=5):
    """Search PubMed for articles mentioning a given gene and GO term."""
    # GO term names are not MeSH headings in general, so search the
    # title/abstract text for both the gene and the GO term name.
    query = f'{gene_name}[Title/Abstract] AND "{go_term}"[Title/Abstract]'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    pmids = record['IdList']
    if not pmids:
        return []
    # Fetch the abstracts in MEDLINE format
    handle = Entrez.efetch(db="pubmed", id=pmids, rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()
    return records

# results = search_pubmed_for_gene_go("BRCA1", "DNA repair")
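Common Interview Questions
- What types of GO evidence codes are there?
  - IDA (Inferred from Direct Assay): direct experimental evidence, among the most reliable
  - IPI (Inferred from Physical Interaction): inferred from a physical interaction
  - ISS (Inferred from Sequence Similarity): inferred from sequence similarity
  - IEA (Inferred from Electronic Annotation): automatic electronic annotation without curator review, generally the least reliable
- How does GO enrichment analysis differ from GO text mining?
  - Enrichment analysis: given a gene list, test which GO terms are significantly over-represented (a statistical method)
  - Text mining: automatically extract gene-GO associations from the literature (an NLP method)
- Why does the DAG structure of GO matter?
  - It allows annotation propagation: a gene annotated to "apoptotic process" is implicitly also annotated to its ancestor terms, such as "cell death"
  - It supports annotation consistency checks: child-parent relationships must make sense
- What are the main challenges in using text mining to assist GO annotation?
  - GO is large (roughly 40,000 terms); the wording in the literature differs from GO's standard definitions; positive and negative statements must be told apart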
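The annotation propagation mentioned in the DAG question can be sketched in a few lines: walk the parent links upward from each directly annotated term and add every ancestor reached. The hierarchy below is a simplified toy excerpt keyed by term name rather than GO ID, and the names `PARENTS`, `ancestors`, and `propagate` are hypothetical. With goatools you would traverse the parsed ontology instead.

```python
# Toy is_a hierarchy (child -> list of parents), simplified for illustration.
PARENTS = {
    "apoptotic process": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": ["cellular process"],
    "cellular process": ["biological_process"],
    "biological_process": [],
}

def ancestors(term, parents):
    """Collect every ancestor of a term by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # a DAG node can be reached by several paths
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(gene2go, parents):
    """Add all ancestor terms to each gene's direct annotations."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
        for gene, terms in gene2go.items()
    }

direct = {"CASP3": {"apoptotic process"}}
propagated = propagate(direct, PARENTS)
print(sorted(propagated["CASP3"]))
```

The `seen` set matters because in a DAG the same ancestor can be reached through more than one parent, and it should be visited only once.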
Cheat Sheet
| Task | Tool |
|---|---|
| Parse the GO ontology | goatools |
| Enrichment analysis | goatools / clusterProfiler (R) |
| PubMed search | Biopython Entrez |
| Entity recognition | EXTRACT / TaggerOne |
| Pretrained models | PubMedBERT / BioBERT |