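480_Gene Ontology Text Mining

In One Sentence

Gene Ontology (GO) text mining automatically extracts gene function annotations from the biomedical literature to support manual curation of the GO annotation database; it sits at the intersection of bioinformatics and NLP.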

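Core Concepts

Gene Ontology (GO): a structured vocabulary describing the functions of gene products, split into three sub-ontologies:
MF (Molecular Function): activity at the molecular level, e.g. "catalytic activity"
BP (Biological Process): a larger biological program, e.g. "apoptotic process"
CC (Cellular Component): where the gene product acts, e.g. "mitochondrion"
GO annotation: each gene-GO term pair carries an evidence code (e.g. IDA = Inferred from Direct Assay, IEA = Inferred from Electronic Annotation)
Text-mining tasks: gene NER + GO term NER + gene-GO association extraction
Challenges: the GO term hierarchy is complex (a DAG, so a term can have several parents), and the wording in papers rarely matches GO term names exactly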
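
The three text-mining tasks listed above can be sketched as a tiny dictionary-based pipeline. This is only an illustration under strong assumptions: the lexicons `GENE_LEXICON` and `GO_LEXICON` are hypothetical toy data, the helper names are made up for this example, and real systems use trained NER models plus the full GO term and synonym list rather than exact string lookup.

```python
import re

# Toy dictionaries (hypothetical, for illustration only); a real system
# would use a gene lexicon and the full GO term/synonym list.
GENE_LEXICON = {"TP53", "BRCA1", "CASP3"}
GO_LEXICON = {
    "apoptotic process": "GO:0006915",
    "apoptosis": "GO:0006915",  # alternative surface form of the same term
    "dna replication": "GO:0006260",
    "phosphorylation": "GO:0016310",
}

def find_genes(sentence):
    """Step 1, gene NER: exact match of tokens against the gene lexicon."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return sorted({t for t in tokens if t in GENE_LEXICON})

def find_go_terms(sentence):
    """Step 2, GO term NER: case-insensitive substring lookup of known phrases."""
    lowered = sentence.lower()
    return sorted({go_id for phrase, go_id in GO_LEXICON.items() if phrase in lowered})

def extract_associations(sentence):
    """Step 3, association extraction: pair every gene with every GO term
    mentioned in the same sentence (a plain co-occurrence heuristic)."""
    return [(gene, go_id) for gene in find_genes(sentence) for go_id in find_go_terms(sentence)]

pairs = extract_associations("CASP3 activation triggers apoptosis in TP53-deficient cells.")
print(pairs)
```

Co-occurrence over-generates: here TP53 is paired with apoptosis even though the sentence is about TP53-deficient cells, and substring lookup would also match "phosphorylation" inside "dephosphorylation". Both are instances of the mismatch between paper wording and GO terms noted above.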

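GO Text-Mining Tools and Methods

Tool/Method	Function	Notes
EXTRACT	Entity recognition + normalization	Covers genes, diseases, and GO terms
EVEX	Event extraction (gene regulation)	Focuses on signaling events
STRING text mining	Protein co-occurrence mining	Precomputed co-occurrence network
BioBERT fine-tuning	Supervised classification of GO category	Can be accurate, but needs labeled data
GO-BP prediction	Predict GO terms from sequence/literature	Combines sequence and text features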

Code Examples

# ---- 1. Parse the GO ontology OBO file ----
# pip install goatools
from goatools.obo_parser import GODag

# Download the GO ontology file
# wget http://geneontology.org/ontology/go-basic.obo

# Parse the GO directed acyclic graph; term definitions are only loaded on request
godag = GODag("go-basic.obo", optional_attrs={"defn"})

# Inspect one GO term in detail
go_id = "GO:0006915"  # apoptotic process
if go_id in godag:
    term = godag[go_id]
    print(f"GO ID: {term.item_id}")
    print(f"Name: {term.name}")
    print(f"Namespace: {term.namespace}")  # biological_process
    print(f"Definition: {getattr(term, 'defn', '')[:100]}...")
    print(f"Parents: {[p.item_id for p in term.parents]}")
    print(f"Number of children: {len(term.children)}")

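# ---- 2. GO enrichment analysis: load the annotations ----
# GOEnrichmentStudy takes a gene -> GO IDs mapping like the one built below
from goatools.go_enrichment import GOEnrichmentStudy
import gzip

# Load a GO annotation file (GAF format)
# wget http://current.geneontology.org/annotations/goa_human.gaf.gz
def load_gaf_annotations(gaf_file):
    """Parse a GAF file and return a gene symbol -> set of GO IDs mapping."""
    gene2go = {}
    opener = gzip.open if gaf_file.endswith('.gz') else open
    with opener(gaf_file, 'rt', encoding='utf-8') as f:
        for line in f:
            if line.startswith('!'):
                continue  # skip header/comment lines
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 7:
                continue  # malformed line
            gene_symbol = fields[2]  # column 3: DB Object Symbol
            qualifier = fields[3]    # column 4: Qualifier (may contain NOT)
            go_term = fields[4]      # column 5: GO ID
            evidence = fields[6]     # column 7: Evidence Code
            if 'NOT' in qualifier.split('|'):
                continue  # negated annotation: the gene does NOT have this function
            if evidence != 'IEA':    # drop purely electronic annotations
                gene2go.setdefault(gene_symbol, set()).add(go_term)
    return gene2go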

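# ---- 3. Identify GO-related descriptions in abstracts ----
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Classify an abstract into one of the three GO sub-ontologies
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumes a GO sub-ontology classifier already exists (you must fine-tune it yourself).
# Loading the base checkpoint as below gives a randomly initialized 3-class head:
# model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
def classify_go_namespace(text, model, tokenizer):
    """Predict which GO sub-ontology a piece of text describes."""
    enc = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        logits = model(**enc).logits
        pred = logits.argmax(dim=-1).item()
    # The index-to-label order must match the labels used during fine-tuning
    go_namespaces = {0: 'Molecular Function', 1: 'Biological Process', 2: 'Cellular Component'}
    return go_namespaces[pred]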

# ---- 4. Match text to GO terms by semantic similarity ----
import torch
from sentence_transformers import SentenceTransformer, util

# Use embedding similarity to map a free-text description to the nearest GO terms.
# Note: this is a general-purpose multilingual model, not a biomedical one.
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# GO term database (use the full GO ontology in practice)
go_terms = {
    "GO:0006915": "apoptotic process - programmed cell death",
    "GO:0006260": "DNA replication - copying of DNA",
    "GO:0007049": "cell cycle - progression through mitosis and meiosis",
    "GO:0030154": "cell differentiation - specialization of cells",
    "GO:0016310": "phosphorylation - addition of phosphate group",
}

go_ids = list(go_terms.keys())
go_descriptions = list(go_terms.values())
go_embeddings = embedder.encode(go_descriptions, convert_to_tensor=True)

def find_matching_go_terms(text, go_embeddings, go_ids, go_descriptions, top_k=3):
    """Return the top_k GO terms most similar to the text."""
    text_emb = embedder.encode(text, convert_to_tensor=True)
    cos_scores = util.cos_sim(text_emb, go_embeddings)[0]
    # Never ask for more results than there are candidate terms
    top_results = torch.topk(cos_scores, k=min(top_k, len(go_ids)))
    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        i = int(idx)  # tensor index -> plain int for list lookup
        results.append({
            'go_id': go_ids[i],
            'description': go_descriptions[i],
            'similarity': score.item()
        })
    return results

# Test: match a sentence from an abstract to GO terms
query_text = "The protein promotes programmed cell death in cancer cells"
matches = find_matching_go_terms(query_text, go_embeddings, go_ids, go_descriptions)
print("\nGO term matches:")
for m in matches:
    print(f"  {m['go_id']}: {m['description'][:50]} (similarity: {m['similarity']:.4f})")

# ---- 5. Fetch related literature from PubMed via Entrez ----
from Bio import Entrez, Medline

Entrez.email = "your_email@example.com"  # NCBI asks for a contact email

def search_pubmed_for_gene_go(gene_name, go_term, max_results=5):
    """Search PubMed for papers mentioning both a gene and a GO term name."""
    # GO term names are not MeSH headings, so search title/abstract text instead
    query = f'{gene_name}[Title/Abstract] AND "{go_term}"[Title/Abstract]'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    pmids = record['IdList']

    if not pmids:
        return []

    # Fetch the MEDLINE records (the abstract is stored under the 'AB' key)
    handle = Entrez.efetch(db="pubmed", id=pmids, rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()
    return records

# results = search_pubmed_for_gene_go("BRCA1", "DNA repair")
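
The gene-to-GO mapping built in step 2 is the usual input to enrichment analysis. As a rough sketch of the statistics involved, the block below computes a one-sided hypergeometric p-value per GO term in plain Python. It is not how goatools implements the test, the function names `hypergeom_pvalue` and `enrich` are made up for this example, the gene IDs are toy data, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pvalue(pop_size, pop_hits, study_size, study_hits):
    """P(X >= study_hits) when drawing study_size genes without replacement
    from pop_size genes, of which pop_hits carry the GO term."""
    upper = min(pop_hits, study_size)
    total = comb(pop_size, study_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total

def enrich(study_genes, population_genes, gene2go):
    """Return {go_id: p-value} for every GO term annotated in the study set."""
    population = set(population_genes)
    study = set(study_genes) & population
    go_ids = set().union(*(gene2go.get(g, set()) for g in study))
    results = {}
    for go_id in go_ids:
        pop_hits = sum(1 for g in population if go_id in gene2go.get(g, set()))
        study_hits = sum(1 for g in study if go_id in gene2go.get(g, set()))
        results[go_id] = hypergeom_pvalue(len(population), pop_hits, len(study), study_hits)
    return results

# Toy data (hypothetical): 10 genes, 3 of them annotated to GO:0006915
population = [f"g{i}" for i in range(10)]
gene2go = {"g0": {"GO:0006915"}, "g1": {"GO:0006915"}, "g2": {"GO:0006915"}}
pvalues = enrich(["g0", "g1", "g5"], population, gene2go)
print(pvalues)
```

With real data, the p-values would normally be corrected for the number of GO terms tested (for example with Benjamini-Hochberg) before any term is called significant.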

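Common Interview Questions

What types of GO evidence codes are there?
IDA (Inferred from Direct Assay): direct experimental evidence, generally the most trusted
IPI (Inferred from Physical Interaction): inferred from a physical interaction
ISS (Inferred from Sequence or Structural Similarity): inferred from sequence similarity
IEA (Inferred from Electronic Annotation): assigned automatically without curator review, generally the least trusted
How do GO enrichment analysis and GO text mining differ?
Enrichment analysis: given a gene list, test which GO terms are significantly over-represented (a statistical method)
Text mining: automatically extract gene-GO associations from the literature (an NLP method)
Why does the DAG structure of GO matter?
It enables annotation propagation: a gene annotated to "apoptotic process" is implicitly annotated to its parent "programmed cell death" and to every ancestor above it
It supports consistency checks: every child-parent relationship must make biological sense
What are the main challenges in using text mining to assist GO annotation?
The vocabulary is large (tens of thousands of terms); wording in papers differs from the standard GO names and definitions; positive findings must be distinguished from negative ones (e.g. "X does not affect apoptosis")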
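
The annotation propagation mentioned in the DAG question (often called the true-path rule) can be sketched on a toy graph. The `PARENTS` edges below are a simplified, illustrative excerpt and may not match the current GO release; a real implementation would read the relations from the ontology file. The function names are hypothetical.

```python
# Toy is_a edges, child -> set of direct parents (illustrative only;
# check the actual ontology for the current relations).
PARENTS = {
    "apoptotic process": {"programmed cell death"},
    "programmed cell death": {"cell death"},
    "cell death": {"cellular process"},
    "cellular process": {"biological_process"},
    "biological_process": set(),
}

def ancestors(term, parents):
    """All terms reachable by following parent edges upward (term itself excluded).
    Using sets handles DAGs where a term has several parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(gene2terms, parents):
    """Add every ancestor of each direct annotation to the gene's term set."""
    propagated = {}
    for gene, terms in gene2terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        propagated[gene] = full
    return propagated

direct = {"CASP3": {"apoptotic process"}}
print(propagate(direct, PARENTS))
```

Propagating before counting is also why enrichment results often include broad ancestor terms alongside the specific ones.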

Cheat Sheet

Task	Tool
Parse the GO ontology	goatools
Enrichment analysis	goatools / clusterProfiler (R)
PubMed search	Biopython Entrez
Entity recognition	EXTRACT / TaggerOne
Pretrained models	PubMedBERT / BioBERT
