193_Protein Domain Analysis with InterPro

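One-Sentence Overview

InterPro integrates more than a dozen protein domain and family databases (Pfam, CDD, SMART, PANTHER, and others). Its InterProScan tool annotates protein sequences with functional domains and is a standard method for protein function prediction and genome annotation.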

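Key Concepts

Concept	Description
InterPro	A meta-database that integrates multiple protein signature databases
Pfam	The most widely used HMM-based protein domain database (an InterPro member database, now served through the InterPro website)
InterProScan	A tool, runnable locally, that scans protein sequences against the InterPro member databases
Protein domain	A structural unit of a protein that folds and functions independently
GO annotation	Gene Ontology terms associated with InterPro entries
Signature	A sequence pattern or model used to recognize a protein feature
HMM	Hidden Markov model, the core method behind Pfam
Applications	Genome annotation, function prediction, evolutionary analysis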

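Step-by-Step Guide

Step 1: Annotate Proteins with InterProScan

In plain terms: InterProScan works like a "function recognizer". It compares your protein sequences against known domain signatures in the member databases and reports which domains each protein contains and what functions it may have.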

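# Install InterProScan (Linux; requires Java 11+)
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.68-100.0/interproscan-5.68-100.0-64-bit.tar.gz
tar xzf interproscan-5.68-100.0-64-bit.tar.gz
cd interproscan-5.68-100.0
# Index the bundled HMM models once after unpacking (see the official installation guide)
python3 setup.py -f interproscan.properties

# Run InterProScan
# -b sets the output base name; -o only works with a single output format
./interproscan.sh \
    -i proteins.fasta \
    -b interproscan_results \
    -f tsv,gff3,xml \
    -goterms \
    -iprlookup \
    -pa \
    -cpu 16 \
    -dp \
    -appl Pfam,CDD,SMART,SUPERFAMILY,Gene3D,PANTHER

# Parameter notes
# -b: output base name (writes interproscan_results.tsv, .gff3 and .xml)
# -goterms: include GO annotations
# -iprlookup: map member-database signatures to InterPro entries
# -pa: include pathway annotations
# -dp: disable the pre-calculated match lookup service and compute every match locally
# -appl: select which analyses (member databases) to run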

Step 2: Interpret the InterProScan Output

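import pandas as pd

# Column names of the InterProScan TSV output (the file has no header row)
columns = ['Protein', 'MD5', 'Length', 'Analysis', 'SignatureAcc',
           'SignatureDesc', 'Start', 'Stop', 'Score', 'Status',
           'Date', 'InterProAcc', 'InterProDesc', 'GOAnnotations', 'Pathways']

# Missing values are written as '-' in the TSV
ipr = pd.read_csv("interproscan_results.tsv", sep="\t", header=None,
                  names=columns, na_values='-')

# Basic statistics
print(f"Annotated proteins: {ipr['Protein'].nunique()}")
print(f"Distinct signatures detected: {ipr['SignatureAcc'].nunique()}")
print("\nHits per member database:")
print(ipr['Analysis'].value_counts())

# Most frequent InterPro entries
print("\nMost common InterPro entries:")
print(ipr['InterProDesc'].value_counts().head(15))

# Extract GO annotations as one row per protein-GO pair
go_annotations = ipr[ipr['GOAnnotations'].notna()][['Protein', 'GOAnnotations']]
go_list = []
for _, row in go_annotations.iterrows():
    for go in str(row['GOAnnotations']).split('|'):
        # Some releases append a source tag, e.g. "GO:0005515(InterPro)"; keep only the GO ID
        go_list.append({'Protein': row['Protein'], 'GO': go.split('(')[0]})
# The same term repeats on every matching row, so drop duplicate pairs
go_df = pd.DataFrame(go_list, columns=['Protein', 'GO']).drop_duplicates()
print(f"\nUnique protein-GO pairs: {len(go_df)}")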

Step 3: Visualize Domain Architecture

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def plot_domain_architecture(protein_id, ipr_df, output_file=None):
    """Plot the Pfam domain architecture of one protein."""
    protein_data = ipr_df[ipr_df['Protein'] == protein_id].copy()
    if protein_data.empty:
        raise ValueError(f"No InterProScan hits found for {protein_id}")
    protein_length = protein_data['Length'].iloc[0]

    # Show Pfam domains only, ordered along the sequence
    domains = protein_data[protein_data['Analysis'] == 'Pfam'].sort_values('Start')

    fig, ax = plt.subplots(figsize=(12, 2 + len(domains) * 0.3))

    # Draw the protein backbone
    ax.plot([0, protein_length], [0, 0], 'k-', linewidth=3)
    ax.text(protein_length + 10, 0, f'{protein_length} aa', va='center')

    # Draw each domain as a rounded box (Set3 has 12 colours, so cycle through them)
    colors = plt.cm.Set3([i % 12 for i in range(len(domains))])
    for i, (_, domain) in enumerate(domains.iterrows()):
        rect = mpatches.FancyBboxPatch(
            (domain['Start'], -0.3), domain['Stop'] - domain['Start'], 0.6,
            boxstyle="round,pad=0.02", facecolor=colors[i], edgecolor='black', alpha=0.8
        )
        ax.add_patch(rect)
        mid = (domain['Start'] + domain['Stop']) / 2
        # str() guards against a missing description (NaN)
        ax.text(mid, 0, str(domain['SignatureDesc'])[:20], ha='center', va='center', fontsize=7)

    ax.set_xlim(-10, protein_length + 50)
    ax.set_ylim(-1, 1)
    ax.set_xlabel('Position (aa)')
    ax.set_title(f'Domain Architecture: {protein_id}')
    ax.set_yticks([])

    plt.tight_layout()
    if output_file:
        plt.savefig(output_file, dpi=300, bbox_inches='tight')
    plt.show()

# plot_domain_architecture("ProteinA", ipr)
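
Besides plotting, a domain architecture can be written as a text string, which makes it easy to group proteins that share the same domain order. The following is a minimal sketch using only the standard library; the function name `domain_architectures` and the example rows are illustrative assumptions, not InterProScan output.

```python
from collections import defaultdict

def domain_architectures(hits):
    """Return {protein: 'PF00069-PF00433'} with domains ordered by start position.

    `hits` is an iterable of (protein, signature_accession, start) tuples,
    for example the Pfam rows of an InterProScan TSV file.
    """
    by_protein = defaultdict(list)
    for protein, accession, start in hits:
        by_protein[protein].append((int(start), accession))
    # Sort each protein's domains by start coordinate, then join the accessions
    return {
        protein: "-".join(acc for _, acc in sorted(domains))
        for protein, domains in by_protein.items()
    }

# Hypothetical example rows (made up for illustration)
example_hits = [
    ("ProteinA", "PF00433", 310),
    ("ProteinA", "PF00069", 25),
    ("ProteinB", "PF00069", 12),
]
architectures = domain_architectures(example_hits)
print(architectures)
```

With the pandas table from Step 2, the input could be built from the `Protein`, `SignatureAcc` and `Start` columns of the Pfam rows.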

Step 4: Batch Statistics and Enrichment Analysis

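library(ggplot2)
library(dplyr)   # %>%, filter, group_by, summarise, select, distinct
library(tidyr)   # separate_rows

# Read the InterProScan results ('-' marks a missing value; quote = "" keeps quotes in descriptions literal)
ipr <- read.delim("interproscan_results.tsv", header = FALSE,
                  na.strings = "-", quote = "",
                  col.names = c("Protein", "MD5", "Length", "Analysis",
                                "SignatureAcc", "SignatureDesc", "Start", "Stop",
                                "Score", "Status", "Date", "InterProAcc",
                                "InterProDesc", "GO", "Pathways"))

# Count distinct Pfam domains per protein
domain_counts <- ipr %>%
  filter(Analysis == "Pfam") %>%
  group_by(Protein) %>%
  summarise(n_domains = n_distinct(SignatureAcc))

# Distribution of domain counts
ggplot(domain_counts, aes(x = n_domains)) +
  geom_histogram(bins = 20, fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(x = "Number of Domains", y = "Number of Proteins",
       title = "Domain Count Distribution")

# GO enrichment analysis
library(clusterProfiler)
# Build the GO-to-protein mapping used for enrichment
# (enricher() expects TERM2GENE with the term in column 1 and the gene in column 2)
go_mapping <- ipr %>%
  filter(!is.na(GO)) %>%
  separate_rows(GO, sep = "\\|") %>%
  select(GO, Protein) %>%
  distinct()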
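
The R block above stops once the GO mapping is built. As a rough illustration of the over-representation test that typically follows, here is a minimal Python sketch of a one-sided hypergeometric test using only the standard library. The function names, the toy protein IDs and the GO assignments are assumptions for illustration, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def go_enrichment(protein_to_go, target, background):
    """Return [(go_term, k, K, p_value)] sorted by p-value (uncorrected)."""
    background = set(background)
    target = set(target) & background
    # Collect the background proteins annotated with each GO term
    term_members = {}
    for protein in background:
        for go in protein_to_go.get(protein, ()):
            term_members.setdefault(go, set()).add(protein)
    results = []
    for go, members in term_members.items():
        k = len(members & target)
        if k == 0:
            continue
        p = hypergeom_pvalue(k, len(target), len(members), len(background))
        results.append((go, k, len(members), p))
    return sorted(results, key=lambda r: r[3])

# Toy data: 10 background proteins, 3 of which carry GO:0004672
background = [f"P{i}" for i in range(1, 11)]
protein_to_go = {"P1": ["GO:0004672"], "P2": ["GO:0004672"], "P3": ["GO:0004672"]}
enriched = go_enrichment(protein_to_go, target=["P1", "P2", "P4"], background=background)
for go, k, K, p in enriched:
    print(go, k, K, round(p, 4))
```

In a real analysis the background should be every protein that was annotated, not only those with GO terms, and the p-values should be corrected for multiple testing (for example with Benjamini-Hochberg).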

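Command Quick Reference

# Full InterProScan annotation (-b because several output formats are requested)
./interproscan.sh -i proteins.fa -b output -f tsv,gff3 -goterms -iprlookup -pa -cpu 16

# Pfam only
./interproscan.sh -i proteins.fa -o pfam_only.tsv -f tsv -appl Pfam -cpu 16

# hmmscan (search Pfam directly with HMMER; run hmmpress once to index the HMM file)
hmmpress Pfam-A.hmm
hmmscan --domtblout pfam_results.txt --cpu 16 -E 1e-5 Pfam-A.hmm proteins.fa

# Query the InterPro REST API (read-only lookup, so use GET rather than POST)
curl "https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/P12345/"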
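
The `--domtblout` file written by the hmmscan command above is a whitespace-delimited table with one line per domain hit. The sketch below shows one way to read it in Python. The field positions follow the HMMER domain table layout (22 fields followed by a free-text description); the function name, the chosen subset of fields and the sample lines are illustrative assumptions.

```python
def parse_domtblout(lines, max_ievalue=1e-5):
    """Yield one dict per domain hit from HMMER --domtblout lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 22 whitespace-separated fields, then a description that may contain spaces
        f = line.split(maxsplit=22)
        hit = {
            "domain": f[0],            # target (HMM) name
            "pfam_acc": f[1],          # target accession
            "protein": f[3],           # query sequence name
            "i_evalue": float(f[12]),  # independent E-value of this domain
            "score": float(f[13]),     # bit score of this domain
            "ali_from": int(f[17]),    # alignment start on the sequence
            "ali_to": int(f[18]),      # alignment end on the sequence
        }
        if hit["i_evalue"] <= max_ievalue:
            yield hit

# Made-up sample lines in domtblout layout (values are not real results)
sample = """# target name  accession  tlen  query name ...
Pkinase  PF00069.28  264  ProteinA  -  520  1.2e-70  235.1  0.0  1  1  3.4e-74  2.5e-70  234.0  0.0  1  264  25  280  25  280  0.98  Protein kinase domain
Weak_hit  PF99999.1  100  ProteinA  -  520  0.8  5.1  0.0  1  1  0.3  0.5  4.0  0.0  1  40  300  340  300  340  0.60  Hypothetical weak match
"""
hits = list(parse_domtblout(sample.splitlines()))
print(hits)
```

Filtering on the per-domain independent E-value rather than the full-sequence E-value is a design choice here; the threshold should match whatever cutoff the analysis calls for.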

Common Interview Questions

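Q1: How are InterPro and Pfam related? A: Pfam is one of InterPro's member databases. InterPro integrates Pfam, CDD, SMART, PANTHER and its other member databases, and merges signatures from different databases that describe the same domain or family into a single InterPro entry. In 2022-2023 the standalone Pfam website was retired and its content moved to the InterPro website; Pfam itself remains a distinct member database.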

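Q2: What is the difference between a protein domain and a motif? A: A domain is a larger unit (usually more than about 40 amino acids) that folds independently and has its own structure and function. A motif is a short conserved sequence pattern (usually 5-20 amino acids) that may not fold on its own. PROSITE patterns mainly detect motifs (PROSITE profiles also cover domains), whereas Pfam mainly detects domains and families.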
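
To make the motif idea concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It uses the N-glycosylation site pattern `N-{P}-[ST]-{P}` (PROSITE PS00001) as the example. The helper name `prosite_to_regex` and the test sequence are assumptions, and the conversion handles only the common syntax elements (it does not handle terminus anchors placed inside square brackets).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} means any residue except P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) means 2 to 4 repeats
    regex = regex.replace("x", ".")                     # x means any residue
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex

glyco = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTQ"  # made-up test sequence
starts = [m.start() for m in re.finditer(glyco, sequence)]
print(glyco, starts)
```

Note that `re.finditer` reports non-overlapping matches only; overlapping sites would need a lookahead-based pattern.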

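Q3: Why does InterProScan integrate multiple databases? A: The databases use different methods (HMMs, PSSMs, fingerprints, and so on) and focus on different levels of protein features, so combining them gives broader annotation coverage. For example, Pfam focuses on domains, PANTHER on families, Gene3D on structural domains, and SMART on signalling domains.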

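Q4: How does an HMM detect protein domains? A: A profile hidden Markov model (HMM) is a probabilistic description of a domain's sequence features, built from a multiple sequence alignment. Each position in the model stores the probability of each amino acid occurring there, together with the probabilities of insertions and deletions. The query sequence is aligned to the model and scored. HMMER3, which Pfam uses, applies fast filters (including a Viterbi filter) and then reports a Forward-algorithm bit score that sums over all possible alignments rather than only the single best Viterbi path. The E-value expresses the statistical significance of that score.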
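
The position-specific part of this idea can be sketched in a few lines. The code below is a deliberately simplified model: it keeps only per-position amino acid probabilities (the match-state emissions) and scores ungapped windows with log-odds in bits. It omits the insert and delete states, transition probabilities and Forward scoring of a real profile HMM, so it is not how HMMER computes its scores. The uniform background frequency, the pseudocount, the toy alignment columns and all names are assumptions.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(ALPHABET)  # uniform background frequency (simplifying assumption)

def log_odds_model(columns, pseudocount=1.0):
    """Build per-position log-odds scores from alignment columns (strings of residues)."""
    model = []
    for column in columns:
        total = len(column) + pseudocount * len(ALPHABET)
        model.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return model

def best_window(sequence, model):
    """Slide the model along the sequence and return (best_score_in_bits, start)."""
    width = len(model)
    scores = [
        (sum(model[i][sequence[s + i]] for i in range(width)), s)
        for s in range(len(sequence) - width + 1)
    ]
    return max(scores)

# Toy alignment of four sequences over three positions: G, then K/R, then S/T
model = log_odds_model(["GGGG", "KKKR", "SSTS"])
score, start = best_window("MAGKSLL", model)
print(round(score, 2), start)
```

Residues that are frequent in a column get positive scores and unseen residues get negative ones, so the window matching the alignment stands out. Sequences containing characters outside the 20 standard amino acids would need extra handling.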

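Q5: What is the computational bottleneck of InterProScan? A: InterProScan runs many analysis programs, and the slowest are usually the HMM-based searches (such as Pfam and PANTHER). For large-scale genome annotation (tens of thousands of proteins), split the input and process it in parallel or on a cluster. You can also use -appl to run only the databases you need, and leaving out -dp lets InterProScan reuse pre-calculated matches from the lookup service for sequences it already knows.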
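
One simple way to parallelize is to split the input FASTA into fixed-size chunks and submit one InterProScan job per chunk. The following is a minimal standard-library sketch; the function names, the default chunk size of 1000 and the `chunk_0001.fasta` naming scheme are assumptions to adapt to your cluster setup.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_records(records, chunk_size):
    """Group records into lists of at most chunk_size entries."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch

def write_chunks(fasta_path, chunk_size=1000, prefix="chunk"):
    """Write chunk_0001.fasta, chunk_0002.fasta, ... and return the file names."""
    names = []
    with open(fasta_path) as handle:
        batches = split_records(read_fasta(handle), chunk_size)
        for index, batch in enumerate(batches, start=1):
            name = f"{prefix}_{index:04d}.fasta"
            with open(name, "w") as out:
                for header, seq in batch:
                    out.write(f">{header}\n{seq}\n")
            names.append(name)
    return names

# Example: names = write_chunks("proteins.fasta", chunk_size=1000)
```

Each chunk can then be passed to `interproscan.sh -i chunk_0001.fasta ...` as a separate job, and the per-chunk TSV files concatenated afterwards since the TSV format has no header row.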

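Common Pitfalls

Non-protein input: InterProScan expects protein (amino acid) sequences by default; nucleotide input must be declared with -t n so that ORFs are translated first
Incompatible Java version: InterProScan has a Java version requirement (InterProScan 5 typically needs Java 11 or later)
Incomplete database download: some analyses need extra data files (the PANTHER data, for example, is large)
E-value threshold too loose: the default thresholds are usually appropriate; if you set your own, do not make them too permissive
Ignoring domain positions: the order of domains within a protein (its domain architecture) carries functional meaning in itself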
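
The first pitfall can be caught with a quick sanity check before a long run. The sketch below flags sequences that look like nucleotides and strips a trailing `*` stop-codon marker, which is a common cause of failures with translated gene predictions. The function name, the 0.9 nucleotide-fraction threshold and the example sequences are assumptions; the check is a heuristic and not part of InterProScan.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")

def check_protein_sequence(seq, nucleotide_fraction=0.9):
    """Return (cleaned_sequence, warnings) for one input sequence."""
    warnings = []
    cleaned = seq.strip().upper()
    if cleaned.endswith("*"):
        cleaned = cleaned.rstrip("*")
        warnings.append("removed trailing stop-codon '*'")
    if "*" in cleaned:
        warnings.append("internal '*' found; sequence may be mistranslated")
    if cleaned:
        # A, C, G, T and N are also valid amino acid codes, so use a high threshold
        fraction = sum(c in NUCLEOTIDE_CHARS for c in cleaned) / len(cleaned)
        if fraction >= nucleotide_fraction:
            warnings.append("looks like a nucleotide sequence")
    else:
        warnings.append("empty sequence")
    return cleaned, warnings

# Made-up example sequences
print(check_protein_sequence("MKTLLVAG*"))
print(check_protein_sequence("ATGGCGTACGATCGA"))
```

Very short or low-complexity proteins can trip the nucleotide heuristic, so treat its warning as a prompt to inspect the file rather than as a hard error.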

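Supplementary Notes

InterPro Member Databases

Database	Method	Focus
Pfam	HMM	Protein domains and families
CDD	PSSM	Conserved domains
SMART	HMM	Signalling and extracellular domains
PANTHER	HMM	Protein families and subfamilies
Gene3D	HMM	Structural domains (CATH classification)
SUPERFAMILY	HMM	Structural superfamilies (SCOP classification)
PRINTS	Fingerprints	Protein families
PROSITE	Regular-expression patterns / profiles	Motifs and domains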