转录因子结合位点预测¶

一句话概述¶

通过JASPAR数据库获取转录因子（TF）结合motif，使用MEME Suite/FIMO进行序列扫描预测潜在结合位点，结合ATAC-seq/DNase-seq的TF footprinting验证真实结合事件，是基因调控研究的关键分析手段。

核心知识点表格¶

知识点	关键内容	常用工具
Motif数据库	PWM/PFM矩阵存储结合偏好	JASPAR, HOCOMOCO, CIS-BP
De novo motif发现	从序列集中发现过表达motif	MEME, HOMER, STREME
Motif扫描	在基因组序列中搜索已知motif	FIMO, MOODS, PWMScan
TF footprinting	开放染色质中TF保护的切割低谷	HINT-ATAC, TOBIAS, Wellington
Motif富集分析	Peak区域的motif过度表示检验	AME, HOMER findMotifsGenome
调控网络构建	TF-target关系推断	CistromeDB, ENCODE
协同因子分析	共现motif对发现	SpaMo, HOMER

各步骤详解¶

第一步：理解转录因子结合motif¶

白话解释： 转录因子是一类蛋白质，它们通过识别DNA上特定的短序列模式来调控基因表达。比如p53蛋白专门识别"RRRCWWGYYY"这样的序列。这种偏好模式可以用位置权重矩阵（PWM）来数学表达——矩阵的每一列代表motif的一个位置，每个位置上A、C、G、T的出现概率。

技术细节： - PFM (Position Frequency Matrix): 原始频率矩阵 - PWM (Position Weight Matrix): 对数似然比矩阵 = log2(频率/背景频率) - 信息含量(IC): 每个位置的保守程度，最大2 bits - E-value/p-value: motif发现和扫描的统计显著性 - 假阳性问题：短motif在随机序列中也频繁出现

代码示例：

# Python: 使用JASPAR数据库获取TF motif
from Bio import motifs
from Bio.motifs.jaspar.db import JASPAR5
import numpy as np

# 或者直接从JASPAR网站下载PFM文件
# https://jaspar.genereg.net/

# 读取JASPAR格式motif
with open("MA0106.1.jaspar") as f:
    m = motifs.read(f, "jaspar")

print(m.name)        # ETS1
print(m.matrix_id)   # MA0106.1
print(m.counts)      # PFM矩阵
print(m.pwm)         # PWM矩阵
print(m.pssm)        # PSSM (log-odds)

# 计算信息含量
print(m.pssm.mean())
print(m.pssm.std())

# 扫描序列
from Bio.Seq import Seq
test_seq = Seq("ATCGGAAGTCGATCGATCG")
for pos, score in m.pssm.search(test_seq, threshold=3.0):
    print(f"Position {pos}: score {score:.2f}")

第二步：De novo motif发现（MEME Suite）¶

白话解释： 当你有一组序列（比如ChIP-seq peak区域），想知道里面富集了什么DNA模式，就用de novo motif发现工具。MEME会自动搜索序列中反复出现的短模式，不需要预先知道要找什么。

技术细节： - MEME: 期望最大化算法，找最显著的motif - STREME: 更快的统计方法，适合大数据集 - HOMER: 差异富集策略，需要背景序列 - 建议输入序列：peak summit ±100-250bp，不宜太长 - 控制序列选择：相同GC含量的随机基因组区域

代码示例：

# 安装MEME Suite
conda install -c bioconda meme

# 从ChIP-seq peak提取序列
# 先取peak summit ±150bp
awk '{OFS="\t"; mid=int(($2+$3)/2); print $1,mid-150,mid+150}' \
    peaks.narrowPeak > peak_regions.bed

bedtools getfasta -fi genome.fa -bed peak_regions.bed \
    -fo peak_sequences.fasta

# 方法1: MEME (经典，适合<1000条序列)
meme peak_sequences.fasta \
    -dna \
    -oc meme_output \
    -mod zoops \          # 每条序列0或1个motif实例
    -nmotifs 10 \         # 找10个motif
    -minw 6 -maxw 20 \   # motif长度范围
    -revcomp \            # 考虑反向互补
    -markov_order 1       # 1阶马尔可夫背景模型

# 方法2: STREME (快速，适合大数据集)
streme --dna \
    --oc streme_output \
    -p peak_sequences.fasta \
    --minw 6 --maxw 20 \
    --thresh 0.05

# 方法3: HOMER
findMotifsGenome.pl peaks.narrowPeak hg38 homer_output/ \
    -size 200 \
    -mask \
    -p 8

# HOMER优势：自动与已知motif比较，报告p-value和百分比

第三步：已知Motif扫描（FIMO）¶

白话解释： 如果你已经知道某个TF的结合motif（比如从JASPAR获取），想知道基因组中或者特定区域里哪些位置可能被这个TF结合，就用FIMO扫描。它会用PWM给每个位置打分，得分超过阈值的就是预测结合位点。

技术细节： - FIMO使用动态规划计算p-value - 默认阈值: p < 1e-4 - 需要正确设置背景核苷酸频率 - 全基因组扫描会产生大量预测（多数为假阳性） - 结合开放染色质数据过滤可大幅降低假阳性

代码示例：

# 从JASPAR下载motif文件（MEME格式）
# https://jaspar.genereg.net/downloads/

# 单个motif扫描
fimo --oc fimo_output \
    --thresh 1e-4 \
    --bfile background_model.txt \
    MA0139.1.meme \      # CTCF motif
    peak_sequences.fasta

# 全基因组扫描（使用所有JASPAR motif）
fimo --oc fimo_genome \
    --thresh 1e-5 \
    --max-stored-scores 10000000 \
    JASPAR2024_CORE_vertebrates.meme \
    genome.fa

# 只扫描开放染色质区域（大幅减少假阳性）
bedtools getfasta -fi genome.fa \
    -bed atac_peaks.bed -fo open_chromatin.fasta

fimo --oc fimo_open_regions \
    --thresh 1e-4 \
    JASPAR2024_CORE_vertebrates.meme \
    open_chromatin.fasta

# 解析FIMO输出
# fimo.tsv包含: motif_id, sequence_name, start, stop, strand, score, p-value, q-value
head fimo_output/fimo.tsv

# 过滤结果
awk -F'\t' '$8 < 0.05' fimo_output/fimo.tsv > significant_sites.tsv

第四步：Motif富集分析（AME）¶

白话解释： motif富集分析回答的问题是：在我的peak区域中，某个TF的motif出现频率是否显著高于背景区域？这帮助判断哪些TF可能参与了观察到的生物学过程。

代码示例：

# AME (Analysis of Motif Enrichment)
ame --oc ame_output \
    --control --shuffle-- \   # 对照：打乱输入序列
    --method fisher \
    peak_sequences.fasta \
    JASPAR2024_CORE_vertebrates.meme

# 使用特定背景序列
ame --oc ame_output \
    --control background_sequences.fasta \
    --method ranksum \
    peak_sequences.fasta \
    JASPAR2024_CORE_vertebrates.meme

# CentriMo: 检测motif在peak中心的富集
centrimo --oc centrimo_output \
    --ethresh 10 \
    peak_sequences.fasta \
    JASPAR2024_CORE_vertebrates.meme

# 如果motif在peak中心显著富集，说明该TF可能是direct binder

第五步：TF Footprinting分析¶

白话解释： TF footprinting是用ATAC-seq或DNase-seq数据验证TF是否真正结合在预测位点上的方法。原理是：当TF结合在DNA上时，它会保护该区域的DNA不被Tn5转座酶（ATAC-seq）或DNase切割，形成一个"脚印"——即在开放染色质区域内的一个小低谷。

技术细节： - 需要高深度ATAC-seq/DNase-seq数据（>100M reads） - 切割信号校正：Tn5有序列偏好（需要bias correction） - 聚合footprint：对同一TF的所有预测位点取平均信号 - 差异footprinting：比较条件间TF结合变化

代码示例：

# 方法1: TOBIAS (推荐，专门为ATAC-seq设计)
# 安装
pip install tobias

# 步骤1：Tn5偏好校正
TOBIAS ATACorrect \
    --bam atac_sorted.bam \
    --genome genome.fa \
    --peaks atac_peaks.bed \
    --outdir tobias_output/ \
    --cores 8

# 步骤2：计算footprint scores
TOBIAS FootprintScores \
    --signal tobias_output/*_corrected.bw \
    --regions atac_peaks.bed \
    --output tobias_output/footprint_scores.bw \
    --cores 8

# 步骤3：识别bound/unbound位点
TOBIAS BINDetect \
    --motifs JASPAR2024_CORE_vertebrates.meme \
    --signals tobias_output/condition1_footprints.bw \
              tobias_output/condition2_footprints.bw \
    --genome genome.fa \
    --peaks atac_peaks.bed \
    --outdir tobias_bindetect/ \
    --cores 8 \
    --cond_names condition1 condition2

# 步骤4：可视化聚合footprint
TOBIAS PlotAggregate \
    --TFBS tobias_bindetect/CTCF/beds/CTCF_condition1_bound.bed \
    --signals tobias_output/condition1_corrected.bw \
              tobias_output/condition2_corrected.bw \
    --output CTCF_footprint.pdf \
    --share_y both \
    --plot_boundaries

# 方法2: HINT-ATAC
rgt-hint footprinting \
    --atac-seq \
    --paired-end \
    --organism=hg38 \
    --output-location=hint_output/ \
    atac_sorted.bam atac_peaks.bed

第六步：整合分析与调控网络¶

白话解释： 将motif预测、footprinting验证、基因表达数据整合在一起，构建TF调控网络：哪些TF调控哪些基因，TF之间有什么协同或拮抗关系。

代码示例：

# R: 整合分析
library(motifmatchr)
library(chromVAR)
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg38)

# 使用chromVAR评估TF活性变异
# 读取peak计数矩阵
counts <- readRDS("peak_counts_matrix.rds")
peaks <- readRDS("peak_granges.rds")

# 获取JASPAR motif
library(JASPAR2024)
library(TFBSTools)
motifs <- getMatrixSet(JASPAR2024,
    opts=list(collection="CORE", tax_group="vertebrates"))

# 匹配motif到peaks
motif_ix <- matchMotifs(motifs, peaks,
    genome = BSgenome.Hsapiens.UCSC.hg38,
    out = "scores")

# chromVAR分析
library(chromVAR)
dev <- computeDeviations(object = counts, 
                         annotations = motif_ix)

# 提取TF活性z-scores
tf_scores <- deviationScores(dev)

# 差异TF活性
# 使用limma或t-test比较条件间TF活性差异
library(limma)
design <- model.matrix(~condition)
fit <- lmFit(tf_scores, design)
fit <- eBayes(fit)
results <- topTable(fit, number=Inf)

实战命令¶

# === 完整TF结合位点分析流程 ===

# 1. 获取motif数据库
wget https://jaspar.genereg.net/download/data/2024/CORE/JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt

# 2. 从差异ATAC peaks提取序列
bedtools getfasta -fi hg38.fa -bed diff_peaks.bed -fo diff_peaks.fasta

# 3. De novo motif发现
meme diff_peaks.fasta -dna -oc meme_out -mod zoops -nmotifs 15 -minw 6 -maxw 20 -revcomp

# 4. 已知motif富集
ame --oc ame_out --control --shuffle-- diff_peaks.fasta JASPAR2024_CORE_vertebrates.meme

# 5. Motif扫描
fimo --oc fimo_out --thresh 1e-4 JASPAR2024_CORE_vertebrates.meme diff_peaks.fasta

# 6. TF Footprinting
TOBIAS ATACorrect --bam atac.bam --genome hg38.fa --peaks peaks.bed --outdir tobias/ --cores 16
TOBIAS FootprintScores --signal tobias/*_corrected.bw --regions peaks.bed --output tobias/scores.bw --cores 16
TOBIAS BINDetect --motifs JASPAR.meme --signals tobias/scores.bw --genome hg38.fa --peaks peaks.bed --outdir bindetect/ --cores 16

面试常问点¶

Q1: PWM/PSSM的数学原理是什么？ A: PWM中每个元素 = log2(观察频率/背景频率)。序列得分 = 各位置对应碱基的PWM值之和。p-value通过动态规划或极值分布计算。高得分表示序列与motif高度匹配。

Q2: De novo motif发现和motif扫描有什么区别？ A: De novo是无监督的，从序列中发现未知模式；扫描是有目标的，用已知motif在序列中找匹配位点。前者用于发现新调控因子，后者用于预测已知TF的结合位点。

Q3: 如何降低motif预测的假阳性？ A: 1）结合开放染色质数据（只在accessible区域预测）；2）使用footprinting验证实际结合；3）结合ChIP-seq验证；4）考虑进化保守性（结合phastCons）；5）使用更严格的p-value阈值。

Q4: ATAC-seq footprinting需要什么条件？ A: 需要高深度（>100M mapped reads），需要进行Tn5切割偏好校正，需要足够数量的motif实例来聚合信号。单个位点的footprint通常不可靠，聚合分析更有统计力。

Q5: 如何解释motif富集但footprint缺失的情况？ A: 可能原因：1）TF在该条件下未表达；2）TF结合需要辅因子；3）motif被核小体占据；4）TF结合是瞬时的；5）纯motif匹配本身的假阳性。

易错点¶

背景模型选择不当：使用均匀分布（25% each）作为背景会产生大量假阳性，应根据研究区域的GC含量设置。
序列长度不一致：MEME对输入序列长度敏感。建议统一取peak summit ±150bp，过长序列稀释信号。
多重检验校正缺失：全基因组扫描数百万位点，不做FDR校正会得到海量无意义结果。
忽略motif冗余：JASPAR中很多motif非常相似（同家族TF），需要去冗余或用非冗余子集。
Footprinting深度不足：低深度数据做footprinting只看到noise，至少需要100M以上高质量reads。
方向性丢失：某些TF结合有方向性，需要保留strand信息进行下游分析。

补充知识¶

主要Motif数据库比较¶

数据库	Motif数量	特点
JASPAR 2024	~2346（CORE）	质量高，手动审核，开放获取
HOCOMOCO v12	~1443（CORE）	基于ChIP-seq/HT-SELEX数据，人和小鼠
CIS-BP	~15000	最全面，包含预测motif
TRANSFAC	~7000+	最早的商业数据库
ENCODE motifs	~700	来自ENCODE ChIP-seq

TF结合的层级模型¶

Level 1: 序列偏好（PWM匹配）→ 必要但不充分
Level 2: 染色质可及性（开放区域）→ 过滤大部分假阳性
Level 3: 辅因子共结合 → 确定细胞类型特异性
Level 4: 3D基因组结构 → 确定作用靶基因
Level 5: 表观遗传修饰 → H3K4me1增强子标记

MEME Suite工具矩阵¶

工具	功能	输入
MEME	de novo motif发现	序列集
STREME	快速motif发现	序列集
FIMO	motif位点扫描	motif + 序列
AME	motif富集分析	序列集 + motif库
CentriMo	中心富集分析	peak序列 + motif库
TOMTOM	motif相似性比较	query motif + 数据库
SpaMo	motif间距分析	peak序列 + motif

转录因子结合位点预测¶

一句话概述¶

核心知识点表格¶

各步骤详解¶

第一步：理解转录因子结合motif¶

第二步：De novo motif发现（MEME Suite）¶

第三步：已知Motif扫描（FIMO）¶

第四步：Motif富集分析（AME）¶

第五步：TF Footprinting分析¶

第六步：整合分析与调控网络¶

实战命令¶

面试常问点¶

易错点¶

补充知识¶

主要Motif数据库比较¶

TF结合的层级模型¶

MEME Suite工具矩阵¶

📚 相关文章推荐