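771. Transcription Factor Binding Site (TFBS) Analysis

In one sentence: locate where transcription factors (TFs) "dock" on the genome (TFBSs) to decode how gene expression is regulated. It is like finding every switch in a building and learning which switch controls which light.

Core concepts at a glance

Concept	Plain-language explanation	Key tool / note
TFBS	Transcription factor binding site (5-20 bp)	The "switch" of gene regulation
PWM/PSSM	Position weight matrix (describes a motif's base preferences)	JASPAR database
Motif discovery	Finding patterns that recur in a set of sequences	HOMER/MEME
ChIP-seq	Experimentally maps TF binding locations	Gold standard
HOMER	Finds motifs by enrichment testing (hypergeometric or binomial)	findMotifsGenome.pl
MEME Suite	Classic motif analysis toolkit	MEME-ChIP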

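1. Principles (plain-language version)

1.1 What is a TFBS?

Transcription factor (TF):
  A protein that recognizes and binds specific short DNA sequences
  Once bound, it affects transcription of nearby genes (activation or repression)

TFBS (Transcription Factor Binding Site):
  The short DNA sequence a TF recognizes (5-20 bp)
  Example: p53 recognizes the half-site RRRCWWGYYY (R=A/G, W=A/T, Y=C/T); a full response element consists of two such half-sites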
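
An IUPAC consensus such as RRRCWWGYYY can be turned into a regular expression and matched against a sequence. The following is a minimal sketch, assuming forward-strand matching only; the helper names `iupac_to_regex` and `find_sites` and the toy sequence are made up for illustration.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regex."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))

def find_sites(consensus, sequence):
    """Return 0-based starts of all matches (overlaps included), forward strand only."""
    pattern = iupac_to_regex(consensus)
    seq = sequence.upper()
    width = len(consensus)
    return [
        i for i in range(len(seq) - width + 1)
        if pattern.fullmatch(seq, i, i + width)
    ]

# Toy sequence (hypothetical) containing one p53-like half-site
toy = "TTGGGCATGTCCAA"
print(find_sites("RRRCWWGYYY", toy))
```

A consensus match is all-or-nothing, which is why the weighted representations described next are usually preferred for real scans.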

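Why look for TFBSs:
  ① Understand gene regulatory networks: which factor controls which gene?
  ② Disease-associated variants: a SNP inside a TFBS can change gene expression
  ③ Drug targets: design small molecules that interfere with TF binding
  ④ Development and differentiation: TF binding profiles differ between cell types

Representations:
  Consensus sequence: CACGTG (E-box)
  Position weight matrix (PWM): a probability or weight for each of the 4 bases at every position
  Sequence logo: stack height shows information content; each letter's height within the stack is proportional to its frequency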
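
To make the PWM and logo ideas concrete, here is a minimal sketch that builds a probability matrix from a handful of aligned sites and computes the per-position information content (the stack height in a logo). The sites are made up, the pseudocount of 0.5 and the uniform background are assumptions, and no small-sample correction is applied.

```python
import math

def count_matrix(sites):
    """Count A/C/G/T in each column of equal-length aligned sites."""
    width = len(sites[0])
    counts = [{b: 0 for b in "ACGT"} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site.upper()):
            counts[i][base] += 1
    return counts

def to_probabilities(counts, pseudocount=0.5):
    """Convert counts to per-position probabilities, adding a pseudocount."""
    probs = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        probs.append({b: (col[b] + pseudocount) / total for b in "ACGT"})
    return probs

def information_content(col):
    """Information content (bits) of one column against a uniform background."""
    entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
    return 2.0 - entropy

# Hypothetical aligned E-box-like sites
sites = ["CACGTG", "CACGTG", "CATGTG", "CACGTG"]
ppm = to_probabilities(count_matrix(sites))
bits = [information_content(col) for col in ppm]
for i, b in enumerate(bits):
    print(f"position {i}: {b:.2f} bits")
```

Position 2, where the sites disagree, carries less information than the fully conserved positions, so its logo stack would be shorter.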

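1.2 Analysis strategies

Strategy 1: known-motif search (motif scanning)
  Known PWM for a TF → scan the genome or sequences → predicted binding sites
  Tools: FIMO, MAST, PWMScan, MOODS

Strategy 2: de novo motif discovery
  A set of co-regulated sequences → find patterns that recur among them
  Tools: HOMER, MEME, STREME

Strategy 3: ChIP-seq + motif analysis
  ChIP-seq locates TF-bound regions → motif analysis confirms the bound sequence
  Tools: HOMER findMotifsGenome, MEME-ChIP

Strategy 4: multi-omics integration (a 2025-2026 trend)
  ChIP-seq + ATAC-seq + RNA-seq → functional TFBSs
  Tools: TFBSFootprinter, TFBS-Finder (DNABERT)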
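
Strategy 1 boils down to sliding a log-odds matrix along a sequence and reporting windows that score above a threshold. The sketch below illustrates that idea on both strands; it is not how FIMO or MOODS are implemented. The toy matrix (derived from the string AGATAA), the uniform background, and the threshold are all assumptions chosen for illustration.

```python
import math

def log_odds(ppm, background=0.25):
    """Turn a probability matrix into log2-odds scores against a uniform background."""
    return [{b: math.log2(col[b] / background) for b in "ACGT"} for col in ppm]

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(pssm, sequence, threshold):
    """Return (start, strand, score) for every window scoring at or above threshold."""
    width = len(pssm)
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if not set(window) <= set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        for strand, w in (("+", window), ("-", revcomp(window))):
            score = sum(pssm[j][base] for j, base in enumerate(w))
            if score >= threshold:
                hits.append((i, strand, round(score, 2)))
    return hits

# Toy matrix: 0.85 for the preferred base at each position, 0.05 for the others
consensus = "AGATAA"
ppm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in consensus]
pssm = log_odds(ppm)

print(scan(pssm, "CCAGATAACC", threshold=8.0))  # site on the forward strand
print(scan(pssm, "CCTTATCTCC", threshold=8.0))  # same site on the reverse strand
```

In practice the threshold is usually derived from a p-value or a fraction of the maximum score rather than picked by hand.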

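2. HOMER motif analysis

2.1 Installation and basic usage

# ===== Install HOMER =====
# Option 1: install with conda
conda install -c bioconda homer

# Option 2: official install script
# wget http://homer.ucsd.edu/homer/configureHomer.pl
# perl configureHomer.pl -install

# Install genome data (run from the directory that contains configureHomer.pl)
perl configureHomer.pl -install hg38  # human genome hg38
perl configureHomer.pl -install mm10  # mouse genome mm10

# ===== Discover motifs in ChIP-seq peaks =====
# Input:  ChIP-seq peak file (BED format)
# Output: enriched known motifs plus de novo motifs
#
# Positional arguments, in order: peak file, genome build, output directory
#   -size 200       analyze 200 bp centered on each peak
#   -mask           mask repeats
#   -p 8            use 8 threads
#   -preparsedDir   directory for preparsed files (speeds up reruns)
# Comments are kept out of the command: text after a trailing "\" breaks the line continuation.
findMotifsGenome.pl \
  peaks.bed \
  hg38 \
  output_dir/ \
  -size 200 \
  -mask \
  -p 8 \
  -preparsedDir preparsed/

# Output files:
# output_dir/knownResults.html      → enrichment of known motifs
# output_dir/homerResults.html      → de novo motifs
# output_dir/knownResults.txt       → detailed table of known motifs
# output_dir/homerMotifs.all.motifs → all discovered motifs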
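
The idea behind `-size 200` is that every peak is replaced by a fixed-width window around its center, so all sequences entering motif discovery have the same length. A minimal sketch of that resizing on BED-style coordinates follows; the function name `resize_peak` is hypothetical, and HOMER's own coordinate and strand handling may differ in detail.

```python
def resize_peak(chrom, start, end, size=200):
    """Return a fixed-width interval centered on the peak midpoint (0-based, half-open)."""
    center = (start + end) // 2
    new_start = max(0, center - size // 2)  # clamp at the chromosome start
    return chrom, new_start, new_start + size

# A 500 bp peak shrinks to the 200 bp around its midpoint
print(resize_peak("chr1", 1000, 1500))
# A peak near the chromosome start is clamped at 0
print(resize_peak("chr1", 0, 50))
```

Equal-length windows also keep the background comparison fair, since longer sequences would otherwise contain more chance matches.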

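2.2 Motif analysis for different scenarios

# Options are described above each command: a comment after a trailing "\" would break the continuation.

# ===== Scenario 1: motif analysis of ATAC-seq open regions =====
# -size given: use the actual peak sizes (no resizing); -mask: mask repeats; -p 8: threads
findMotifsGenome.pl \
  atac_peaks.bed \
  hg38 \
  atac_motifs/ \
  -size given \
  -mask \
  -p 8

# ===== Scenario 2: motif analysis of promoter regions =====
# Differentially expressed gene list → promoter coordinates → motif search

# Get promoter coordinates: TSS mode, restricted to the gene list,
# from 1000 bp upstream to 200 bp downstream of each TSS
annotatePeaks.pl \
  tss \
  hg38 \
  -list gene_list.txt \
  -size -1000,200 \
  > promoter_regions.txt

# annotatePeaks.pl writes a tab-delimited table, not BED, so convert it before the next step
# (pos2bed.pl ships with HOMER; check that the header row is not carried over)
pos2bed.pl promoter_regions.txt > promoter_regions.bed

# Find motifs in the promoters; -bg supplies the background (promoters of all genes)
findMotifsGenome.pl \
  promoter_regions.bed \
  hg38 \
  promoter_motifs/ \
  -size given \
  -bg background_promoters.bed

# ===== Scenario 3: differential binding =====
# Compare motifs between two ChIP-seq groups: group A-specific peaks as targets,
# group B-specific peaks as background, 200 bp windows
findMotifsGenome.pl \
  group_A_specific_peaks.bed \
  hg38 \
  diff_motifs/ \
  -size 200 \
  -bg group_B_specific_peaks.bed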

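2.3 Motif scanning and quantification

# ===== Locate known motifs within genomic regions =====
# annotatePeaks.pl is HOMER's general-purpose annotation tool

# Count occurrences of each motif per peak
# -m: motif file; -nmotifs: report the number of motif instances per peak
annotatePeaks.pl \
  peaks.bed \
  hg38 \
  -m known_motifs.motif \
  -nmotifs \
  > peaks_motif_counts.txt

# Report the exact positions of motif instances in the peaks
# -mbed: write motif positions to a BED file; the annotation table still goes to stdout
annotatePeaks.pl \
  peaks.bed \
  hg38 \
  -m known_motifs.motif \
  -mbed motif_positions.bed \
  > peaks_motif_annotation.txt

# ===== Get motifs from JASPAR =====
# JASPAR is a widely used open-access database of TF binding profiles
# Downloads: https://jaspar.elixir.no/downloads/
# Convert to HOMER motif format (confirm that this converter script exists in your installation)
jaspar2homer.pl JASPAR.pfm > jaspar.motif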

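3. MEME Suite analysis

3.1 MEME-ChIP workflow

# ===== Install MEME Suite =====
conda install -c bioconda meme

# ===== MEME-ChIP: end-to-end motif analysis of ChIP-seq peaks =====
# Combines MEME + STREME + CentriMo + TOMTOM + FIMO

# Step 1: extract peak sequences
# -fi: genome FASTA; -bed: peak file; -fo: output FASTA
bedtools getfasta \
  -fi hg38.fa \
  -bed peaks.bed \
  -fo peak_sequences.fasta

# Step 2: run MEME-ChIP (comments after a trailing "\" would break the command)
#   -oc                 output directory (overwritten if it exists)
#   -db                 motif database in MEME format (repeatable; here JASPAR and HOCOMOCO)
#   -meme-maxw 15       maximum motif width for MEME
#   -meme-nmotifs 10    number of motifs MEME looks for
#   -streme-nmotifs 10  number of motifs STREME looks for
#   -ccut 200           trim longer sequences to their central 200 bp
meme-chip \
  -oc meme_chip_output/ \
  -db JASPAR2024_CORE.meme \
  -db HOCOMOCO_v11.meme \
  -meme-maxw 15 \
  -meme-nmotifs 10 \
  -streme-nmotifs 10 \
  -ccut 200 \
  peak_sequences.fasta

# MEME-ChIP output:
# meme_chip_output/meme-chip.html    → combined results page
# meme_chip_output/combined.meme     → all discovered motifs
# meme_chip_output/centrimo_out/     → central enrichment analysis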

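3.2 Using the tools individually

# Options are described above each command: a comment after a trailing "\" would break the continuation.

# ===== MEME: de novo motif discovery =====
# -dna: DNA alphabet; -mod zoops: zero or one occurrence per sequence;
# -nmotifs 5: find 5 motifs; -minw/-maxw: motif width between 6 and 15; -revcomp: search both strands
meme \
  sequences.fasta \
  -oc meme_output/ \
  -dna \
  -mod zoops \
  -nmotifs 5 \
  -minw 6 \
  -maxw 15 \
  -revcomp

# ===== STREME: fast motif discovery (replaces DREME) =====
# --p: primary (target) sequences; --n: control (background) sequences; --nmotifs 10: find 10 motifs
streme \
  --oc streme_output/ \
  --dna \
  --p peak_sequences.fasta \
  --n control_sequences.fasta \
  --nmotifs 10

# ===== FIMO: scan for known motifs =====
# --thresh 1e-4: p-value threshold; --max-stored-scores: cap on the number of stored matches
# Arguments: motif file (MEME format), then the sequences to scan
fimo \
  --oc fimo_output/ \
  --thresh 1e-4 \
  --max-stored-scores 1000000 \
  known_motifs.meme \
  genome.fasta

# ===== TOMTOM: motif comparison (find similar known motifs) =====
# -thresh 0.05: significance threshold; it applies to the q-value unless -evalue is also given
# Arguments: query motifs (newly discovered), then the database of known motifs
tomtom \
  -oc tomtom_output/ \
  -thresh 0.05 \
  novel_motifs.meme \
  JASPAR2024_CORE.meme

# ===== CentriMo: central enrichment analysis =====
# Tests whether a motif is enriched at peak centers (a check on ChIP-seq quality)
# Arguments: peak sequences, then the motif file
centrimo \
  --oc centrimo_output/ \
  peak_sequences.fasta \
  known_motifs.meme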
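
The reasoning behind a central-enrichment check can be illustrated with a simple binomial test: if the best motif site in each peak were placed uniformly at random, only a fraction `window / seq_len` would land in the central window. The sketch below is a simplified illustration in that spirit, not CentriMo's actual algorithm; it ignores edge effects from the motif width, and the positions are made up.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def central_enrichment(site_centers, seq_len, window):
    """Count best sites inside the central window and test against uniform placement.

    site_centers: 0-based center of the best motif site in each sequence
    seq_len:      common length of all sequences
    window:       width of the central region
    """
    lo = (seq_len - window) / 2
    hi = lo + window
    in_window = sum(lo <= c < hi for c in site_centers)
    p_null = window / seq_len  # chance of a uniformly placed site falling in the window
    return in_window, binom_sf(in_window, len(site_centers), p_null)

# Hypothetical best-site centers in eight 200 bp peak sequences
centers = [100, 98, 103, 95, 110, 20, 180, 101]
n_central, p_value = central_enrichment(centers, seq_len=200, window=40)
print(n_central, p_value)
```

A small p-value here means the sites cluster near the peak summits, which is what a direct-binding motif in a good ChIP-seq experiment is expected to do.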

4. Integrated analysis in Python

# ===== Process motif analysis results in Python =====
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ===== Parse HOMER known-motif results =====
def parse_homer_known(filepath):
    """Parse HOMER knownResults.txt and add a -log10(p) column."""
    df = pd.read_csv(filepath, sep="\t")  # tab-delimited table
    # Key columns: Motif Name, P-value, Log P-value, % of Target/Background Sequences with Motif
    # "Log P-value" is a natural log; using it avoids tiny p-values (e.g. 1e-500) underflowing to 0
    df["log10_pval"] = -df["Log P-value"].astype(float) / np.log(10)
    return df.head(20)  # the file is sorted by significance, so these are the top 20 motifs

# Plot the top motifs
def plot_motif_enrichment(df, output="motif_enrichment.png"):
    """Draw a horizontal bar chart of the 20 most enriched motifs."""
    fig, ax = plt.subplots(figsize=(10, 8))

    top20 = df.nlargest(20, "log10_pval")  # 20 most significant motifs
    y_pos = list(range(len(top20)))  # one bar per motif

    ax.barh(y_pos, top20["log10_pval"], color="steelblue")
    ax.set_yticks(y_pos)
    ax.set_yticklabels(top20["Motif Name"], fontsize=8)  # motif names as tick labels
    ax.set_xlabel("-log10(P-value)")
    ax.set_title("Top 20 Enriched Motifs (HOMER)")
    ax.invert_yaxis()  # most significant at the top

    plt.tight_layout()
    plt.savefig(output, dpi=300)
    plt.close(fig)

# ===== Work with motifs using Biopython =====
from Bio import motifs
from Bio.Seq import Seq

# Read a motif file downloaded from JASPAR (MA0139.1 is the CTCF profile)
with open("MA0139.1.jaspar") as f:
    motif = motifs.read(f, "jaspar")

print(f"TF name: {motif.name}")
print(f"Consensus: {motif.consensus}")
print(f"Anticonsensus: {motif.anticonsensus}")

# Build the PWM and PSSM
pwm = motif.counts.normalize(pseudocounts=0.5)  # probabilities; pseudocounts avoid log(0)
pssm = pwm.log_odds()  # log-odds scores (PSSM) against a uniform background

# Scan a sequence (it must be at least as long as the motif)
sequence = Seq("AGCTTCACGTGATCGATCG")
for pos, score in pssm.search(sequence, threshold=5.0):
    # a negative position marks a hit on the reverse strand
    print(f"Position: {pos}, score: {score:.2f}")

# ===== Sequence logo =====
# pip install logomaker
import logomaker

# Build a counts table from the motif (rows = positions, columns = bases)
counts_df = pd.DataFrame(
    {base: list(motif.counts[base]) for base in "ACGT"},
    index=range(len(motif))
)

# Convert counts to information content (bits)
info_df = logomaker.transform_matrix(counts_df, from_type="counts", to_type="information")

# Draw the logo
fig, ax = plt.subplots(figsize=(8, 2.5))
logo = logomaker.Logo(info_df, ax=ax)
ax.set_xlabel("Position")
ax.set_ylabel("Information (bits)")
ax.set_title(f"{motif.name} Motif Logo")
plt.tight_layout()
plt.savefig("motif_logo.png", dpi=300)

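5. Functional validation and integration of TFBSs

# ===== Intersect TFBSs with SNP/GWAS variants =====
import pybedtools

# Find GWAS variants that fall inside predicted TFBSs
gwas_snps = pybedtools.BedTool("gwas_significant_snps.bed")  # genome-wide significant SNPs
# FIMO writes fimo.tsv and fimo.gff (no BED file); bedtools can read the GFF directly
tfbs_regions = pybedtools.BedTool("fimo_output/fimo.gff")  # TFBSs predicted by FIMO

# Intersection: which SNPs lie in a TFBS? (wa/wb keep both the SNP and the TFBS record)
snps_in_tfbs = gwas_snps.intersect(tfbs_regions, wa=True, wb=True)
# Overlap alone does not show that a variant disrupts binding; treat these as candidates
snps_in_tfbs.saveas("snps_disrupting_tfbs.bed")

print(f"Total GWAS SNPs: {gwas_snps.count()}")
print(f"SNPs inside a TFBS: {snps_in_tfbs.count()}")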

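# ===== Multi-omics TFBS analysis =====
# Integrate ChIP-seq + ATAC-seq + RNA-seq

# Step 1: find regions that are both TF-bound and in open chromatin
chipseq_peaks = pybedtools.BedTool("tf_chipseq_peaks.bed")  # ChIP-seq peaks
atacseq_peaks = pybedtools.BedTool("atacseq_peaks.bed")  # ATAC-seq peaks

# u=True reports each ChIP-seq peak once if it overlaps any ATAC-seq peak
active_tfbs = chipseq_peaks.intersect(atacseq_peaks, u=True)
print(f"Active TFBS count: {active_tfbs.count()}")

# Step 2: annotate the genes these TFBSs may regulate
# With HOMER (save active_tfbs to active_tfbs.bed first):
# annotatePeaks.pl active_tfbs.bed hg38 > annotated_tfbs.txt

# Step 3: cross-reference with differentially expressed genes
# Differentially expressed genes near active TFBSs = candidate direct targets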
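
Step 3 is described above without code. One way it could look is sketched below: read the annotation table, keep genes whose TSS lies within a chosen distance of an active TFBS, and intersect them with the differentially expressed gene list. The column names `Gene Name` and `Distance to TSS` are what I would expect in `annotatePeaks.pl` output and should be checked against the actual file; the 50 kb cutoff, the function name, and the toy table are assumptions.

```python
import csv
import io

def candidate_targets(annotated_tsv, de_genes, max_distance=50_000):
    """Return DE genes whose TSS lies within max_distance of an annotated site.

    annotated_tsv: open file (or file-like object) holding the tab-delimited annotation table
    de_genes:      iterable of differentially expressed gene symbols
    """
    reader = csv.DictReader(annotated_tsv, delimiter="\t")
    near = set()
    for row in reader:
        gene = (row.get("Gene Name") or "").strip()
        dist = (row.get("Distance to TSS") or "").strip()
        if not gene or dist in ("", "NA"):
            continue  # site without a usable gene assignment
        if abs(int(float(dist))) <= max_distance:
            near.add(gene)
    return sorted(near & set(de_genes))

# Toy annotation table with hypothetical gene names
toy_table = io.StringIO(
    "PeakID\tChr\tStart\tEnd\tDistance to TSS\tGene Name\n"
    "p1\tchr1\t100\t300\t-1200\tGENE_A\n"
    "p2\tchr1\t900\t1100\t80000\tGENE_B\n"
    "p3\tchr2\t500\t700\tNA\t\n"
)
print(candidate_targets(toy_table, ["GENE_A", "GENE_B", "GENE_C"]))
```

With a real file, pass `open("annotated_tfbs.txt")` in place of the toy table. Nearest-gene assignment is only a heuristic, since enhancers can act on genes further away.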

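6. Common errors and fixes

Error message	Cause	Fix
`HOMER: genome not installed`	Genome package not installed	`perl configureHomer.pl -install hg38`
`MEME: too few sequences`	Too few input sequences	Provide at least 50-100 sequences
`FIMO: no significant hits`	Threshold too strict	Relax --thresh to 1e-3
`No motifs found`	Low signal-to-noise ratio	Filter peaks more stringently
`PWM format error`	Tools use different motif formats	Convert with jaspar2meme (MEME Suite) or a HOMER-to-MEME converter
`Wrong background model`	Default background is unsuitable	Provide matched background sequences (-bg)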
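
Relaxing the FIMO p-value threshold, as the table suggests, also raises the number of matches expected purely by chance. A rough back-of-the-envelope estimate, assuming every scanned position is an independent test under the null model, is positions × strands × threshold. The function name below is hypothetical.

```python
def expected_chance_hits(seq_length, motif_width, p_threshold, both_strands=True):
    """Rough expected number of chance matches at a given p-value threshold."""
    positions = max(0, seq_length - motif_width + 1)  # windows that fit in the sequence
    strands = 2 if both_strands else 1
    return positions * strands * p_threshold

# Scanning about 1 Mb with a 10 bp motif on both strands
for thresh in (1e-4, 1e-3):
    n = expected_chance_hits(1_000_009, 10, thresh)
    print(f"threshold {thresh:g}: about {n:.0f} chance hits per motif")
```

This is why hits from a relaxed threshold are usually filtered further, for example by q-value or by restricting the scan to peak regions.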

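7. Frequently asked interview questions

Q1: How do HOMER and MEME differ?

A: HOMER tests whether a motif is enriched in target versus background sequences, using a hypergeometric or binomial test (findMotifsGenome.pl defaults to binomial; -h selects hypergeometric), and is fast enough for large ChIP-seq datasets. MEME uses expectation maximization (EM) to discover patterns in the input sequences and suits smaller, more careful analyses. HOMER is sensitive to the choice of background, while MEME is limited in how many sequences it can process in reasonable time. A common practice is to run both and prioritize the motifs that both recover.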
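
The hypergeometric enrichment test mentioned in this answer can be written out with the standard library alone. This is a minimal sketch of the statistic, not HOMER's implementation; the counts in the example are made up.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance that at least k of the n target sequences contain the motif,
    when K of the N total sequences (targets + background) contain it."""
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Hypothetical counts: 2000 sequences in total (1000 targets + 1000 background),
# 300 contain the motif, and 220 of those are targets
p = hypergeom_sf(k=220, N=2000, K=300, n=1000)
print(f"enrichment p-value: {p:.3g}")
```

If the motif were unrelated to the targets, about 150 of the 300 motif-containing sequences would be targets, so 220 gives a very small p-value.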

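Q2: How do you verify that a predicted TFBS is actually functional?

A: ① ChIP-seq or CUT&Tag to confirm that the TF binds the site in cells; ② EMSA (gel shift assay) to confirm binding in vitro; ③ a luciferase reporter assay to measure the site's effect on transcription; ④ CRISPR deletion or mutation of the site, followed by measuring expression changes; ⑤ ATAC-seq to confirm that the region lies in open chromatin.

Q3: What should you watch out for in de novo motif discovery?

A: ① Use enough sequences (more than 100 is recommended) but not too many (beyond about 10,000, runs become slow); ② keep sequence lengths uniform (for ChIP-seq, take the central 200 bp); ③ choose the background carefully (matched GC content, comparable genomic regions); ④ mask repeats (to avoid interference from elements such as Alu); ⑤ cross-validate with several tools.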
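
Point ③, GC-matched background, can be illustrated with a simple binning scheme: place every sequence in a GC-content bin and draw one unused background candidate from the same bin for each target. This is only a sketch of the idea with made-up sequences and hypothetical function names; real tools use more careful matching.

```python
def gc_content(seq):
    """Fraction of G/C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0

def gc_bin(seq, n_bins=10):
    """Index of the GC-content bin (0 .. n_bins-1) the sequence falls into."""
    return min(int(gc_content(seq) * n_bins), n_bins - 1)

def pick_gc_matched(targets, candidates, n_bins=10):
    """For each target, take one unused candidate from the same GC bin, if any remain."""
    pool = {}
    for cand in candidates:
        pool.setdefault(gc_bin(cand, n_bins), []).append(cand)
    background = []
    for target in targets:
        bucket = pool.get(gc_bin(target, n_bins), [])
        if bucket:
            background.append(bucket.pop())
    return background

# Toy sequences (hypothetical): one GC-rich and one AT-rich target
targets = ["GGCCGGCCAT", "ATATATATGC"]
candidates = ["GCGCGCGCAA", "AATTAATTCG", "ACGTACGTAC"]
print(pick_gc_matched(targets, candidates))
```

The candidate with 50% GC is left unused because no target falls into its bin, which keeps the background GC distribution close to that of the targets.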

8. Cheat sheet

# ===== TFBS analysis quick reference =====

# HOMER motif discovery
findMotifsGenome.pl peaks.bed hg38 output/ -size 200 -mask -p 8

# HOMER motif scanning
annotatePeaks.pl peaks.bed hg38 -m motifs.motif -nmotifs > counts.txt

# MEME-ChIP
meme-chip -oc output/ -db JASPAR.meme peak_seqs.fasta

# MEME de novo discovery
meme seqs.fasta -oc output/ -dna -mod zoops -nmotifs 5

# STREME fast discovery
streme --oc output/ --p fg.fasta --n bg.fasta --nmotifs 10

# FIMO motif scanning
fimo --thresh 1e-4 motifs.meme genome.fasta

# TOMTOM motif comparison
tomtom novel.meme JASPAR.meme

# Database resources
# JASPAR: https://jaspar.elixir.no/
# HOCOMOCO: https://hocomoco11.autosome.org/
# CIS-BP: http://cisbp.ccbr.utoronto.ca/

# Tool selection
# ChIP-seq motifs → HOMER findMotifsGenome + MEME-ChIP
# Promoter analysis → scan with FIMO against the JASPAR database
# Large-scale scanning → MOODS or PWMScan
# Deep learning → TFBS-Finder (DNABERT, 2026)