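Enhancer Identification and Target Gene Prediction
One-line summary: Enhancers are the genome's "remote controls": they can regulate a gene from tens of kb to more than a Mb away, so identifying enhancers and their target genes is essential for understanding gene regulation and disease mechanisms.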
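Core concepts
| Concept | Plain-language explanation | Importance |
|---|---|---|
| Enhancer | DNA region that regulates gene expression over long distances, like a "remote control" | ⭐⭐⭐⭐⭐ |
| H3K4me1 | Histone modification that marks enhancers | ⭐⭐⭐⭐⭐ |
| H3K27ac | Mark of active enhancers (H3K4me1 + H3K27ac = active enhancer) | ⭐⭐⭐⭐⭐ |
| Super-enhancer | Cluster of especially strong enhancers that regulates cell identity genes | ⭐⭐⭐⭐ |
| Enhancer-target gene pair | Which gene a given enhancer regulates | ⭐⭐⭐⭐⭐ |
| eQTL | Expression quantitative trait locus: a genetic variant associated with gene expression | ⭐⭐⭐ |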
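1. Enhancer identification strategies
Enhancer identification = finding the "remote controls" in the genome
Comparison of identification strategies:
──────────────────────────────────────────────────────────────────────
Method                 Principle                              Reliability
──────────────────────────────────────────────────────────────────────
ChIP-seq               H3K4me1 + H3K27ac combination          ⭐⭐⭐⭐⭐
ATAC-seq               Open chromatin regions                 ⭐⭐⭐⭐
DNase-seq              DNase hypersensitive sites             ⭐⭐⭐⭐
STARR-seq              Large-scale functional screening       ⭐⭐⭐⭐⭐
CRISPRi                Functional validation (perturbation)   ⭐⭐⭐⭐⭐
Conservation analysis  Cross-species sequence conservation    ⭐⭐⭐
──────────────────────────────────────────────────────────────────────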
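Enhancer states:
Active enhancer → H3K4me1+ H3K27ac+ (a remote control in use)
Primed enhancer → H3K4me1+ H3K27ac- (batteries in, button not pressed)
Poised enhancer → H3K4me1+ H3K27me3+ (a locked remote control)
Silent enhancer → no histone marks (a broken remote control)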
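The state definitions above can be read as a small decision rule. The following is a minimal sketch, assuming each mark has already been reduced to a present/absent call for a region; the function name and the "unclassified" fallback for combinations the list does not cover are illustrative choices, not part of the original scheme.

```python
def classify_enhancer_state(h3k4me1: bool, h3k27ac: bool, h3k27me3: bool) -> str:
    """Map presence/absence calls for three marks to an enhancer state."""
    if not h3k4me1:
        # "Silent" is defined above as carrying no marks at all
        return "silent" if not (h3k27ac or h3k27me3) else "unclassified"
    if h3k27ac and h3k27me3:
        # Conflicting active and repressive marks: not covered by the scheme
        return "unclassified"
    if h3k27ac:
        return "active"
    if h3k27me3:
        return "poised"
    return "primed"


if __name__ == "__main__":
    for marks in [(True, True, False), (True, False, False),
                  (True, False, True), (False, False, False)]:
        print(marks, "->", classify_enhancer_state(*marks))
```

In practice the boolean calls would come from peak overlaps, so the result depends on the peak-calling thresholds used for each mark.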
2. Identifying enhancers from ChIP-seq/ATAC-seq
#!/usr/bin/env Rscript
# Enhancer identification from H3K4me1 and H3K27ac ChIP-seq peaks
library(GenomicRanges)  # genomic interval operations
library(rtracklayer)    # import/export of BED and BigWig files

# ========== Read peak files ==========
# narrowPeak is BED6 plus four extra columns; declare them explicitly,
# otherwise import(..., format = "BED") cannot parse the file
narrowpeak_cols <- c(signalValue = "numeric", pValue = "numeric",
                     qValue = "numeric", peak = "integer")
# H3K4me1 peaks (enhancer mark)
k4me1_peaks <- import("H3K4me1_peaks.narrowPeak", format = "BED",
                      extraCols = narrowpeak_cols)
cat("H3K4me1 peaks:", length(k4me1_peaks), "\n")
# H3K27ac peaks (activity mark)
k27ac_peaks <- import("H3K27ac_peaks.narrowPeak", format = "BED",
                      extraCols = narrowpeak_cols)
cat("H3K27ac peaks:", length(k27ac_peaks), "\n")
# H3K4me3 peaks (promoter mark)
k4me3_peaks <- import("H3K4me3_peaks.narrowPeak", format = "BED",
                      extraCols = narrowpeak_cols)
cat("H3K4me3 peaks:", length(k4me3_peaks), "\n")

# ========== Identify active enhancers ==========
# Active enhancer = H3K4me1+ AND H3K27ac+ AND NOT H3K4me3+
# (has the enhancer mark and the activity mark, but is not a promoter)
# Step 1: H3K4me1 peaks that overlap an H3K27ac peak
enhancer_candidates <- subsetByOverlaps(k4me1_peaks, k27ac_peaks)
cat("H3K4me1+H3K27ac+:", length(enhancer_candidates), "\n")
# Step 2: drop promoter-like regions (those overlapping H3K4me3)
active_enhancers <- enhancer_candidates[
  !overlapsAny(enhancer_candidates, k4me3_peaks)
]
cat("Active enhancers:", length(active_enhancers), "\n")
# Step 3: drop regions near a TSS (±2 kb), which are treated as promoter-proximal
# Assumes hg38_tss.bed holds one 1-bp interval per annotated TSS
tss <- import("hg38_tss.bed", format = "BED")
tss_extended <- resize(tss, width = 4000, fix = "center")  # TSS ± 2 kb
active_enhancers <- active_enhancers[
  !overlapsAny(active_enhancers, tss_extended)
]
cat("Active enhancers after TSS filtering:", length(active_enhancers), "\n")

# ========== Save results ==========
export(active_enhancers, "active_enhancers.bed", format = "BED")
cat("Active enhancers written to active_enhancers.bed\n")
# ========== Identify primed enhancers ==========
# Primed enhancer = H3K4me1+ AND H3K27ac- AND NOT H3K4me3+
# ("poised" is reserved for H3K4me1+ H3K27me3+ and would need H3K27me3 peaks)
primed_enhancers <- k4me1_peaks[
  !overlapsAny(k4me1_peaks, k27ac_peaks) &  # no H3K27ac
  !overlapsAny(k4me1_peaks, k4me3_peaks) &  # no H3K4me3
  !overlapsAny(k4me1_peaks, tss_extended)   # not near a TSS
]
cat("Primed enhancers:", length(primed_enhancers), "\n")
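The filtering above is a sequence of interval set operations: keep H3K4me1 peaks, discard anything overlapping H3K4me3 or a TSS window, then split the remainder by H3K27ac overlap. The following is a minimal pure-Python sketch of that logic on toy intervals, written only to make the steps easy to check by hand. All names and coordinates are made up for illustration, and the quadratic overlap scan is not meant for genome-scale data.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlaps_any(query: Interval, subjects: List[Interval]) -> bool:
    return any(overlaps(query, s) for s in subjects)


def tss_windows(tss_sites: List[Tuple[str, int]], flank: int = 2000) -> List[Interval]:
    """Expand each TSS position to a +/- flank window, clipped at 0."""
    return [(chrom, max(0, pos - flank), pos + flank) for chrom, pos in tss_sites]


def split_enhancers(k4me1, k27ac, k4me3, tss_sites, flank=2000):
    """Return (active, primed) H3K4me1 peaks after promoter filtering."""
    excluded = k4me3 + tss_windows(tss_sites, flank)
    distal = [p for p in k4me1 if not overlaps_any(p, excluded)]
    active = [p for p in distal if overlaps_any(p, k27ac)]
    primed = [p for p in distal if not overlaps_any(p, k27ac)]
    return active, primed


# Toy data (hypothetical coordinates)
k4me1 = [("chr1", 1000, 2000), ("chr1", 10000, 11000),
         ("chr1", 50000, 51000), ("chr1", 80000, 81000)]
k27ac = [("chr1", 1500, 2500), ("chr1", 50500, 51500)]
k4me3 = [("chr1", 900, 1600)]   # removes the first H3K4me1 peak
tss = [("chr1", 81500)]         # its window removes the fourth peak

active, primed = split_enhancers(k4me1, k27ac, k4me3, tss)
print("active:", active)
print("primed:", primed)
```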
3. Enhancer-target gene prediction
#!/usr/bin/env python3
"""Enhancer-target gene prediction (several methods)"""
import pandas as pd  # data handling
import numpy as np   # numerical computing
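# ========== Method 1: nearest gene (simplest) ==========
def nearest_gene_method(enhancers_bed, genes_bed, max_distance=500000):
    """Assign each enhancer to its nearest gene TSS.

    Assumes genes_bed lists TSS positions with the gene name in column 4.
    """
    import pybedtools  # BED file operations
    enh = pybedtools.BedTool(enhancers_bed).sort()  # closest() needs sorted input
    genes = pybedtools.BedTool(genes_bed).sort()
    # The gene columns follow the enhancer columns, so the gene name sits at
    # (number of enhancer columns + 3) rather than at a fixed index
    name_idx = enh.field_count() + 3
    # d=True appends the distance to the nearest feature as the last column
    closest = enh.closest(genes, d=True)
    results = []
    for feat in closest:  # one row per enhancer-gene pair
        fields = feat.fields
        distance = int(fields[-1])
        # bedtools reports -1 when no gene exists on that chromosome
        if 0 <= distance <= max_distance:
            results.append({
                "enhancer_chr": fields[0],
                "enhancer_start": int(fields[1]),
                "enhancer_end": int(fields[2]),
                # the last field is the distance, so a name column must precede it
                "gene_name": fields[name_idx] if len(fields) > name_idx + 1 else "unknown",
                "distance": distance
            })
    df = pd.DataFrame(results)
    print(f"Nearest-gene method predicted {len(df)} enhancer-target gene pairs")
    return df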
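To make the nearest-gene idea concrete without depending on bedtools, here is a minimal standard-library sketch that uses binary search over sorted TSS positions. It is an illustration under stated assumptions rather than a replacement for the function above: distance is measured from the enhancer midpoint to the TSS (bedtools measures edge to edge), and the function name and toy gene names are hypothetical.

```python
import bisect
from collections import defaultdict


def nearest_tss(enhancers, tss_list, max_distance=500_000):
    """Pair each enhancer with the closest TSS on the same chromosome.

    enhancers: iterable of (chrom, start, end)
    tss_list:  iterable of (chrom, position, gene_name)
    Returns a list of (enhancer_id, gene_name, distance).
    """
    by_chrom = defaultdict(list)
    for chrom, pos, gene in tss_list:
        by_chrom[chrom].append((pos, gene))
    for sites in by_chrom.values():
        sites.sort()

    pairs = []
    for chrom, start, end in enhancers:
        sites = by_chrom.get(chrom)
        if not sites:
            continue  # no annotated TSS on this chromosome
        mid = (start + end) // 2
        # The nearest TSS is either just left or just right of the midpoint
        i = bisect.bisect_left(sites, (mid,))
        candidates = sites[max(0, i - 1): i + 1]
        pos, gene = min(candidates, key=lambda s: abs(s[0] - mid))
        dist = abs(pos - mid)
        if dist <= max_distance:
            pairs.append((f"{chrom}:{start}-{end}", gene, dist))
    return pairs


if __name__ == "__main__":
    toy_enhancers = [("chr1", 1000, 2000), ("chr1", 900000, 901000), ("chr2", 100, 200)]
    toy_tss = [("chr1", 5000, "GENE_A"), ("chr1", 100000, "GENE_B")]
    for pair in nearest_tss(toy_enhancers, toy_tss):
        print(pair)
```

Because the nearest gene is often not the true target, the output is best treated as a baseline to compare against contact-based or correlation-based predictions.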
# ========== Method 2: Hi-C interaction-based prediction ==========
def hic_based_prediction(enhancers, genes, hic_loops):
    """
    Link enhancers to genes with Hi-C chromatin loops.
    Rationale: if an enhancer and a gene's promoter sit in the two anchors
    of the same Hi-C loop, they may be in a regulatory relationship.
    """
    results = []
    for _, loop in hic_loops.iterrows():  # iterate over Hi-C loops
        anchor1 = (loop["chr1"], loop["x1"], loop["x2"])  # loop anchor 1
        anchor2 = (loop["chr2"], loop["y1"], loop["y2"])  # loop anchor 2
        # Test both orientations: enhancer in one anchor, gene TSS in the other
        for enh_anchor, gene_anchor in ((anchor1, anchor2), (anchor2, anchor1)):
            for _, enh in enhancers.iterrows():
                # require the enhancer to lie entirely inside the anchor
                if not (enh["chr"] == enh_anchor[0]
                        and enh["start"] >= enh_anchor[1]
                        and enh["end"] <= enh_anchor[2]):
                    continue
                for _, gene in genes.iterrows():
                    if gene["chr"] == gene_anchor[0] and gene_anchor[1] <= gene["tss"] <= gene_anchor[2]:
                        results.append({
                            "enhancer": f"{enh['chr']}:{enh['start']}-{enh['end']}",
                            "gene": gene["name"],
                            "distance": abs(gene["tss"] - (enh["start"] + enh["end"]) // 2),
                            "method": "Hi-C loop"
                        })
    df = pd.DataFrame(results, columns=["enhancer", "gene", "distance", "method"])
    print(f"Hi-C method predicted {len(df)} enhancer-target gene pairs")
    return df
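A toy case helps show why both loop orientations matter: an enhancer can fall in the second anchor while the gene's TSS falls in the first. The sketch below is a self-contained, standard-library variant of the same idea. It assumes plain dictionaries with the same keys as the DataFrame columns above, uses overlap rather than full containment for the enhancer test, and de-duplicates pairs supported by several loops; the names and coordinates are invented for illustration.

```python
def link_by_loops(enhancers, genes, loops):
    """Return sorted (enhancer_id, gene_name) pairs joined by a loop.

    enhancers: dicts with keys chr, start, end
    genes:     dicts with keys chr, tss, name
    loops:     dicts with keys chr1, x1, x2, chr2, y1, y2
    """
    links = set()  # a set drops pairs supported by more than one loop
    for loop in loops:
        anchor1 = (loop["chr1"], loop["x1"], loop["x2"])
        anchor2 = (loop["chr2"], loop["y1"], loop["y2"])
        # An enhancer may sit in either anchor, so test both orientations
        for enh_anchor, gene_anchor in ((anchor1, anchor2), (anchor2, anchor1)):
            enh_hits = [e for e in enhancers
                        if e["chr"] == enh_anchor[0]
                        and e["start"] < enh_anchor[2]
                        and enh_anchor[1] < e["end"]]
            gene_hits = [g for g in genes
                         if g["chr"] == gene_anchor[0]
                         and gene_anchor[1] <= g["tss"] <= gene_anchor[2]]
            for e in enh_hits:
                for g in gene_hits:
                    links.add((f"{e['chr']}:{e['start']}-{e['end']}", g["name"]))
    return sorted(links)


if __name__ == "__main__":
    toy_loops = [{"chr1": "chr1", "x1": 10000, "x2": 15000,
                  "chr2": "chr1", "y1": 100000, "y2": 105000}]
    toy_enhancers = [{"chr": "chr1", "start": 102000, "end": 103000}]  # in anchor 2
    toy_genes = [{"chr": "chr1", "tss": 12000, "name": "GENE_A"},      # in anchor 1
                 {"chr": "chr1", "tss": 500000, "name": "GENE_B"}]     # outside the loop
    print(link_by_loops(toy_enhancers, toy_genes, toy_loops))
```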
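# ========== Method 3: correlation-based prediction ==========
def correlation_based_prediction(enhancer_signals, gene_expression, threshold=0.6):
    """
    Predict target genes from the correlation between enhancer activity
    and gene expression across samples.
    Rationale: if enhancer activity and gene expression are strongly
    correlated, the enhancer may regulate the gene.
    Both inputs are DataFrames with features as rows and the same samples,
    in the same order, as columns.
    """
    from scipy.stats import pearsonr
    results = []
    for enh_id in enhancer_signals.index:  # iterate over enhancers
        enh_signal = enhancer_signals.loc[enh_id].values  # enhancer signal across samples
        for gene_id in gene_expression.index:  # iterate over genes
            gene_expr = gene_expression.loc[gene_id].values  # gene expression across samples
            r, p = pearsonr(enh_signal, gene_expr)  # Pearson correlation
            # |r| above the threshold and nominally significant; p is not
            # corrected for the many enhancer-gene tests performed here
            if abs(r) > threshold and p < 0.05:
                results.append({
                    "enhancer": enh_id,
                    "gene": gene_id,
                    "correlation": r,
                    "p_value": p,
                    "method": "correlation"
                })
    # Explicit columns keep sort_values from failing when nothing passes the filter
    df = pd.DataFrame(results, columns=["enhancer", "gene", "correlation", "p_value", "method"])
    df = df.sort_values("correlation", ascending=False)
    print(f"Correlation method predicted {len(df)} enhancer-target gene pairs")
    return df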
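Testing every enhancer against every gene produces a very large number of p-values, so a fixed cutoff of 0.05 lets through many false positives. One common remedy is the Benjamini-Hochberg procedure, which converts raw p-values into adjusted values that control the false discovery rate. The sketch below is a minimal standard-library implementation; applying it to the `p_value` column of the result is a suggestion of this example, not something the original workflow does.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

Pairs would then be kept when the adjusted value falls below the chosen false discovery rate, for example 0.05.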
4. Commonly used enhancer databases
# ========== Enhancer database lookup ==========
databases = {
    "ENCODE cCRE": {
        "url": "https://screen.encodeproject.org/",
        "description": "ENCODE candidate cis-regulatory elements: enhancers, promoters, insulators",
    },
    "FANTOM5": {
        "url": "https://fantom.gsc.riken.jp/",
        "description": "CAGE-based atlas of enhancer activity",
    },
    "ENdb": {
        "url": "http://www.enhancerdb.org/",
        "description": "Database of experimentally validated enhancers",
    },
    "EnhancerAtlas 2.0": {
        "url": "http://www.enhanceratlas.org/",
        "description": "Enhancer atlas for human and mouse",
    },
    "GeneHancer": {
        "url": "https://genecards.weizmann.ac.il/geneloc/",
        "description": "Enhancer-gene association database (integrates multiple data types)",
    },
    "SEdb 2.0": {
        "url": "http://www.licpathway.net/sedb/",
        "description": "Super-enhancer database",
    },
}
for name, info in databases.items():
    print(f"{name:<25} → {info['url']}")
    print(f"    {info['description']}")
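Common errors and fixes
| Error message | Cause | Fix |
|---|---|---|
| No enhancers found | Poor peak quality or thresholds that are too strict | Check the ChIP-seq data quality |
| Too many enhancer-gene pairs | Distance threshold is too large | Lower the maximum distance |
| BED format error | Wrong number of columns or malformed BED file | Confirm the file follows the standard BED format |
| Missing H3K27ac data | The activity mark is not available | H3K27ac is the minimum needed to separate active from primed enhancers |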
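For the BED format row in particular, a quick pre-check can locate the offending lines before any peak tool is run. The following is a minimal sketch that tests only the three required columns (tab-separated, integer start and end, start not after end) and skips header lines. The function names are hypothetical, and it does not validate the optional columns.

```python
def validate_bed_line(line, min_fields=3):
    """Return None if the line looks like valid BED, else an error message."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < min_fields:
        return f"expected at least {min_fields} tab-separated fields, found {len(fields)}"
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return "start and end must be integers"
    if start < 0 or end < start:
        return "start must be non-negative and must not exceed end"
    return None


def find_bed_errors(lines):
    """Yield (line_number, message) for each malformed data line."""
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and the header lines BED files may carry
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        message = validate_bed_line(line)
        if message is not None:
            yield number, message


if __name__ == "__main__":
    sample = ["track name=example\n", "chr1\t100\t200\n", "chr1 300 400\n", "chr1\t500\tabc\n"]
    for number, message in find_bed_errors(sample):
        print(f"line {number}: {message}")
```

A frequent cause is a file saved with spaces instead of tabs, which this check reports as too few fields.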
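Cheat sheet
========================================
Enhancer identification and target gene prediction: cheat sheet
========================================
[Enhancer identification criteria]
Active enhancer → H3K4me1+ H3K27ac+ H3K4me3-
Primed enhancer → H3K4me1+ H3K27ac- H3K4me3-
Poised enhancer → H3K4me1+ H3K27me3+
Promoter (not an enhancer) → H3K4me3+
[Target gene prediction methods]
Nearest gene → simplest, but low accuracy
Hi-C/4C → based on 3D contacts, fairly reliable
eQTL → genetic variant-expression association
Correlation analysis → signal correlation across samples
ABC model → Activity-by-Contact (recommended)
[Common databases]
ENCODE cCRE → screen.encodeproject.org
FANTOM5 → fantom.gsc.riken.jp
EnhancerAtlas → enhanceratlas.org
GeneHancer → genecards.weizmann.ac.il
[Interview questions]
Q: How do you tell an enhancer from a promoter?
A: Promoters carry strong H3K4me3; enhancers carry H3K4me1 with little or no H3K4me3
Q: How does an enhancer regulate a distal gene?
A: Through chromatin loops that bring it into contact with the promoter in 3D space
Q: What is a super-enhancer?
A: A very large regulatory region formed by a cluster of enhancers, regulating cell identity genes
========================================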
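The ABC model listed in the cheat sheet scores an enhancer-gene pair as the enhancer's activity multiplied by its contact frequency with the gene's promoter, divided by the sum of the same product over all candidate elements near that gene (Fulco et al. 2019). The sketch below shows only this normalization step for a single gene. It assumes activity and contact values have already been computed; the input layout and toy numbers are illustrative.

```python
def abc_scores(elements):
    """Compute Activity-by-Contact scores for the candidate elements of one gene.

    elements: list of (element_id, activity, contact) tuples
    Returns a dict mapping element_id to its share of the summed
    activity x contact product, so the scores sum to 1.
    """
    products = {eid: activity * contact for eid, activity, contact in elements}
    total = sum(products.values())
    if total == 0:
        # No element has any activity x contact signal for this gene
        return {eid: 0.0 for eid in products}
    return {eid: value / total for eid, value in products.items()}


if __name__ == "__main__":
    # Hypothetical elements: (id, activity, contact with the gene's promoter)
    toy = [("e1", 10.0, 0.5), ("e2", 5.0, 0.2), ("e3", 4.0, 1.0)]
    for eid, score in abc_scores(toy).items():
        print(eid, round(score, 3))
```

In the toy input, e3 has the lowest activity after e2 but the highest contact, so it scores close to e1; this is the behaviour that separates ABC from ranking by signal alone.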
References: ENCODE | FANTOM5 | Heintzman et al. Nature 2009 | Fulco et al. Nature Genetics 2019 (ABC model)