Enhancer Identification and Target Gene Prediction

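In one sentence: an enhancer is a genomic "remote control" that can regulate a gene's expression from tens of kilobases to a megabase or more away; identifying enhancers and their target genes is essential for understanding gene regulation and disease mechanisms.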

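Core concepts

──────────────────────────────────────────────────────────────────────────────────────────────────────
Concept                     Plain-language explanation                                     Importance
──────────────────────────────────────────────────────────────────────────────────────────────────────
Enhancer                    DNA region regulating gene expression at a distance            ⭐⭐⭐⭐⭐
H3K4me1                     Histone modification that marks enhancers                      ⭐⭐⭐⭐⭐
H3K27ac                     Active-enhancer mark (H3K4me1 + H3K27ac = active enhancer)     ⭐⭐⭐⭐⭐
Super-enhancer              Cluster of unusually strong enhancers for cell-identity genes  ⭐⭐⭐⭐
Enhancer-target gene pair   Which gene a given enhancer regulates                          ⭐⭐⭐⭐⭐
eQTL                        Expression quantitative trait locus: variant affecting         ⭐⭐⭐
                            gene expression
──────────────────────────────────────────────────────────────────────────────────────────────────────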

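1. Enhancer identification strategies

Enhancer identification = finding the "remote controls" in the genome

Comparison of strategies:
────────────────────────────────────────────────────────────────────────────
Method                 Principle                                Reliability
────────────────────────────────────────────────────────────────────────────
ChIP-seq               H3K4me1 + H3K27ac combination            ⭐⭐⭐⭐⭐
ATAC-seq               Open chromatin regions                   ⭐⭐⭐⭐
DNase-seq              DNase hypersensitive sites               ⭐⭐⭐⭐
STARR-seq              Large-scale functional screen            ⭐⭐⭐⭐⭐
CRISPRi                Functional validation (perturbation)     ⭐⭐⭐⭐⭐
Conservation analysis  Cross-species sequence conservation      ⭐⭐⭐
────────────────────────────────────────────────────────────────────────────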

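Enhancer states:
  Active enhancer → H3K4me1+ H3K27ac+ (a remote control in use)
  Primed enhancer → H3K4me1+ H3K27ac- (batteries in, button not pressed)
  Poised enhancer → H3K4me1+ H3K27me3+ (a locked remote control)
  Silent enhancer → no modification marks (a broken remote control)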
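
The classification above is effectively a small decision table. A minimal Python sketch, assuming each candidate region has already been reduced to presence/absence calls for the three marks. The function name `classify_enhancer_state`, the "unclassified" label, and the choice to let H3K27ac take precedence when H3K27me3 is also called are illustrative assumptions, not part of the scheme above.

```python
def classify_enhancer_state(h3k4me1: bool, h3k27ac: bool, h3k27me3: bool) -> str:
    """Map mark presence/absence calls to the enhancer states listed above."""
    if not h3k4me1:
        # No enhancer mark: silent if nothing else is present, otherwise the
        # combination is outside the scheme (assumption: label it explicitly).
        return "silent" if not (h3k27ac or h3k27me3) else "unclassified"
    if h3k27ac:
        return "active"   # H3K4me1+ H3K27ac+
    if h3k27me3:
        return "poised"   # H3K4me1+ H3K27me3+
    return "primed"       # H3K4me1+ only

# Toy calls: (H3K4me1, H3K27ac, H3K27me3)
for marks in [(True, True, False), (True, False, False),
              (True, False, True), (False, False, False)]:
    print(marks, "->", classify_enhancer_state(*marks))
```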

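2. Identifying enhancers from ChIP-seq/ATAC-seq data

#!/usr/bin/env Rscript
# Enhancer identification from H3K4me1 and H3K27ac ChIP-seq data

library(GenomicRanges)  # genomic interval operations
library(rtracklayer)    # read/write BED and BigWig files

# ========== Read peak files ==========
# narrowPeak is BED6+4, so the four extra columns are declared explicitly;
# a plain format = "BED" import does not know how to parse them
narrowpeak_cols <- c(signalValue = "numeric", pValue = "numeric",
                     qValue = "numeric", peak = "integer")

# H3K4me1 peaks (enhancer mark)
k4me1_peaks <- import("H3K4me1_peaks.narrowPeak", format = "BED",
                      extraCols = narrowpeak_cols)
cat("H3K4me1 peaks:", length(k4me1_peaks), "\n")

# H3K27ac peaks (active mark)
k27ac_peaks <- import("H3K27ac_peaks.narrowPeak", format = "BED",
                      extraCols = narrowpeak_cols)
cat("H3K27ac peaks:", length(k27ac_peaks), "\n")

# H3K4me3 peaks (promoter mark)
k4me3_peaks <- import("H3K4me3_peaks.narrowPeak", format = "BED",
                      extraCols = narrowpeak_cols)
cat("H3K4me3 peaks:", length(k4me3_peaks), "\n")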

# ========== Identify active enhancers ==========
# Active enhancer = H3K4me1+ AND H3K27ac+ AND NOT H3K4me3+
# (carries the enhancer mark and the active mark, but is not a promoter)

# Step 1: keep H3K4me1 peaks that overlap an H3K27ac peak
enhancer_candidates <- subsetByOverlaps(k4me1_peaks, k27ac_peaks)  # both marks present
cat("H3K4me1+H3K27ac+:", length(enhancer_candidates), "\n")

# Step 2: drop promoter regions (peaks overlapping H3K4me3)
active_enhancers <- enhancer_candidates[
    !overlapsAny(enhancer_candidates, k4me3_peaks)  # exclude promoters
]
cat("Active enhancers:", length(active_enhancers), "\n")

# Step 3: drop regions near a TSS (±2 kb; promoter-proximal regions are not counted as enhancers)
# Read the gene TSS annotation (assumed to hold 1-bp TSS positions)
tss <- import("hg38_tss.bed", format = "BED")
tss_extended <- resize(tss, width = 4000, fix = "center")  # TSS ± 2 kb

active_enhancers <- active_enhancers[
    !overlapsAny(active_enhancers, tss_extended)  # exclude TSS-proximal peaks
]
cat("Active enhancers after TSS filtering:", length(active_enhancers), "\n")

# ========== Save results ==========
export(active_enhancers, "active_enhancers.bed", format = "BED")
cat("Active enhancers written to active_enhancers.bed\n")

# ========== Identify primed enhancers ==========
# Primed enhancer = H3K4me1+ AND H3K27ac- AND NOT H3K4me3+
# ("poised" is reserved for H3K4me1+ H3K27me3+, which needs H3K27me3 data)
primed_enhancers <- k4me1_peaks[
    !overlapsAny(k4me1_peaks, k27ac_peaks) &  # no H3K27ac
    !overlapsAny(k4me1_peaks, k4me3_peaks) &   # no H3K4me3
    !overlapsAny(k4me1_peaks, tss_extended)     # not near a TSS
]
cat("Primed enhancers:", length(primed_enhancers), "\n")
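
For readers working in Python, the same filtering logic (H3K4me1 AND H3K27ac AND NOT H3K4me3 AND NOT near a TSS) can be sketched on plain tuples. This is an illustration on made-up coordinates, not a replacement for the GenomicRanges script above. The helper names `overlaps`, `overlaps_any` and `active_enhancers` are hypothetical, and intervals are treated as half-open, BED-style (chrom, start, end).

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED

def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def overlaps_any(query: Interval, subjects: List[Interval]) -> bool:
    """True if the query overlaps any subject (linear scan, fine for toy data)."""
    return any(overlaps(query, s) for s in subjects)

def active_enhancers(k4me1, k27ac, k4me3, tss_windows):
    """Keep H3K4me1 peaks with H3K27ac, without H3K4me3, and away from TSS windows."""
    return [
        peak for peak in k4me1
        if overlaps_any(peak, k27ac)
        and not overlaps_any(peak, k4me3)
        and not overlaps_any(peak, tss_windows)
    ]

# Toy data (made-up coordinates)
k4me1 = [("chr1", 1000, 2000), ("chr1", 50000, 51000), ("chr1", 90000, 91000)]
k27ac = [("chr1", 1500, 2500), ("chr1", 50500, 51500)]
k4me3 = [("chr1", 900, 1600)]
tss_windows = [("chr1", 0, 4000)]

result = active_enhancers(k4me1, k27ac, k4me3, tss_windows)
print(result)
```

Only the second peak survives: the first overlaps both H3K4me3 and a TSS window, and the third has no H3K27ac. On real peak sets a linear scan is too slow, which is one reason the R script relies on GenomicRanges.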

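3. Enhancer-target gene prediction

#!/usr/bin/env python3
"""Enhancer-target gene prediction (several methods)"""

import pandas as pd  # tabular data handling
import numpy as np  # numerical computing

# ========== Method 1: nearest gene (simplest) ==========
def nearest_gene_method(enhancers_bed, genes_bed, max_distance=500000):
    """Assign each enhancer to the nearest gene TSS.

    genes_bed is expected to hold TSS positions with the gene name in column 4.
    """
    import pybedtools  # BED file operations

    # bedtools closest requires coordinate-sorted input
    enh = pybedtools.BedTool(enhancers_bed).sort()  # load enhancers
    genes = pybedtools.BedTool(genes_bed).sort()  # load genes

    # Gene columns follow the enhancer columns, so the gene name (4th gene
    # column) sits at this index regardless of how many columns the enhancer BED has
    name_idx = enh.field_count() + 3

    # Find the nearest gene; d=True appends the distance as the last column
    closest = enh.closest(genes, d=True)

    results = []
    for feat in closest:  # one enhancer-gene pair per line
        fields = str(feat).strip().split('\t')
        distance = int(fields[-1])
        # bedtools reports -1 when no gene exists on the same chromosome
        if 0 <= distance <= max_distance:  # enforce the maximum distance
            results.append({
                "enhancer_chr": fields[0],
                "enhancer_start": int(fields[1]),
                "enhancer_end": int(fields[2]),
                "gene_name": fields[name_idx] if len(fields) > name_idx + 1 else "unknown",
                "distance": distance
            })

    df = pd.DataFrame(results)
    print(f"Nearest-gene method predicted {len(df)} enhancer-target gene pairs")
    return df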
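
To make the nearest-gene rule concrete without bedtools, here is a small standard-library sketch. It assumes TSSs are given as (chrom, position, gene) tuples and measures distance from the enhancer midpoint, which differs slightly from the edge-based distance reported by `bedtools closest`. The function names `build_tss_index` and `nearest_tss`, the coordinates, and the gene names are all hypothetical.

```python
import bisect
from collections import defaultdict

def build_tss_index(tss_list):
    """Group (chrom, pos, gene) tuples by chromosome and sort them by position."""
    index = defaultdict(list)
    for chrom, pos, gene in tss_list:
        index[chrom].append((pos, gene))
    for chrom in index:
        index[chrom].sort()
    return index

def nearest_tss(index, chrom, start, end, max_distance=500_000):
    """Return (gene, distance) for the closest TSS, or None if none is within range."""
    entries = index.get(chrom, [])
    if not entries:
        return None
    mid = (start + end) // 2
    positions = [pos for pos, _ in entries]  # rebuilt per call; cache it for real data
    i = bisect.bisect_left(positions, mid)
    # The closest TSS is either just left or just right of the insertion point
    candidates = entries[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda e: abs(e[0] - mid))
    distance = abs(pos - mid)
    return (gene, distance) if distance <= max_distance else None

# Toy annotation (made-up positions and names)
tss_index = build_tss_index([
    ("chr1", 10_000, "GENE_A"),
    ("chr1", 200_000, "GENE_B"),
    ("chr2", 5_000, "GENE_C"),
])

print(nearest_tss(tss_index, "chr1", 150_000, 151_000))
print(nearest_tss(tss_index, "chr2", 900_000, 901_000))
```

The first call picks GENE_B because the enhancer midpoint is closer to it than to GENE_A; the second returns None because the only TSS on chr2 lies beyond the 500 kb limit.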

# ========== Method 2: prediction from Hi-C interactions ==========
def hic_based_prediction(enhancers, genes, hic_loops):
    """
    Link enhancers to genes using Hi-C chromatin loops.
    Rationale: if an enhancer and a gene's promoter sit in the two anchors of
    the same Hi-C loop, they may have a regulatory relationship.
    """
    results = []

    for _, loop in hic_loops.iterrows():  # iterate over Hi-C loops
        anchor1 = (loop["chr1"], loop["x1"], loop["x2"])  # loop anchor 1
        anchor2 = (loop["chr2"], loop["y1"], loop["y2"])  # loop anchor 2

        # Test both orientations: enhancer in anchor 1 and gene in anchor 2, then the reverse
        for enh_anchor, gene_anchor in ((anchor1, anchor2), (anchor2, anchor1)):
            for _, enh in enhancers.iterrows():
                # The enhancer must lie entirely inside its anchor
                if not (enh["chr"] == enh_anchor[0]
                        and enh["start"] >= enh_anchor[1]
                        and enh["end"] <= enh_anchor[2]):
                    continue
                # Collect genes whose TSS falls in the opposite anchor
                for _, gene in genes.iterrows():
                    if gene["chr"] == gene_anchor[0] and gene_anchor[1] <= gene["tss"] <= gene_anchor[2]:
                        results.append({
                            "enhancer": f"{enh['chr']}:{enh['start']}-{enh['end']}",
                            "gene": gene["name"],
                            "distance": abs(gene["tss"] - (enh["start"] + enh["end"]) // 2),
                            "method": "Hi-C loop"
                        })

    df = pd.DataFrame(results)
    print(f"Hi-C method predicted {len(df)} enhancer-target gene pairs")
    return df

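# ========== Method 3: correlation-based prediction ==========
def correlation_based_prediction(enhancer_signals, gene_expression, threshold=0.6):
    """
    Predict target genes from the correlation between enhancer activity and gene expression.
    Rationale: if enhancer activity tracks a gene's expression across samples,
    the two may have a regulatory relationship.
    """
    from scipy.stats import pearsonr

    results = []
    for enh_id in enhancer_signals.index:  # iterate over enhancers
        enh_signal = enhancer_signals.loc[enh_id].values  # enhancer signal across samples

        for gene_id in gene_expression.index:  # iterate over genes
            gene_expr = gene_expression.loc[gene_id].values  # gene expression across samples

            r, p = pearsonr(enh_signal, gene_expr)  # Pearson correlation

            # |r| above threshold and nominally significant (p is not corrected for multiple testing)
            if abs(r) > threshold and p < 0.05:
                results.append({
                    "enhancer": enh_id,
                    "gene": gene_id,
                    "correlation": r,
                    "p_value": p,
                    "method": "correlation"
                })

    # Fixed column names keep sort_values from failing when no pair passes the filter
    df = pd.DataFrame(results, columns=["enhancer", "gene", "correlation", "p_value", "method"])
    df = df.sort_values("correlation", ascending=False)
    print(f"Correlation method predicted {len(df)} enhancer-target gene pairs")
    return df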
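
Because the correlation method tests every enhancer against every gene, a raw p < 0.05 cutoff can admit many false positives. One common remedy is Benjamini-Hochberg FDR control; this is a suggestion added here, not a step of the pipeline above. A minimal NumPy sketch, with the function name `benjamini_hochberg` chosen for illustration:

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                          # indices that sort p ascending
    ranked = p[order] * n / np.arange(1, n + 1)    # p_(i) * n / i
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)    # restore original order
    return adjusted

# Toy p-values (made up)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(q_values)
```

In practice this would be applied to the `p_value` column of the result table, keeping only pairs whose adjusted value falls below the chosen FDR level.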

4. Commonly used enhancer databases

# ========== Enhancer database lookup ==========
databases = {
    "ENCODE cCRE": {
        "url": "https://screen.encodeproject.org/",
        "description": "ENCODE candidate cis-regulatory elements, covering enhancers, promoters and insulators",
    },
    "FANTOM5": {
        "url": "https://fantom.gsc.riken.jp/",
        "description": "CAGE-based atlas of enhancer activity",
    },
    "ENdb": {
        "url": "http://www.enhancerdb.org/",
        "description": "Database of experimentally validated enhancers",
    },
    "EnhancerAtlas 2.0": {
        "url": "http://www.enhanceratlas.org/",
        "description": "Enhancer atlas for human and mouse",
    },
    "GeneHancer": {
        "url": "https://genecards.weizmann.ac.il/geneloc/",
        "description": "Enhancer-gene association database (integrates multiple data types)",
    },
    "SEdb 2.0": {
        "url": "http://www.licpathway.net/sedb/",
        "description": "Super-enhancer database",
    },
}

# Print each database name with its URL, then its description on the next line
for name, info in databases.items():
    print(f"{name:<25}{info['url']}")
    print(f"  {info['description']}")

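Common errors and fixes

──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error message                  Cause                                       Fix
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
No enhancers found             Poor peak quality or thresholds too strict  Check ChIP-seq data quality
Too many enhancer-gene pairs   Distance threshold too large                Lower the maximum distance
BED format error               Wrong column count or format in BED file    Confirm the file follows standard BED format
Missing H3K27ac data           No active-mark data available               H3K27ac is the minimum needed to separate
                                                                           active from primed enhancers
──────────────────────────────────────────────────────────────────────────────────────────────────────────────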
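
For the "BED format error" case, a quick pre-flight check often locates the offending line. A minimal sketch, assuming tab-separated BED with at least three columns. The function name `validate_bed_lines` is hypothetical, and only the first three columns are checked.

```python
def validate_bed_lines(lines):
    """Return a list of (line_number, problem) for malformed BED records."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines, comments and UCSC header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append((lineno, "fewer than 3 tab-separated columns"))
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append((lineno, "start/end are not integers"))
            continue
        if start < 0 or end < start:
            problems.append((lineno, "start must be >= 0 and not exceed end"))
    return problems

# Toy input: line 2 uses spaces, line 3 has end < start, line 4 has a non-numeric start
example = [
    "chr1\t100\t200\tpeak1\n",
    "chr1 300 400\n",
    "chr2\t500\t450\n",
    "chr2\tabc\t600\n",
]
problems = validate_bed_lines(example)
for lineno, problem in problems:
    print(f"line {lineno}: {problem}")
```

With a real file, the same function can be called on an open file handle, for example `validate_bed_lines(open("active_enhancers.bed"))`.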

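Cheat sheet

========================================
Enhancer Identification and Target Gene Prediction: Cheat Sheet
========================================

[Enhancer identification criteria]
Active enhancer            → H3K4me1+ H3K27ac+ H3K4me3-
Primed enhancer            → H3K4me1+ H3K27ac- H3K4me3-
Poised enhancer            → H3K4me1+ H3K27me3+
Promoter (not an enhancer) → H3K4me3+

[Target gene prediction methods]
Nearest gene          → simplest, but low accuracy
Hi-C/4C               → based on 3D contacts, fairly reliable
eQTL                  → association between genetic variants and expression
Correlation analysis  → signal correlation across samples
ABC model             → Activity-by-Contact (recommended)

[Common databases]
ENCODE cCRE         → screen.encodeproject.org
FANTOM5             → fantom.gsc.riken.jp
EnhancerAtlas       → enhanceratlas.org
GeneHancer          → genecards.weizmann.ac.il

[Interview questions]
Q: How do you tell an enhancer from a promoter?
A: Promoters carry H3K4me3; enhancers carry H3K4me1 with little or no H3K4me3.

Q: How does an enhancer regulate a distal gene?
A: Through a chromatin loop that brings it into 3D contact with the promoter.

Q: What is a super-enhancer?
A: A very large regulatory region formed by a cluster of enhancers; it controls cell-identity genes.
========================================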
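
The ABC model recommended in the cheat sheet scores each enhancer-gene pair as the enhancer's activity multiplied by its contact frequency with the gene's promoter, divided by the same product summed over all candidate elements near that gene (Fulco et al. 2019). A minimal sketch on toy numbers follows. The function name `abc_scores`, the element names, and the input values are hypothetical, and activity and contact are assumed to be precomputed (in the publication, activity is derived from DNase-seq and H3K27ac signal and contact from Hi-C).

```python
def abc_scores(activity, contact):
    """Compute ABC scores for one gene.

    activity: {element: activity A_e}
    contact:  {element: contact frequency C_e with the gene's promoter}
    Returns {element: A_e * C_e / sum over all elements of A * C}.
    """
    products = {e: activity[e] * contact[e] for e in activity}
    total = sum(products.values())
    if total == 0:
        # No element has both activity and contact: every score is zero
        return {e: 0.0 for e in products}
    return {e: value / total for e, value in products.items()}

# Toy inputs for a single gene (made-up values)
activity = {"enh1": 10.0, "enh2": 5.0, "enh3": 1.0}
contact = {"enh1": 0.02, "enh2": 0.10, "enh3": 0.50}

scores = abc_scores(activity, contact)
for element, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{element}: {score:.3f}")
```

The toy numbers show why the model combines both terms: `enh1` has the highest activity but scores lowest because it rarely contacts the promoter, while the weak but frequently contacting `enh3` ties with `enh2`.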

References: ENCODE | FANTOM5 | Heintzman et al. Nature 2009 | Fulco et al. Nature Genetics 2019 (ABC model)