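Predicting Target Genes of Non-coding RNAs
Overview in one sentence: non-coding RNAs (ncRNAs) are not translated into protein but still regulate gene expression. miRNAs act like "small scissors" that degrade mRNA or block its translation, lncRNAs act like "dispatchers" that switch genes on and off, and predicting their target genes is key to understanding gene regulatory networks.
Key concepts
| Concept | Plain-language explanation | Importance |
|---|---|---|
| miRNA | Small RNA of about 22 nt that silences target genes through base pairing | ⭐⭐⭐⭐⭐ |
| lncRNA | Long non-coding RNA (>200 nt) that acts through several regulatory mechanisms | ⭐⭐⭐⭐⭐ |
| Seed sequence | Nucleotides 2-8 from the miRNA 5' end (core seed: 2-7), which pair with complementary sites in the target 3'UTR | ⭐⭐⭐⭐⭐ |
| TargetScan | One of the most widely used miRNA target prediction tools | ⭐⭐⭐⭐⭐ |
| miRDB | Database of miRNA targets predicted by machine learning | ⭐⭐⭐⭐ |
| ceRNA hypothesis | lncRNAs "sponge" miRNAs and thereby regulate target genes indirectly | ⭐⭐⭐⭐ |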
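Part 1: How non-coding RNAs regulate genes

miRNA-mediated regulation (the most common mechanism):

    miRNA (~22 nt) + RISC complex → pairs with the 3'UTR of the target mRNA → mRNA degradation or translational repression

    miRNA (let-7a-5p)  3'- UUGAUAUGUUGGAUGAUGGAGU -5'
                                         |||||||      seed pairing (miRNA positions 8 to 2)
    target 3'UTR       5'- ..............CUACCUC. -3'

Seed match types (from most to least stringent):

    8mer    → pairing to miRNA positions 2-8 plus an A opposite position 1 (most stringent, most reliable)
    7mer-m8 → pairing to positions 2-8, without an A opposite position 1
    7mer-A1 → pairing to positions 2-7 plus an A opposite position 1
    6mer    → pairing to positions 2-7 only (most permissive, many false positives)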
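
The four site types can be written out as the 5'→3' motifs one would look for in a 3'UTR. The following is a minimal sketch; the helper name `seed_site_motifs` is made up for illustration, and it assumes the input is a mature miRNA sequence written 5'→3'.

```python
# Complement table for RNA; translating and then reversing gives the reverse complement
COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement(rna):
    """Reverse complement of an RNA string (A, U, G, C only)."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_site_motifs(mirna):
    """Return the 5'->3' target-site motif for each canonical seed match type.

    Hypothetical helper for illustration; `mirna` is the mature sequence, 5'->3'.
    """
    mirna = mirna.upper().replace("T", "U")
    core = reverse_complement(mirna[1:7])  # pairs with miRNA positions 2-7
    m8 = reverse_complement(mirna[7])      # pairs with miRNA position 8
    return {
        "8mer": m8 + core + "A",   # positions 2-8 plus A opposite position 1
        "7mer-m8": m8 + core,      # positions 2-8
        "7mer-A1": core + "A",     # positions 2-7 plus A opposite position 1
        "6mer": core,              # positions 2-7
    }


motifs = seed_site_motifs("UGAGGUAGUAGGUUGUAUAGUU")  # let-7a-5p
for site_type, motif in motifs.items():
    print(f"{site_type:8s} {motif}")
```

For let-7a-5p the 7mer-m8 motif is `CUACCUC`, the same site shown in the pairing diagram.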
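
lncRNA-mediated regulation (diverse mechanisms):

1. ceRNA (competing endogenous RNA) → sponges miRNAs
2. Epigenetic regulation → recruits chromatin-modifying complexes such as PRC2
3. Transcriptional regulation → interacts with transcription factors
4. mRNA stability → stabilizes or destabilizes mRNA
5. Translational regulation → affects translation by the ribosome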
Part 2: miRNA target prediction
2.1 Online tools

    #!/usr/bin/env python3
    """miRNA target prediction: combining several tools"""
    import pandas as pd  # tabular data handling

    # ========== 1. TargetScan lookup ==========
    def query_targetscan(mirna_name, species="human"):
        """Print where to look up TargetScan targets for a miRNA.

        No request is sent: results have to be downloaded manually (or scraped)
        and then parsed with parse_targetscan(). `species` is currently unused.
        """
        # TargetScan home page: https://www.targetscan.org/
        print(f"TargetScan query: {mirna_name}")
        print(f"Visit: https://www.targetscan.org/cgi-bin/targetscan/vert_80/targetscan.cgi?mirg={mirna_name}")
        print("Download the results, then parse them with parse_targetscan()")

    # ========== 2. miRDB lookup ==========
    def query_mirdb(mirna_name):
        """Print where to look up miRDB target predictions for a miRNA."""
        # Search form endpoint and fields, kept for reference only; no request is sent
        url = "http://mirdb.org/cgi-bin/search.cgi"
        data = {
            "searchType": "miRNA",
            "searchBox": mirna_name,
            "species": "Human"
        }
        print(f"miRDB query: {mirna_name}")
        print(f"Visit: http://mirdb.org/ and search for {mirna_name}")
        print("miRDB scores above 80 are treated as high-confidence predictions")

    # ========== 3. Parse a downloaded TargetScan file ==========
    def parse_targetscan(filename):
        """Parse a tab-separated TargetScan prediction table."""
        df = pd.read_csv(filename, sep="\t")
        # Key columns:
        #   Gene Symbol - target gene name
        #   Cumulative weighted context++ score - overall score (more negative = more reliable)
        #   Total num of conserved sites - number of conserved sites
        #   Total num of poorly conserved sites - number of poorly conserved sites
        score_col = "Cumulative weighted context++ score"
        # Sort ascending so the most negative scores come first
        df_sorted = df.sort_values(score_col)
        # Keep high-confidence targets (score below -0.2)
        high_conf = df_sorted[df_sorted[score_col] < -0.2]
        print(f"High-confidence targets: {len(high_conf)}")
        return high_conf

    # ========== 4. Cross-check across tools ==========
    def cross_validate_targets(targetscan_genes, mirdb_genes, mirtarbase_genes):
        """Return the genes supported by at least two of the three sources."""
        ts_set = set(targetscan_genes)
        mdb_set = set(mirdb_genes)
        mtb_set = set(mirtarbase_genes)
        # Pairwise intersections
        ts_mdb = ts_set & mdb_set
        ts_mtb = ts_set & mtb_set
        mdb_mtb = mdb_set & mtb_set
        # Supported by all three sources
        all_three = ts_set & mdb_set & mtb_set
        # Supported by at least two sources
        at_least_two = ts_mdb | ts_mtb | mdb_mtb
        print(f"TargetScan predictions: {len(ts_set)}")
        print(f"miRDB predictions: {len(mdb_set)}")
        print(f"miRTarBase validated: {len(mtb_set)}")
        print(f"In all three sources: {len(all_three)}")
        print(f"In at least two sources: {len(at_least_two)}")
        return at_least_two
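
When reviewing candidates it can help to see which sources support each gene, not only whether a gene passes the "at least two" cut. The sketch below is one way to tabulate that. The function name `count_tool_support` is hypothetical, and the gene lists are made up for illustration; they are not real tool output.

```python
def count_tool_support(predictions):
    """Map each gene to the sorted list of sources that report it."""
    support = {}
    for tool, genes in predictions.items():
        for gene in set(genes):  # set() so duplicates within one source count once
            support.setdefault(gene, []).append(tool)
    return {gene: sorted(tools) for gene, tools in support.items()}


# Made-up gene lists for illustration only (not real tool output)
predictions = {
    "TargetScan": ["PDCD4", "PTEN", "RECK", "SPRY2"],
    "miRDB": ["PDCD4", "PTEN", "TIMP3"],
    "miRTarBase": ["PDCD4", "RECK"],
}

support = count_tool_support(predictions)
# Rank by number of supporting sources (descending), then by gene name
ranked = sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))
for gene, tools in ranked:
    print(f"{gene}\t{len(tools)}\t{','.join(tools)}")
```

Keeping the source names alongside the count makes it easy to treat experimentally validated entries (miRTarBase) differently from purely computational ones.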
2.2 A do-it-yourself seed matcher

    #!/usr/bin/env python3
    """miRNA target site prediction based on seed matching"""
    import re  # regular expressions

    def reverse_complement(seq):
        """Return the reverse complement of a sequence as RNA (a DNA T is read as U)."""
        complement = {"A": "U", "U": "A", "G": "C", "C": "G",
                      "a": "u", "u": "a", "g": "c", "c": "g",
                      "T": "A", "t": "a"}
        return "".join(complement.get(base, base) for base in reversed(seq))

    def find_seed_matches(mirna_seq, utr_seq, mirna_name="miRNA", gene_name="gene"):
        """
        Search a 3'UTR for canonical miRNA seed match sites.
        Each site is reported once, under its most stringent type:
        8mer > 7mer-m8 > 7mer-A1 > 6mer.
        """
        # Work in uppercase RNA so that DNA input (T) also matches
        mirna = mirna_seq.upper().replace("T", "U")
        utr_rna = utr_seq.upper().replace("T", "U")
        core_rc = reverse_complement(mirna[1:7])  # pairs with miRNA positions 2-7
        m8_base = reverse_complement(mirna[7])    # pairs with miRNA position 8
        matches = []
        # A zero-width lookahead also finds overlapping core matches
        for m in re.finditer(f"(?={core_rc})", utr_rna):
            start, end = m.start(), m.start() + 6
            has_m8 = start > 0 and utr_rna[start - 1] == m8_base
            has_a1 = end < len(utr_rna) and utr_rna[end] == "A"
            if has_m8 and has_a1:
                site_type = "8mer"
            elif has_m8:
                site_type = "7mer-m8"
            elif has_a1:
                site_type = "7mer-A1"
            else:
                site_type = "6mer"
            site_start = start - 1 if has_m8 else start
            site_end = end + 1 if has_a1 else end
            matches.append({
                "type": site_type,
                "position": site_start,  # 0-based start within the UTR
                "matched_seq": utr_rna[site_start:site_end]
            })
        if matches:
            print(f"{mirna_name} → {gene_name}: found {len(matches)} seed match site(s)")
            for m in matches:
                print(f"  {m['type']} at position {m['position']}: {m['matched_seq']}")
        return matches

    # ========== Usage example ==========
    # hsa-miR-21-5p mature sequence
    mir21 = "UAGCUUAUCAGACUGAUGUUGA"
    # Short illustrative 3'UTR fragment, not the full PDCD4 3'UTR
    # (PDCD4 is a known miR-21 target)
    pdcd4_utr = "AGCUUAUCAUUUUAUAUAAGCUA"
    matches = find_seed_matches(mir21, pdcd4_utr, "hsa-miR-21-5p", "PDCD4")
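
A predicted site is usually followed up with a seed-mutant control, for example in a luciferase reporter assay where the mutated 3'UTR should no longer respond to the miRNA. The sketch below shows one simple way to design such a mutant in silico by swapping every base of the seed site for its complement. The function names are hypothetical, and full complement swapping is only one illustrative mutation scheme; real constructs often change fewer bases.

```python
COMPLEMENT = str.maketrans("AUGC", "UACG")


def seed_site(mirna):
    """5'->3' UTR motif that pairs with miRNA positions 2-8 (a 7mer-m8 site)."""
    return mirna[1:8].translate(COMPLEMENT)[::-1]


def mutate_seed_sites(utr, mirna):
    """Replace each seed site in the UTR with its base-by-base complement.

    Hypothetical helper: complementing without reversing changes every
    position of the site, so it can no longer pair with the seed.
    """
    site = seed_site(mirna)
    mutant_site = site.translate(COMPLEMENT)
    return utr.replace(site, mutant_site)


mir21 = "UAGCUUAUCAGACUGAUGUUGA"       # hsa-miR-21-5p
wild_type = "AGCUUAUCAUUUUAUAUAAGCUA"   # toy UTR fragment from the example above
mutant = mutate_seed_sites(wild_type, mir21)

print("seed site :", seed_site(mir21))
print("wild type :", wild_type)
print("mutant    :", mutant)
print("site still present in mutant:", seed_site(mir21) in mutant)
```

Checking that the site is absent from the mutant matters because a replacement could, in principle, create a new match at the junction with the flanking sequence.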
Part 3: lncRNA target prediction

    #!/usr/bin/env python3
    """Approaches to lncRNA target prediction"""

    # ========== Commonly used lncRNA target prediction tools ==========
    lncrna_tools = {
        "ceRNA analysis": {
            "ENCORI/starBase": {
                "url": "https://rnasysu.com/encori/",
                "method": "CLIP-seq data mining",
                "description": "Integrates miRNA-mRNA and miRNA-lncRNA interaction data"
            },
            "miRcode": {
                "url": "http://www.mircode.org/",
                "method": "Seed sequence matching",
                "description": "Predicts miRNA binding sites on lncRNAs"
            },
            "LncBase v3": {
                "url": "https://diana.e-ce.uth.gr/lncbasev3",
                "method": "Experimental + predicted",
                "description": "lncRNA-miRNA interaction database"
            },
        },
        "RNA-RNA interaction": {
            "IntaRNA": {
                "url": "http://rna.informatik.uni-freiburg.de/IntaRNA/",
                "method": "Thermodynamic calculation",
                "description": "Predicts the free energy of RNA-RNA binding"
            },
            "RNAplex": {
                "url": "https://www.tbi.univie.ac.at/RNA/",
                "method": "Fast hybridization prediction",
                "description": "Part of the Vienna RNA package"
            },
        },
        "Co-expression analysis": {
            "WGCNA": {
                "url": "https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/",
                "method": "Weighted gene co-expression network",
                "description": "Finds gene modules co-expressed with a lncRNA"
            },
        }
    }

    # Print the tool list, grouped by category
    for category, tools in lncrna_tools.items():
        print(f"\n{'='*50}")
        print(f"  {category}")
        print(f"{'='*50}")
        for name, info in tools.items():
            print(f"  {name}")
            print(f"    Method: {info['method']}")
            print(f"    URL: {info['url']}")
            print(f"    Description: {info['description']}")
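
The co-expression route can be illustrated without any of the tools above. The sketch below ranks candidate genes by their Pearson correlation with a lncRNA across samples. It is a plain correlation filter, far simpler than WGCNA's weighted network approach. The expression values, gene names, the function `coexpressed_genes`, and the 0.8 cutoff are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_genes(lncrna_expr, gene_expr, min_abs_r=0.8):
    """Return {gene: r} for genes whose |r| with the lncRNA meets the cutoff."""
    hits = {}
    for gene, values in gene_expr.items():
        r = pearson(lncrna_expr, values)
        if abs(r) >= min_abs_r:
            hits[gene] = r
    return hits


# Made-up expression values across six samples (illustration only)
lncrna_expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_expr = {
    "GENE_A": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # rises with the lncRNA
    "GENE_B": [9.0, 7.5, 6.1, 4.4, 3.2, 1.5],   # falls as the lncRNA rises
    "GENE_C": [5.0, 4.8, 5.3, 4.9, 5.1, 5.0],   # roughly flat
}

hits = coexpressed_genes(lncrna_expr, gene_expr)
for gene, r in sorted(hits.items()):
    print(f"{gene}\tr = {r:+.3f}")
```

Correlation alone does not show a direct interaction, so hits from a filter like this are candidates to combine with binding-site or interaction evidence.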
Part 4: Building a ceRNA network

    #!/usr/bin/env Rscript
    # Build a competing endogenous RNA (ceRNA) network
    library(igraph)

    # ========== The ceRNA hypothesis ==========
    # A lncRNA sponges a miRNA and so relieves the miRNA's repression of its targets:
    #   lncRNA --represses--> miRNA --represses--> mRNA
    # Net effect: the lncRNA indirectly increases mRNA expression

    # ========== Build the ceRNA network ==========
    # Three kinds of data are needed:
    # 1. miRNA-mRNA interactions (TargetScan/miRDB)
    # 2. miRNA-lncRNA interactions (miRcode/ENCORI)
    # 3. Expression correlation as a check (not included in this example)

    # Example data
    mirna_mrna <- data.frame(
      miRNA = c("miR-21", "miR-21", "miR-155", "miR-155", "miR-200a"),
      mRNA = c("PDCD4", "PTEN", "SOCS1", "TP53INP1", "ZEB1"),
      score = c(95, 88, 92, 85, 90)
    )
    mirna_lncrna <- data.frame(
      miRNA = c("miR-21", "miR-21", "miR-155", "miR-200a"),
      lncRNA = c("HOTAIR", "MALAT1", "NEAT1", "HOTAIR"),
      sites = c(3, 2, 4, 1)
    )

    # Build the network: one edge per miRNA-mRNA or miRNA-lncRNA interaction
    edges <- rbind(
      data.frame(from = mirna_mrna$miRNA, to = mirna_mrna$mRNA, type = "miRNA-mRNA"),
      data.frame(from = mirna_lncrna$miRNA, to = mirna_lncrna$lncRNA, type = "miRNA-lncRNA")
    )
    g <- graph_from_data_frame(edges, directed = TRUE)

    # Label nodes by RNA class. The attribute is named rna_type because igraph
    # uses the vertex attribute "type" to mark bipartite graphs.
    mirnas <- union(mirna_mrna$miRNA, mirna_lncrna$miRNA)
    V(g)$rna_type <- ifelse(V(g)$name %in% mirnas, "miRNA",
                     ifelse(V(g)$name %in% mirna_lncrna$lncRNA, "lncRNA", "mRNA"))

    # Plot, colored by RNA class
    colors <- c("miRNA" = "red", "lncRNA" = "blue", "mRNA" = "green")
    V(g)$color <- colors[V(g)$rna_type]
    plot(g,
         vertex.size = 20,
         vertex.label.cex = 0.8,
         edge.arrow.size = 0.5,
         main = "ceRNA Network")
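
The network above stores pairwise interactions; candidate ceRNA relationships are the lncRNA-miRNA-mRNA triplets in which a lncRNA and an mRNA share a miRNA. The sketch below derives those triplets in Python from the same example pairs. The function name `cerna_triplets` is hypothetical, and the expression-correlation check from step 3 is still needed before a triplet is taken seriously.

```python
from collections import defaultdict

# The example interactions from the R code, as (miRNA, partner) pairs
mirna_mrna = [
    ("miR-21", "PDCD4"), ("miR-21", "PTEN"),
    ("miR-155", "SOCS1"), ("miR-155", "TP53INP1"),
    ("miR-200a", "ZEB1"),
]
mirna_lncrna = [
    ("miR-21", "HOTAIR"), ("miR-21", "MALAT1"),
    ("miR-155", "NEAT1"), ("miR-200a", "HOTAIR"),
]


def cerna_triplets(mirna_mrna, mirna_lncrna):
    """Return sorted (lncRNA, miRNA, mRNA) triplets that share a miRNA."""
    targets = defaultdict(set)
    for mirna, mrna in mirna_mrna:
        targets[mirna].add(mrna)
    triplets = []
    for mirna, lncrna in mirna_lncrna:
        for mrna in targets.get(mirna, ()):
            triplets.append((lncrna, mirna, mrna))
    return sorted(triplets)


triplets = cerna_triplets(mirna_mrna, mirna_lncrna)
for lncrna, mirna, mrna in triplets:
    print(f"{lncrna} --| {mirna} --| {mrna}")
```

Under the ceRNA hypothesis the lncRNA and mRNA in a genuine triplet are expected to be positively correlated in expression, which gives a concrete filter to apply to this list.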
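
Common errors and fixes
| Error message | Cause | Fix |
|---|---|---|
| miRNA name not found | Non-standard miRNA name | Use the standard miRBase name (e.g. hsa-miR-21-5p) |
| No targets predicted | Wrong miRNA sequence | Make sure it is the mature miRNA sequence, not the pre-miRNA |
| Too many targets | Seed match too short (6mer sites match too often) | Keep only 7mer-m8 and 8mer sites |
| UTR sequence error | 3'UTR sequence retrieved incorrectly | Fetch it again from UCSC or Ensembl |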
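
The first two rows of the table can often be caught before a query is run. The sketch below performs a few sanity checks on a miRNA name and sequence. The function `check_mirna_input`, the name pattern, and the 18-26 nt length window are assumptions chosen for illustration; the pattern is not the official miRBase naming grammar.

```python
import re

# Assumed pattern for names such as hsa-miR-21-5p or hsa-let-7a-5p
# (illustrative only, not the official miRBase grammar)
NAME_PATTERN = re.compile(r"^[a-z]{3}-(miR|let)-\d+[a-z]?(-\d+)?(-[35]p)?$")


def check_mirna_input(name, seq):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name does not look like a miRBase mature miRNA name")
    rna = seq.upper().replace("T", "U")
    if set(rna) - set("ACGU"):
        problems.append("sequence contains characters other than A, C, G, U/T")
    # Mature miRNAs are about 22 nt; a much longer sequence may be a pre-miRNA
    if not 18 <= len(rna) <= 26:
        problems.append(f"length {len(rna)} nt is unusual for a mature miRNA")
    return problems


print(check_mirna_input("hsa-miR-21-5p", "UAGCUUAUCAGACUGAUGUUGA"))
print(check_mirna_input("miR21", "UAGCUUAUCAGACUGAUGUUGA"))
print(check_mirna_input("hsa-miR-21-5p", "A" * 72))
```

Returning a list of problems, not raising on the first one, lets a batch script report every issue for every miRNA in a single pass.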
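
Cheat sheet

    ========================================
    ncRNA target prediction cheat sheet
    ========================================
    [miRNA target prediction tools]
    TargetScan      → conservation + pairing-based scoring (the classic choice)
    miRDB           → machine-learning predictions (scores > 80 are high confidence)
    miRTarBase      → experimentally validated interactions (gold standard)
    ENCORI/starBase → mined from CLIP-seq data
    [Seed match types]
    8mer    → positions 2-8 + A opposite position 1; most stringent, most reliable
    7mer-m8 → positions 2-8; fairly stringent
    7mer-A1 → positions 2-7 + A opposite position 1; intermediate
    6mer    → positions 2-7; most permissive, many false positives
    [lncRNA target prediction]
    ceRNA analysis         → miRcode + ENCORI
    RNA-RNA interaction    → IntaRNA / RNAplex
    Co-expression analysis → WGCNA
    ChIRP-seq / CHART      → experimental mapping of lncRNA binding sites
    [lncRNA databases]
    NONCODE   → one of the most comprehensive ncRNA databases
    LNCipedia → human lncRNA annotation
    lncRNAdb  → functionally annotated lncRNAs
    [miRNA databases]
    miRBase    → miRNA sequences and nomenclature (the reference)
    miRDB      → target prediction
    miRTarBase → experimentally validated targets
    [Target validation strategy]
    Computational prediction → intersect several tools (predicted by ≥2 tools)
    Expression correlation   → miRNA up → target gene down
    Experimental validation  → luciferase reporter assay (gold standard)
    CLIP-seq                 → large-scale experimental identification of real targets
    [Interview questions]
    Q: What is the miRNA seed sequence?
    A: Nucleotides 2-8 of the miRNA, which pair with complementary sites in the target 3'UTR
    Q: What is the ceRNA hypothesis?
    A: lncRNAs compete for miRNA binding and so relieve the miRNA's repression of its target genes
    Q: How do you validate a miRNA target?
    A: Luciferase reporter assay (repression is lost when the seed site is mutated)
    ========================================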

References: TargetScan | miRDB | miRTarBase | ENCORI | Bartel, Cell 2009