415_NoSQL Database Selection and Comparison

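In One Sentence

NoSQL ("Not Only SQL") databases use different data models to handle workloads that relational databases are poorly suited to, such as massive document collections, graph relationships, and time-series data.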

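Core Concepts

The Four Main Types of NoSQL

Type	Representative products	Data model	Typical use cases
Key-value	Redis, DynamoDB	key → value	Caching, sessions, leaderboards
Document	MongoDB, Elasticsearch	Nested JSON documents	Content management, analysis reports
Column-family	HBase, Cassandra	Column families + row key	Time-series data, logs, big data
Graph	Neo4j, ArangoDB	Nodes + edges	Social networks, knowledge graphs, PPI (protein-protein interaction) networks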
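To make the four data models concrete, here is a minimal sketch that writes toy data in each shape using plain Python structures only. It does not talk to any real database; the variable names, the row-key format, and the `neighbors` helper are illustrative assumptions rather than any product's API.

```python
# Illustrative only: plain Python stand-ins for the four NoSQL data models.
# All names and values are made up for this example.

# Key-value: an opaque value looked up by a single key.
kv_store = {"gene:seq:BRCA1": "ATCGATCG"}

# Document: one nested, self-contained record.
document = {
    "sample_id": "S001",
    "top_variants": [{"gene": "BRCA1", "classification": "Pathogenic"}],
}

# Column family: row key -> column family -> column -> value.
column_family = {
    "S001#2024-01-01": {"metrics": {"mean_depth": 45.7, "pct_20x_coverage": 0.95}},
}

# Graph: nodes plus edges, stored here as an adjacency map.
graph = {"BRCA1": {"TP53", "EGFR"}, "TP53": {"BRCA1"}, "EGFR": {"BRCA1"}}


def neighbors(node: str) -> set:
    """Return the nodes directly connected to `node` (a one-hop traversal)."""
    return graph.get(node, set())
```

The point of the comparison is the access path: a key-value store answers "give me the value for this key", a document store returns the whole nested record, a column-family store addresses a cell by row key and column, and a graph store follows edges from a node.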

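CAP Theorem

Consistency: every node sees the same data at the same time
Availability: every request receives a response
Partition Tolerance: the system keeps running when the network is partitioned
Conclusion: a distributed system can fully guarantee at most two of the three. Because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency (CP) and availability (AP).

Choice	Examples
CP	HBase, MongoDB (strong consistency)
AP	Cassandra, CouchDB (high availability)
CA	Traditional relational databases (single node)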
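The CP versus AP trade-off can be illustrated with a toy simulation. The sketch below is an assumption-laden model, not how any of the listed databases is implemented: `Replica` and `TwoNodeStore` are hypothetical classes, there are only two replicas, and a "partition" is just a boolean flag.

```python
class Replica:
    """A single node holding one value."""

    def __init__(self, value):
        self.value = value


class TwoNodeStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (hypothetical model)."""

    def __init__(self, mode: str, initial=0):
        self.mode = mode
        self.a = Replica(initial)
        self.b = Replica(initial)
        self.partitioned = False  # True means A and B cannot talk to each other

    def write(self, value) -> bool:
        """Write through replica A. Returns True if the write was accepted."""
        if not self.partitioned:
            self.a.value = self.b.value = value  # replicate to both nodes
            return True
        if self.mode == "CP":
            return False  # refuse the write: B is unreachable, so stay consistent
        self.a.value = value  # AP: accept locally, B is now stale
        return True

    def read_from_b(self):
        """Read whatever replica B currently holds."""
        return self.b.value


cp = TwoNodeStore("CP")
cp.partitioned = True
cp_accepted = cp.write(1)   # rejected: availability is sacrificed

ap = TwoNodeStore("AP")
ap.partitioned = True
ap_accepted = ap.write(1)   # accepted: replicas now disagree
```

In the CP store both replicas still hold the old value after the failed write, while in the AP store replica A holds the new value and replica B serves stale data until the partition heals.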

Hands-On Code

# ========== 1. Redis (key-value): caching gene sequences ==========
import redis
import json

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

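# Scenario: cache frequently requested reference genome sequences (avoids re-reading the file every time)
def get_gene_sequence(gene_id: str) -> str:
    """Look up a gene sequence, going through the cache first."""
    cache_key = f"gene:seq:{gene_id}"  # Redis key naming convention: entity:field:ID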

    # Check the cache first
    cached = r.get(cache_key)
    if cached:
        print(f"Cache HIT for {gene_id}")
        return cached

    print(f"Cache MISS for {gene_id}, fetching from file...")
    # Simulated file read (a real implementation would read from a FASTA file)
    sequence = "ATCGATCGATCG" * 100  # placeholder sequence

    # Store in the cache with an expiry (3600 seconds = 1 hour)
    r.setex(cache_key, 3600, sequence)  # SETEX = SET + EXPIRE
    return sequence

# Example: a Sorted Set holding the gene expression ranking
def update_gene_ranking(sample_id: str, gene_expressions: dict):
    """Update the gene expression ranking (for fast Top-N lookups of highly expressed genes)."""
    ranking_key = f"ranking:{sample_id}"

    # Bulk-update scores (the expression value is the score).
    # redis-py raises DataError if ZADD is given an empty mapping, so skip the write in that case.
    if gene_expressions:
        r.zadd(ranking_key, gene_expressions)  # {gene_id: expression_value}
        r.expire(ranking_key, 86400)  # expires after 24 hours

    # Fetch the 10 most highly expressed genes (highest score first)
    top_genes = r.zrevrange(ranking_key, 0, 9, withscores=True)
    return top_genes

# Test
top = update_gene_ranking("S001", {"BRCA1": 245.7, "TP53": 89.3, "EGFR": 412.1, "KRAS": 156.8})
print(f"Top genes: {top}")

# ========== 2. MongoDB (document store): storing variable-shaped analysis reports ==========
from pymongo import MongoClient
from datetime import datetime, timezone

client = MongoClient("mongodb://localhost:27017/")
db = client["bioinformatics"]

# Upside: MongoDB can store nested, variable structures (no fixed schema required)
analysis_report = {
    "report_id": "RPT_001",
    "sample_id": "S001",
    "analysis_type": "variant_calling",
    "created_at": datetime.now(timezone.utc),  # timezone-aware UTC; PyMongo treats naive datetimes as UTC
    "summary": {
        "total_variants": 15432,
        "snp_count": 14218,
        "indel_count": 1214,
        "quality_metrics": {
            "mean_depth": 45.7,
            "pct_20x_coverage": 0.95
        }
    },
    "top_variants": [  # nested array (a relational database would need a separate table)
        {"gene": "BRCA1", "variant": "c.5266dupC", "classification": "Pathogenic"},
        {"gene": "TP53", "variant": "c.817C>T", "classification": "Likely_Pathogenic"},
    ],
    "pipeline_info": {
        "bwa_version": "0.7.17",
        "gatk_version": "4.3.0.0",
        "reference": "GRCh38"
    }
    # Different reports can carry different fields: that is the flexibility
}

# Insert the document
result = db.analysis_reports.insert_one(analysis_report)
print(f"Inserted report ID: {result.inserted_id}")

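# Query (similar to a SQL WHERE clause); returns the top_variants of one report for sample S001
sample_top_variants = db.analysis_reports.find_one(
    {"sample_id": "S001"},                # filter (the WHERE equivalent)
    {"top_variants": 1, "_id": 0}         # projection (the SELECT equivalent; 1 = include, 0 = exclude)
)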

# Aggregation pipeline (the equivalent of SQL GROUP BY plus aggregates)
pipeline = [
    {"$match": {"analysis_type": "variant_calling"}},  # filter
    {"$unwind": "$top_variants"},                       # unwind the array (one output document per variant)
    {"$group": {
        "_id": "$top_variants.classification",           # group by classification
        "count": {"$sum": 1},                           # count
        "genes": {"$addToSet": "$top_variants.gene"}    # collect the distinct genes
    }},
    {"$sort": {"count": -1}}                            # sort by count, descending
]
results = list(db.analysis_reports.aggregate(pipeline))
print(f"Variant classification counts: {results}")

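# ========== 3. Selection decision tree (expressed as Python) ==========
def choose_nosql(requirements: dict) -> str:
    """Pick a NoSQL database based on the stated requirements."""
    # Default to infinity so a missing latency target does not raise TypeError on the comparison
    if requirements.get("need_cache") and requirements.get("low_latency_ms", float("inf")) < 10:
        return "Redis (in-memory, typically sub-millisecond latency, good for caching)"

    if requirements.get("variable_schema") and requirements.get("nested_data"):
        return "MongoDB (document store, flexible schema, good for reports/logs)"

    if requirements.get("graph_queries") or requirements.get("network_analysis"):
        return "Neo4j (graph database, good for PPI/metabolic network analysis)"

    if requirements.get("time_series") and requirements.get("write_heavy"):
        return "InfluxDB/TimescaleDB (time-series databases, good for real-time sequencer data)"

    if requirements.get("full_text_search"):
        return "Elasticsearch (full-text search, good for literature/report search)"

    return "PostgreSQL (relational, the default choice, covers roughly 80% of use cases)"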

# Test
print(choose_nosql({"need_cache": True, "low_latency_ms": 5}))
print(choose_nosql({"variable_schema": True, "nested_data": True}))
print(choose_nosql({"graph_queries": True}))
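The cache-aside pattern used by `get_gene_sequence` in part 1 can also be exercised without a running Redis server. The following is a minimal sketch under that assumption: `TTLCache` and `cached_lookup` are hypothetical names introduced here, and the class only mimics the GET and SETEX behaviour in memory. It is not a Redis client.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for GET/SETEX semantics (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired entry
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def cached_lookup(cache: TTLCache, key: str, loader: Callable[[], str], ttl: float) -> str:
    """Cache-aside: return the cached value, or load it, store it, and return it."""
    value = cache.get(key)
    if value is None:
        value = loader()  # the slow path, e.g. reading a FASTA file
        cache.setex(key, ttl, value)
    return value
```

With a fake clock passed as `clock`, the first call to `cached_lookup` invokes the loader, a second call inside the TTL is served from the cache, and a call after the TTL has elapsed invokes the loader again.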

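Common Interview Questions

Q: When should you use NoSQL, and when SQL? A: SQL fits structured data, transactional workloads, and complex queries. NoSQL fits massive unstructured data (document stores), very high write throughput (column-family stores), graph relationship queries (graph databases), and caching (key-value stores).
Q: What is the biggest difference between MongoDB and a relational database? A: There is no fixed schema (documents with different structures can live in the same collection), nesting and arrays are supported natively, and horizontal scaling is easier. Multi-document ACID transactions exist only since 4.0 (replica sets) and 4.2 (sharded clusters) and cost more than single-document writes, so related data is usually modeled inside one document.
Q: Why is Redis so fast? A: Data lives in memory rather than on disk; commands execute on a single thread, which avoids lock contention (Redis 6.0+ can optionally use extra threads for network I/O only); I/O multiplexing lets that thread serve many connections; and the simple data structures (Hash, List, Sorted Set) have low operation complexity.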
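The "native nesting and arrays" point in the MongoDB answer can be shown side by side with the relational layout. The sketch below uses only the standard library (`sqlite3` and `json`) and is illustrative: the table names `reports` and `variants` are assumptions, and the document side is a plain dict rather than a MongoDB call.

```python
import json
import sqlite3

# A trimmed-down report with a nested array of variants.
report = {
    "report_id": "RPT_001",
    "top_variants": [
        {"gene": "BRCA1", "classification": "Pathogenic"},
        {"gene": "TP53", "classification": "Likely_Pathogenic"},
    ],
}

# Relational layout: the nested array becomes a second table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reports (report_id TEXT PRIMARY KEY);
    CREATE TABLE variants (
        report_id TEXT REFERENCES reports(report_id),
        gene TEXT,
        classification TEXT
    );
    """
)
conn.execute("INSERT INTO reports VALUES (?)", (report["report_id"],))
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [(report["report_id"], v["gene"], v["classification"]) for v in report["top_variants"]],
)

# Relational read: a JOIN reassembles the report from two tables.
rows = conn.execute(
    """
    SELECT v.gene, v.classification
    FROM reports r JOIN variants v ON v.report_id = r.report_id
    WHERE r.report_id = ?
    ORDER BY v.gene
    """,
    ("RPT_001",),
).fetchall()

# Document-style read: the nested array travels with the record itself.
as_document = json.loads(json.dumps(report))
genes_from_document = sorted(v["gene"] for v in as_document["top_variants"])
```

Both reads return the same two variants. The relational version needs a schema for each level of nesting and a JOIN at query time, while the document version keeps the array inside the record at the cost of weaker cross-record constraints.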

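Cheat Sheet

Scenario	Recommended database	Reason
API response caching	Redis	In-memory, sub-millisecond latency
Analysis report storage	MongoDB	Variable JSON structure
Full-text search	Elasticsearch	Inverted index
Protein interaction networks	Neo4j	Efficient graph traversal
Time-series sequencing data	InfluxDB	Optimized time-series compression
Log analysis	Cassandra/HBase	High write throughput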