h5py: the standard Python library for reading and writing the HDF5 scientific data format

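In one sentence

h5py lets Python work directly with HDF5 files (.h5/.hdf5), giving dictionary-style access to large numerical arrays stored on disk. It is widely used in bioinformatics, physics, and for storing deep learning weights.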

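Installation and setup

# Install with pip
pip install h5py                 # current major version: 3.x

# Install with conda
conda install -c conda-forge h5py

# Verify the installation
python -c "import h5py; print(h5py.__version__)"

# Show the version of the underlying HDF5 C library
python -c "import h5py; print(h5py.version.hdf5_version)"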

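Core usage

Creating and writing an HDF5 file

import h5py
import numpy as np

# Create/open the file (the with block closes it automatically)
with h5py.File("experiment.h5", "w") as f:   # "w" = create, overwriting any existing file

    # Write an array directly as a Dataset
    f.create_dataset(
        "counts",                            # dataset name
        data=np.random.rand(1000, 500),      # array to store
        compression="gzip",                  # compression filter
        compression_opts=4,                  # gzip level (0-9)
        chunks=(100, 100),                   # chunk shape
    )

    # Create a subgroup (similar to a folder)
    grp = f.create_group("metadata")
    grp.create_dataset("gene_names",          # list of gene names
        data=np.array(["GeneA", "GeneB"], dtype="S10"))  # strings stored as fixed-length bytes

    # Add attributes (small pieces of metadata)
    f.attrs["project"] = "宏基因组研究"       # file-level attribute ("metagenomics study")
    f["counts"].attrs["units"] = "raw_counts" # dataset-level attribute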
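The `S10` dataset above stores fixed-length ASCII bytes. For names that are not plain ASCII or vary a lot in length, a variable-length UTF-8 string type may be a better fit. The following is a minimal sketch, assuming h5py 3.x (where `Dataset.asstr()` is available); the file name and the sample names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sample data, including one non-ASCII name
names = ["GeneA", "GeneB", "基因C"]

# Write to a temporary location so the example leaves nothing behind in the cwd
path = os.path.join(tempfile.mkdtemp(), "strings_demo.h5")

with h5py.File(path, "w") as f:
    # Object array of Python str, stored as variable-length UTF-8 strings
    f.create_dataset(
        "gene_names",
        data=np.array(names, dtype=object),
        dtype=h5py.string_dtype(encoding="utf-8"),
    )

with h5py.File(path, "r") as f:
    # asstr() wraps the dataset so that reads return Python str objects
    decoded = list(f["gene_names"].asstr()[:])

print(decoded)
```

Variable-length strings are more flexible than a fixed width such as `S10`, at the cost of somewhat slower access for very large string datasets.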

Reading an HDF5 file

with h5py.File("experiment.h5", "r") as f:   # "r" = read-only

    # Inspect the top level (like ls)
    print(list(f.keys()))         # ['counts', 'metadata']

    # Access the array (lazy: data is read from disk only when sliced)
    counts = f["counts"]          # Dataset object
    print(counts.shape)           # (1000, 500)
    print(counts.dtype)           # float64

    # Read part of the data by slicing
    subset = counts[0:100, :]     # first 100 rows (only this part is read)
    full = counts[...]            # everything (use ... instead of :)

    # Read an attribute
    print(f.attrs["project"])     # 宏基因组研究
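Datasets inside groups can be addressed with a slash-separated path, and fixed-length byte strings need decoding before they behave like Python `str`. A small self-contained sketch follows; it builds its own throwaway file (the file name and attribute values are illustrative, not taken from the example above) and then reads it back.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "read_demo.h5")

# Build a tiny file with one group, one byte-string dataset, and two attributes
with h5py.File(path, "w") as f:
    grp = f.create_group("metadata")
    grp.create_dataset("gene_names", data=np.array(["GeneA", "GeneB"], dtype="S10"))
    f.attrs["project"] = "demo"
    f.attrs["n_genes"] = 2

with h5py.File(path, "r") as f:
    # A path with "/" reaches into nested groups in one step
    raw = f["metadata/gene_names"][...]          # NumPy array of bytes (dtype S10)
    gene_names = [b.decode("ascii") for b in raw]

    # attrs behaves like a mapping, so it can be copied into a plain dict
    file_attrs = dict(f.attrs)

print(gene_names)
print(file_attrs)
```

Copying the values out while the file is open matters: once the `with` block ends, the `Dataset` object is no longer usable, but the NumPy arrays and Python objects read from it are.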

Practical examples

Writing a large matrix in batches (memory-friendly)

import h5py
import numpy as np

n_samples, n_features = 100000, 10000  # 100,000 samples x 10,000 features (about 4 GB as float32)

# Create an empty dataset first, then fill it batch by batch
with h5py.File("large_data.h5", "w") as f:
    dset = f.create_dataset(
        "X",
        shape=(n_samples, n_features),    # full size
        dtype="float32",
        chunks=(1000, n_features),        # 1000 rows per chunk, matching the write batch
        compression="lzf",                # lzf is faster than gzip
    )

    # Write in batches so the full matrix never has to fit in memory
    batch = 1000
    for i in range(0, n_samples, batch):
        end = min(i + batch, n_samples)
        dset[i:end, :] = np.random.rand(end - i, n_features).astype("float32")
        print(f"Progress: {end}/{n_samples}")
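The same batching idea applies when reading. As a rough sketch, the code below computes per-column means by streaming row blocks from disk instead of loading the whole dataset; the sizes are deliberately tiny and the file name is a placeholder, so it runs in a moment.

```python
import os
import tempfile

import h5py
import numpy as np

# Small illustrative sizes; a real dataset would be far larger
n_samples, n_features, batch = 5000, 20, 1000
path = os.path.join(tempfile.mkdtemp(), "batch_read_demo.h5")

rng = np.random.default_rng(0)
data = rng.random((n_samples, n_features), dtype=np.float32)

with h5py.File(path, "w") as f:
    f.create_dataset("X", data=data, chunks=(batch, n_features), compression="lzf")

# Accumulate column sums in float64, one block of rows at a time
col_sum = np.zeros(n_features, dtype=np.float64)
with h5py.File(path, "r") as f:
    dset = f["X"]
    for start in range(0, dset.shape[0], batch):
        block = dset[start:start + batch, :]     # only this block is read from disk
        col_sum += block.sum(axis=0, dtype=np.float64)
    col_mean = col_sum / dset.shape[0]

print(col_mean[:5])
```

Reading blocks that line up with the chunk shape means each chunk is decompressed once, which is usually the cheapest access pattern.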

Walking the file structure

import h5py

def print_structure(name, obj):
    """Callback for visititems: print one line per group or dataset."""
    if isinstance(obj, h5py.Dataset):
        print(f"  [dataset] {name}: shape={obj.shape}, dtype={obj.dtype}")
    elif isinstance(obj, h5py.Group):
        print(f"  [group]   {name}/")

with h5py.File("experiment.h5", "r") as f:
    f.visititems(print_structure)    # visits every object at every level

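Common errors and fixes

Error	Cause	Fix
`OSError: Unable to open file`	Wrong path, or the file is already open or locked by another process	Check the path and close other handles to the file; to add to an existing file, open it with `mode="a"`, since `"w"` overwrites it
`TypeError: No conversion path for dtype`	Writing a NumPy Unicode (`<U`) string array	Convert to bytes with `arr.astype("S")` (ASCII only), or create the dataset with `dtype=h5py.string_dtype()`
`KeyError: '...'`	The key does not exist	Check with `list(f.keys())` first
Corrupted file	The process crashed partway through a write	Use a `with` block so the file is always closed properly; keep backups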
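For the `KeyError` case, a membership test or `Group.get()` avoids the exception altogether. The sketch below uses a throwaway file with a single dataset; the names `counts` and `missing` are illustrative only.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "keys_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("counts", data=np.arange(6).reshape(2, 3))

with h5py.File(path, "r") as f:
    has_counts = "counts" in f        # True: the dataset exists
    has_missing = "missing" in f      # False: no such object

    # get() returns None (or a supplied default) instead of raising
    missing = f.get("missing")

    # Direct indexing of a missing name raises KeyError
    try:
        f["missing"]
        raised = False
    except KeyError:
        raised = True

print(has_counts, has_missing, missing, raised)
```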

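Cheat sheet

Operation	Code
Create a new file	`h5py.File("f.h5", "w")`
Open for appending (create if missing)	`h5py.File("f.h5", "a")`
Open read-only	`h5py.File("f.h5", "r")`
Create a dataset	`f.create_dataset("name", data=arr)`
Create a group	`f.create_group("grp")`
Read all data	`f["name"][...]`
Read a slice	`f["name"][0:100]`
Read an attribute	`f.attrs["key"]`
List top-level contents	`list(f.keys())`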