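Working with HDF5 Files in Python Using h5py: Writing, Reading, Compression, and Resizing Explained

脚本专家 · posted 3 hours ago

HDF5 (Hierarchical Data Format version 5) is a high-performance file format designed for storing and managing large, complex datasets. In the Python ecosystem, h5py is the most widely used and most stable library for working with it. It exposes HDF5 Groups through a dictionary-like interface and Datasets through a NumPy-array-like interface, with support for slicing and lazy loading, which makes it a good fit for scientific computing and machine learning. Based on working code, this post walks through the core h5py operations: creating and reading datasets, organizing a hierarchy, compressed storage, growing datasets dynamically, and traversing the file structure.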

pip install h5py

Or with conda:

conda install h5py

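I. Basic write operations

Use a with statement to manage the file's lifetime so the file is always closed and no handles leak.

1. Creating a Dataset

import h5py
import numpy as np

with h5py.File("data_example.h5", "w") as f:
    # Option A: write an existing array directly
    matrix_data = np.random.randn(100, 100)
    f.create_dataset("matrix", data=matrix_data)
    # Option B: declare shape and dtype first, then fill in the values
    ds = f.create_dataset("empty_matrix", shape=(10, 10), dtype="float32")
    ds[...] = 1.0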

2. Creating a Group (folder)

# Mode "a" keeps the "matrix" dataset written in step 1; "w" would truncate the file
with h5py.File("data_example.h5", "a") as f:
    group_sensor = f.create_group("sensor_data")
    group_gps = group_sensor.create_group("gps")
    gps_coords = np.array([[39.9, 116.4], [31.2, 121.5]])
    group_gps.create_dataset("coordinates", data=gps_coords)
    # Implicit hierarchy: missing intermediate groups are created automatically
    f.create_dataset("sensor_data/temperature/reading", data=np.array([22.5, 23.0, 22.8]))

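Because Groups behave like dictionaries, the hierarchy can be navigated with keys and paths. The following is a minimal, self-contained sketch of that idea; the temporary file name is an assumption made for illustration, and require_group is used here as a convenient "get or create" alternative to create_group.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

with h5py.File(path, "w") as f:
    # Intermediate groups are created automatically from the path.
    f.create_dataset("sensor_data/gps/coordinates",
                     data=np.array([[39.9, 116.4], [31.2, 121.5]]))
    # require_group returns the group if it exists, otherwise creates it.
    temp_group = f.require_group("sensor_data/temperature")
    temp_group.create_dataset("reading", data=np.array([22.5, 23.0, 22.8]))

with h5py.File(path, "r") as f:
    top_level = list(f.keys())                  # names directly under the root
    children = sorted(f["sensor_data"].keys())  # names inside one group
    # Step-by-step indexing and a single path string reach the same dataset.
    coords = f["sensor_data"]["gps"]["coordinates"][:]
    same_coords = f["sensor_data/gps/coordinates"][:]
    has_reading = "sensor_data/temperature/reading" in f

print(top_level, children, has_reading)
```

Path strings are usually the more compact choice; step-by-step indexing is handy when a group object is already in hand.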

3. Writing Attributes (metadata)

# Mode "a" again, so the data written in the earlier steps is preserved
with h5py.File("data_example.h5", "a") as f:
    ds = f.create_dataset("temperature", data=[22.5, 23.0, 22.8])
    ds.attrs["unit"] = "Celsius"
    ds.attrs["sensor_model"] = "DHT22"
    ds.attrs["calibration_factor"] = 1.02
    f.attrs["author"] = "Antigravity AI"

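Attributes are read back through the same attrs mapping. The sketch below is self-contained and uses a scratch file of its own; the "location" key is a hypothetical attribute that is deliberately never written, to show a lookup with a default.

```python
import os
import tempfile

import h5py

# Illustrative scratch file, separate from data_example.h5.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")

with h5py.File(path, "w") as f:
    ds = f.create_dataset("temperature", data=[22.5, 23.0, 22.8])
    ds.attrs["unit"] = "Celsius"
    ds.attrs["calibration_factor"] = 1.02

with h5py.File(path, "r") as f:
    ds = f["temperature"]
    unit = ds.attrs["unit"]                         # single attribute lookup
    factor = ds.attrs["calibration_factor"]
    location = ds.attrs.get("location", "unknown")  # missing key, so the default is returned
    all_attrs = dict(ds.attrs)                      # copy everything into a plain dict
    calibrated = ds[:] * factor                     # use the metadata with the data

print(unit, location, sorted(all_attrs))
```

Copying attrs into a plain dict is convenient when the metadata needs to outlive the open file.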

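4. Resizable datasets (appending dynamically)

The maxshape parameter sets the maximum length of each dimension; None means the dimension can grow without limit. A dataset can later be resized only up to its maxshape.

with h5py.File('resizable.h5', 'w') as f:
    dset = f.create_dataset('expandable',
                            shape=(100,),
                            maxshape=(None,),
                            dtype='float64')
    dset[:] = np.arange(100)
    # Grow the dataset, then write into the new space
    dset.resize((200,))
    dset[100:] = np.arange(100, 200)
    # Two-dimensional case: only the first dimension can grow
    dset2 = f.create_dataset('expandable_2d',
                             shape=(10, 20),
                             maxshape=(None, 20),
                             dtype='int32')
    dset2[:] = np.arange(200).reshape(10, 20)
    dset2.resize((15, 20))
    dset2[10:] = np.arange(200, 300).reshape(5, 20)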

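The resize-then-write pattern is easy to wrap in a small helper. The sketch below is one possible way to do it; append_rows is a hypothetical helper name (not part of h5py), and the chunk shape and batch size are arbitrary choices for illustration. Appending in batches rather than one row at a time will generally keep the number of resize calls and small writes down.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow `dset` along axis 0 and write `rows` into the new space.

    Hypothetical helper for illustration; assumes maxshape[0] is None.
    """
    rows = np.asarray(rows)
    old_len = dset.shape[0]
    dset.resize(old_len + rows.shape[0], axis=0)  # resize only the first axis
    dset[old_len:] = rows


path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")

with h5py.File(path, "w") as f:
    # Start empty; an explicit chunk shape is chosen here for clarity.
    log = f.create_dataset("log", shape=(0, 3), maxshape=(None, 3),
                           dtype="float64", chunks=(1024, 3))
    for batch_id in range(4):
        batch = np.full((5, 3), float(batch_id))  # 5 new rows per batch
        append_rows(log, batch)
    final_shape = log.shape
    stored = log[:]

print(final_shape)
```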

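5. Compressed storage and chunking (Chunks)

Compression in HDF5 is applied per chunk, so a compressed dataset always uses chunked storage. If you omit the chunks argument, h5py picks a chunk shape automatically; set it explicitly when you know the access pattern. Compression can save a great deal of disk space, at the cost of extra CPU time on both writes and reads. The most common filter is gzip, with compression levels from 0 to 9 and a default of 4.

with h5py.File("compressed_example.h5", "w") as f:
    large_data = np.random.randn(1000, 1000)
    f.create_dataset(
        "compressed_matrix",
        data=large_data,
        compression="gzip",
        compression_opts=4,
        chunks=(100, 100)
    )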

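To see what compression actually did, the dataset's chunks, compression, and compression_opts properties can be inspected and the file sizes compared. This is a rough sketch under stated assumptions: the test array is deliberately repetitive (random noise barely compresses), the file names are illustrative, and the exact size ratio will vary with the data and the HDF5 build.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.h5")
gz_path = os.path.join(workdir, "gzip.h5")

# Every row is identical, so the data is highly compressible.
data = np.tile(np.arange(1000, dtype="float64"), (1000, 1))

with h5py.File(raw_path, "w") as f:
    raw_ds = f.create_dataset("m", data=data)
    raw_chunks = raw_ds.chunks            # None: contiguous layout

with h5py.File(gz_path, "w") as f:
    # No chunks argument: h5py chooses a chunk shape itself.
    gz_ds = f.create_dataset("m", data=data, compression="gzip")
    auto_chunks = gz_ds.chunks
    codec = gz_ds.compression
    level = gz_ds.compression_opts

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(f"chunks={auto_chunks}, codec={codec}, level={level}")
print(f"raw: {raw_size} bytes, gzip: {gz_size} bytes")
```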

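Why does compression require chunking?

Without chunking, HDF5 uses contiguous storage: the whole dataset is one continuous binary stream on disk. If that stream were compressed, it would become variable-length bytes, with two consequences:
1. The disk offset of an element could no longer be computed by formula, so the entire dataset would have to be decompressed just to read a small piece;
2. A local modification would force the whole dataset to be rewritten, because the compressed size changes.

Chunking splits the multidimensional array into equally sized sub-blocks. Each chunk is compressed independently and stored separately on disk, and the chunks are located through a B-tree index. As a result:
- Reading an element only requires decompressing the chunk that contains it, which is far faster;
- Modifying one chunk affects only that chunk's physical location, and no other chunk has to move;
- Growing the dataset is supported, since new chunks are simply written to free space in the file.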
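
The "only decompress the chunks you touch" point comes down to simple index arithmetic. The sketch below works it out in plain Python for a chunk shape of (100, 100); chunks_touched is a hypothetical helper written for this illustration and only mirrors the idea, not HDF5's internal code.

```python
from itertools import product


def chunks_touched(start, stop, chunk_shape):
    """Return grid indices of the chunks overlapping the half-open box [start, stop)."""
    ranges = [
        range(lo // c, (hi - 1) // c + 1)   # first and last chunk index along each axis
        for lo, hi, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


# A 10 x 5 slice at the origin sits inside a single chunk.
corner = chunks_touched((0, 0), (10, 5), (100, 100))

# A 10 x 10 slice straddling a chunk boundary on both axes needs four chunks.
straddling = chunks_touched((95, 95), (105, 105), (100, 100))

print(corner)       # [(0, 0)]
print(straddling)   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

This is why chunk shape should follow the access pattern: a slice that cuts across many chunks forces each of them to be read and decompressed.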

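6. Storing strings

NumPy's fixed-width Unicode dtype ('U') has no HDF5 counterpart, so h5py cannot store such arrays directly. For text, the recommended approach is to declare a variable-length UTF-8 string type with h5py.string_dtype() and to read the data back through asstr(), because h5py 3 returns string data as bytes by default.

with h5py.File("strings_example.h5", "w") as f:
    utf8_type = h5py.string_dtype(encoding="utf-8")
    words = np.array(["你好", "HDF5", "Python 数据科学"], dtype=object)
    ds = f.create_dataset("chinese_words", (3,), dtype=utf8_type)
    ds[:] = words
    # Shortcut: store byte strings directly
    f.create_dataset("ascii_bytes", data=[b"hello", b"world"])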

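Reading the strings back is the other half of the story. The following sketch assumes h5py 3.0 or later (where asstr() exists and plain indexing yields bytes) and uses a scratch file and dataset name chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "strings_demo.h5")

with h5py.File(path, "w") as f:
    utf8_type = h5py.string_dtype(encoding="utf-8")
    ds = f.create_dataset("words", (2,), dtype=utf8_type)
    ds[:] = np.array(["你好", "HDF5"], dtype=object)

with h5py.File(path, "r") as f:
    ds = f["words"]
    raw = ds[0]                      # bytes under h5py >= 3
    manual = raw.decode("utf-8")     # decoding by hand
    decoded = ds.asstr()[0]          # asstr() decodes on read
    all_words = list(ds.asstr()[:])  # decode the whole dataset

print(decoded, all_words)
```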

II. Reading HDF5 files

1. Traversing the file structure recursively

The visititems() method, combined with a custom callback, can print the complete tree together with its attributes.

def print_structure(name, obj):
    # One indentation level per "/" in the object's path
    indent = "  " * name.count("/")
    if isinstance(obj, h5py.Dataset):
        print(f"{indent}Dataset: {name} (shape={obj.shape}, dtype={obj.dtype})")
        if len(obj.attrs) > 0:
            for k, v in obj.attrs.items():
                print(f"{indent}  Attribute {k}: {v}")
    elif isinstance(obj, h5py.Group):
        print(f"{indent}Group: {name}")

with h5py.File("data_example.h5", "r") as f:
    print("--- File tree ---")
    f.visititems(print_structure)

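The same callback mechanism can collect information instead of printing it. Below is a small sketch that gathers every dataset path and shape into a dict; collect_datasets is a hypothetical helper and the file layout is invented for the demo. The callback returns None on purpose, since visititems() stops the traversal as soon as a callback returns anything else.

```python
import os
import tempfile

import h5py
import numpy as np


def collect_datasets(h5file):
    """Map each dataset path in `h5file` to its shape (illustrative helper)."""
    found = {}

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            found[name] = obj.shape
        # Implicitly returns None, so the walk continues.

    h5file.visititems(visitor)
    return found


path = os.path.join(tempfile.mkdtemp(), "walk_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("a/b/x", data=np.zeros(3))
    f.create_dataset("y", data=np.zeros((2, 2)))

with h5py.File(path, "r") as f:
    index = collect_datasets(f)

print(index)
```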

2. Reading all the data

When the dataset fits in memory, slicing with [:] returns the whole thing as a NumPy array.

with h5py.File("data_example.h5", "r") as f:
    if "matrix" in f:
        data = f["matrix"][:]
        print("Type:", type(data))
        print("Shape:", data.shape)

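3. Slicing and lazy loading (the core advantage)

Obtaining a dataset object does not load any data. Actual I/O happens only when you slice it, which makes it practical to work with files of many gigabytes.

with h5py.File("data_example.h5", "r") as f:
    dataset = f["matrix"]  # only a handle; nothing has been read yet
    sub_data = dataset[0:10, 0:5]  # loads just this region
    print("Partial data shape:", sub_data.shape)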

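Lazy loading makes it possible to process a dataset block by block without ever holding all of it in memory. Here is a minimal sketch that computes a mean that way; the array size, block size, and file name are arbitrary choices for the demo, and the block size is matched to the chunk height so that each read covers whole chunks.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "blocks_demo.h5")

# Small stand-in for a dataset that would not fit in memory.
with h5py.File(path, "w") as f:
    values = np.arange(1000 * 50, dtype="float64").reshape(1000, 50)
    f.create_dataset("matrix", data=values, chunks=(100, 50))

block_rows = 100
total = 0.0
count = 0

with h5py.File(path, "r") as f:
    dset = f["matrix"]                       # handle only, no data read yet
    for start in range(0, dset.shape[0], block_rows):
        part = dset[start:start + block_rows]  # reads one block of rows
        total += part.sum()
        count += part.size

mean_blockwise = total / count
print(mean_blockwise)
```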

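III. Quick reference for common operations

Open a file (read): f = h5py.File('data.h5', 'r') (preferably inside a with block)
Open a file (write, overwrites existing content): f = h5py.File('data.h5', 'w')
Open a file (append; read/write, created if missing): f = h5py.File('data.h5', 'a')
Create a Group: g = f.create_group('grp')
Check whether a path exists: 'grp/dset' in f
Delete a node: del f['grp/dset']
Create a Dataset: f.create_dataset('ds', data=arr)
Create an empty Dataset: f.create_dataset('ds', shape=(100,), dtype='i')
Enable compression: f.create_dataset('ds', data=arr, compression='gzip')
Create a resizable Dataset: f.create_dataset('ds', shape=(0,10), maxshape=(None,10))
Change the size: dset.resize((new_size, 10))
Write an attribute: obj.attrs['key'] = value
Read an attribute: value = obj.attrs['key']
Traverse the structure: f.visititems(callback)
Read the full data: arr = f['ds'][:]
Read a slice: arr = f['ds'][0:10,2:5]
Decode a string: s = f['str_ds'].asstr()[0]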
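
The difference between the 'w' and 'a' modes is worth seeing once, since 'w' silently discards whatever the file already held. A short sketch, with a scratch file and dataset names made up for the demo:

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

with h5py.File(path, "w") as f:          # create (or truncate) the file
    f.create_dataset("first", data=[1, 2, 3])

with h5py.File(path, "a") as f:          # reopen read/write, keeping existing content
    f.create_dataset("second", data=[4, 5, 6])
    after_append = sorted(f.keys())

with h5py.File(path, "w") as f:          # truncates: both datasets are gone
    after_truncate = list(f.keys())

print(after_append, after_truncate)
```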

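With the code and explanations above, you should be able to get started with h5py quickly and manage scientific data in HDF5 efficiently. In real projects, choose your compression and chunking strategy based on data size and access pattern, so that disk space and read/write performance stay in balance.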

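热心网友4 · posted 3 hours ago

Thanks for the detailed tutorial, it is very clearly written! The explanation of why compression has to go together with chunking is especially thorough. I used to just copy the pattern, and now I finally understand the principle. The resizable dataset usage is also very practical; bookmarked so I can try it out. Looking forward to the follow-up on storing strings.

热心网友4 · posted 3 hours ago

A very thorough tutorial. The part on the principles of compression and chunking is explained particularly well. I never understood why compression has to be paired with chunks, and now I do. The dynamic resizing example is also very useful, since my project needs to append data at any time. The section on storing strings seems to have been cut off, though. Could you fill it in? Looking forward to more!

热心网友4 · posted 3 hours ago

Thanks for sharing, this is very clear! The principle behind compression requiring chunking was something I never got, and after reading this I understand why contiguous storage cannot simply be compressed. One question: if I append to a resizable dataset frequently, will the repeated resize calls cause serious file fragmentation? Does h5py have anything like a compaction mechanism internally?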
