Great Expectations: an enterprise-grade data quality validation framework


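In one sentence

Great Expectations (GX) lets you describe what your data should look like as "Expectation" rules, automatically checks whether the actual data meets them, and generates HTML data quality reports. It acts as the test suite for a data pipeline.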


Installation and setup

# Install with pip
pip install great-expectations         # current major version: 1.x (GX Core)

# Verify the installation
python -c "import great_expectations as gx; print(gx.__version__)"
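
The API calls in this guide are specific to the 1.x line, so it can help to check the installed major version before running them. The following is a minimal sketch using only the standard library; `installed_major` is a hypothetical helper name, and the distribution name `great-expectations` is taken from the pip command above.

```python
from importlib import metadata


def installed_major(dist_name: str = "great-expectations"):
    """Return the installed major version of a distribution, or None if unknown."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None  # the distribution is not installed
    head = version.split(".")[0]
    return int(head) if head.isdigit() else None


major = installed_major()
if major is None:
    print("great-expectations is not installed (or its version is unreadable)")
elif major < 1:
    print(f"found {major}.x; the examples in this guide assume the 1.x API")
else:
    print(f"found {major}.x")
```

The check reads package metadata only, so it works even when importing the package itself would fail.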

Core usage

Quick validation of a pandas DataFrame

import great_expectations as gx
import pandas as pd

# Build sample data
df = pd.DataFrame({
    "age":    [25, 30, -1, 200, 45],   # includes out-of-range values (-1, 200)
    "email":  ["a@b.com", "bad", "c@d.com", None, "e@f.com"],
    "salary": [5000, 6000, 7000, 8000, None],
})

# Get a Data Context (ephemeral and in-memory when no project exists; fine for quick tests)
context = gx.get_context()

# Register the DataFrame as a data asset on a pandas data source
ds = context.data_sources.add_pandas("my_pandas")
asset = ds.add_dataframe_asset("patient_data")
batch_def = asset.add_batch_definition_whole_dataframe("batch")

batch = batch_def.get_batch(batch_parameters={"dataframe": df})

Defining Expectations

# Create an Expectation Suite
suite = context.suites.add(gx.ExpectationSuite(name="patient_checks"))

# Add the individual Expectation rules
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="age",               # check the age column
    min_value=0,                # lower bound
    max_value=120,              # upper bound
))

suite.add_expectation(gx.expectations.ExpectColumnValuesToMatchRegex(
    column="email",             # check the email column
    regex=r"^[^@]+@[^@]+\.[^@]+$",  # simple email-format pattern
))

suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(
    column="salary",            # salary must not be null
))

suite.add_expectation(gx.expectations.ExpectColumnToExist(
    column="age",               # the column must exist
))
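
To see what these rules should flag on the sample data, the same checks can be written in plain Python. This is only an illustration of the rule semantics, not GX code. It assumes that missing values are skipped by the range and regex rules and reported only by the not-null rule.

```python
import re

# Same sample values as the DataFrame above, as plain Python lists.
ages = [25, 30, -1, 200, 45]
emails = ["a@b.com", "bad", "c@d.com", None, "e@f.com"]
salaries = [5000, 6000, 7000, 8000, None]

EMAIL_RE = re.compile(r"^[^@]+@[^@]+\.[^@]+$")

# Range rule: flag non-null values outside [0, 120].
bad_ages = [a for a in ages if a is not None and not (0 <= a <= 120)]

# Regex rule: flag non-null values that do not match the pattern
# (assumption: nulls are left to the not-null rule).
bad_emails = [e for e in emails if e is not None and not EMAIL_RE.match(e)]

# Not-null rule: record the row positions of missing values.
null_salary_rows = [i for i, s in enumerate(salaries) if s is None]

print(bad_ages)          # [-1, 200]
print(bad_emails)        # ['bad']
print(null_salary_rows)  # [4]
```

If these assumptions hold, a run of the suite should report two unexpected ages, one unexpected email, and one null salary.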

Running the validation

# Create a Validation Definition that ties the data to the suite
validation_def = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="patient_validation",
        data=batch_def,
        suite=suite,
    )
)

# Run the validation
results = validation_def.run(batch_parameters={"dataframe": df})

# Inspect the results
print(results.success)         # True only if every Expectation passed
for r in results.results:
    # In GX 1.x the Expectation name is stored under `expectation_config.type`
    print(r.expectation_config.type,
          "✓" if r.success else "✗",
          r.result.get("unexpected_count", 0), "unexpected values")
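
When a suite grows, printing every result gets noisy, and a list of just the failures is often more useful. The sketch below assumes the GX 1.x result layout, with `success`, `expectation_config.type` and a `result` dict on each entry. `failed_expectations` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in so the snippet runs without GX.

```python
from types import SimpleNamespace


def failed_expectations(results):
    """Return (expectation type, unexpected_count) for each failed expectation."""
    failures = []
    for r in results.results:
        if not r.success:
            count = (r.result or {}).get("unexpected_count", 0)
            failures.append((r.expectation_config.type, count))
    return failures


# Stand-in with the attribute layout assumed above (not a real GX result object).
fake = SimpleNamespace(
    success=False,
    results=[
        SimpleNamespace(
            success=False,
            expectation_config=SimpleNamespace(type="expect_column_values_to_be_between"),
            result={"unexpected_count": 2},
        ),
        SimpleNamespace(
            success=True,
            expectation_config=SimpleNamespace(type="expect_column_to_exist"),
            result={},
        ),
    ],
)
print(failed_expectations(fake))  # [('expect_column_values_to_be_between', 2)]
```

With a real run, `failed_expectations(results)` would be called on the object returned by `validation_def.run(...)`.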

Worked example

Quality checks for a bioinformatics data pipeline

import great_expectations as gx
import pandas as pd

# Load a per-sample microbiome (OTU) table. sample_id stays a regular column
# because GX validates columns, not the DataFrame index.
otu_df = pd.read_csv("otu_table.csv")

context = gx.get_context()
ds = context.data_sources.add_pandas("bioinf")
asset = ds.add_dataframe_asset("otu_table")
batch_def = asset.add_batch_definition_whole_dataframe("bd")

suite = context.suites.add(gx.ExpectationSuite(name="otu_checks"))

# Quality rules for the OTU table
suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(
    min_value=10, max_value=10000        # 10 to 10,000 rows (one row per sample)
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="Shannon",                    # Shannon diversity index
    min_value=0.0, max_value=10.0
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(
    column="sample_id"                   # sample ID must not be missing
))

vd = context.validation_definitions.add(
    gx.ValidationDefinition(name="otu_vd", data=batch_def, suite=suite)
)
results = vd.run(batch_parameters={"dataframe": otu_df})
print(f"Data quality check: {'passed' if results.success else 'failed'}")
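
In a real pipeline a failed check usually has to stop downstream steps rather than just print a message. One possible way to do that is sketched below. `DataQualityError` and `require_passing` are hypothetical names, and the helper relies only on the `success` and `results` attributes of the validation result. The `SimpleNamespace` objects are stand-ins so the snippet runs on its own.

```python
from types import SimpleNamespace


class DataQualityError(RuntimeError):
    """Raised when a validation run reports failure (hypothetical helper)."""


def require_passing(results, label="validation"):
    """Return results unchanged if they passed; otherwise raise DataQualityError."""
    if not results.success:
        failed = sum(1 for r in results.results if not r.success)
        raise DataQualityError(f"{label}: {failed} expectation(s) failed")
    return results


# Stand-ins for validation results (not real GX objects).
passing = SimpleNamespace(success=True, results=[])
failing = SimpleNamespace(success=False, results=[SimpleNamespace(success=False)])

require_passing(passing, "otu_checks")       # passes through silently
try:
    require_passing(failing, "otu_checks")
except DataQualityError as err:
    message = str(err)
    print(message)                           # otu_checks: 1 expectation(s) failed
```

Calling `require_passing(results, "otu_checks")` right after `vd.run(...)` would then abort the pipeline step on bad data.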

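Common errors and fixes

Error                              | Cause                                | Fix
DataContextError                   | Project directory not found          | Call gx.get_context() instead of initializing the context manually
ExpectationNotFoundError           | Expectation class name is misspelled | Check the documentation; class names are case-sensitive
TypeError: unhashable type: 'dict' | Column values contain dicts          | Preprocess the column and convert the values to basic types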
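
For the unhashable-type error, one way to convert values to basic types is to serialize nested values to strings before validation. The sketch below is one possible approach, not a GX feature. `to_hashable` is a hypothetical helper, and the column name `meta` in the last comment is an assumed example.

```python
import json


def to_hashable(value):
    """Serialize dicts and lists to a stable JSON string; pass other values through."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True, ensure_ascii=False)
    return value


raw = [{"b": 1, "a": 2}, [1, 2], "plain", None, 3]
cleaned = [to_hashable(v) for v in raw]
print(cleaned)  # ['{"a": 2, "b": 1}', '[1, 2]', 'plain', None, 3]

# With pandas (assumed column name "meta"):
# df["meta"] = df["meta"].map(to_hashable)
```

Sorting the keys keeps equal dicts equal after serialization, so uniqueness and set-membership checks still behave as expected.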

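Quick reference

Task                     | Code
Initialize               | gx.get_context()
Add a pandas data source | context.data_sources.add_pandas("name")
Create a suite           | context.suites.add(gx.ExpectationSuite(name=...))
Value range check        | ExpectColumnValuesToBeBetween(column=..., min_value=..., max_value=...)
Not-null check           | ExpectColumnValuesToNotBeNull(column=...)
Regex match              | ExpectColumnValuesToMatchRegex(column=..., regex=...)
Row count range          | ExpectTableRowCountToBeBetween(min_value=..., max_value=...)
Run validation           | validation_def.run(batch_parameters={"dataframe": df})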