The Complete Guide to the Prodigy Annotation Tool

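Why learn Prodigy

An active learning loop that speeds up annotation: Prodigy is built around model-assisted annotation. The model predicts first, and you only confirm or reject each suggestion; a binary decision is much faster than choosing among many options. The model learns from your feedback, so its suggestions improve as you annotate. Compared with annotating everything from scratch, this can be roughly 5 to 10 times more efficient, depending on the task.
Built by the spaCy team, with strong NLP support: Prodigy is developed by Explosion, the company behind spaCy. It integrates closely with spaCy and Transformer-based pipelines and ships built-in workflows for NER, text classification, dependency parsing, and span annotation.
Command-line driven and friendly to data scientists: there is no separate web server to configure; a single command starts the annotation app. Data scientists can annotate on their own machines without DevOps support, which suits small teams and research settings.
Custom recipes (workflows): the recipe system lets you define a complete annotation workflow in Python: data loading, preprocessing, model predictions, the annotation interface, and post-processing. This makes Prodigy highly extensible.
Local data, better privacy: data and models run on your own machine, and nothing is uploaded to the cloud by default. This matters a great deal in sensitive domains such as healthcare and finance.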

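Core concepts in detail

What Prodigy is (in plain terms)

How traditional annotation tools work: you get a pile of data and label it item by item, from start to finish. Labeling 1,000 items can take several days.

How Prodigy works: the model scores the incoming data and makes a guess for each example. Prodigy then shows you the examples the model is least certain about. You only tell it "right" or "wrong" (by default, press A to accept or X to reject). The model is updated with your answers, so the next batch of suggestions is better. In favorable cases, a few hundred well-chosen annotations can be worth more than a few thousand collected in order.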
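
The idea of "show the least certain examples first" can be sketched in a few lines of plain Python. This is only an illustration of uncertainty sampling, not Prodigy's actual sorter: the function name `select_uncertain`, the `margin` parameter, and the scores below are all made up for the example.

```python
from typing import Iterable, Iterator, Tuple


def select_uncertain(
    scored: Iterable[Tuple[float, dict]], margin: float = 0.2
) -> Iterator[dict]:
    """Yield examples whose model score lies within `margin` of 0.5.

    A score near 0.5 means the (hypothetical) binary classifier is unsure,
    so a human answer is most informative there.
    """
    for score, example in scored:
        if abs(score - 0.5) <= margin:
            yield example


# Hypothetical (score, example) pairs, as a model might produce them
scored_stream = [
    (0.97, {"text": "Great product!"}),
    (0.52, {"text": "It's okay, I guess."}),
    (0.08, {"text": "Terrible, returning it."}),
    (0.41, {"text": "Works, but the manual is confusing."}),
]

queue = list(select_uncertain(scored_stream))
print([eg["text"] for eg in queue])
```

Only the two borderline examples reach the annotator; the confident predictions are skipped.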

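Core concepts

Concept	Description
Recipe	Annotation workflow script (data source + model + interface + storage)
Stream	The sequence of examples waiting to be annotated
Dataset	Storage for collected annotations (a SQLite database by default)
Annotation Interface	The type of annotation UI (classification, spans, choice, and so on)
Model-in-the-Loop	A model takes part in the annotation loop
Active Learning	Examples the model is unsure about are annotated first
Binary Annotation	Accept / reject decisions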

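Annotation interface types

Interface	Purpose	How you annotate
`classification`	Text / image classification with one label per task	Accept / reject / ignore
`ner`	Review a suggested named entity	Accept / reject the highlighted span
`ner_manual`	Manual NER	Pick a label, then highlight text
`spans`	Review suggested spans	Accept / reject the highlighted spans
`text_input`	Free-text input	Type an answer
`choice`	Multiple or single choice	Select options
`image`	Image review	Accept / reject
`image_manual`	Bounding boxes on images	Draw rectangles
`compare`	Comparison	Choose the better of two outputs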

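Prodigy vs Label Studio

Feature	Prodigy	Label Studio
Positioning	Specialized annotation tool (efficiency first)	General-purpose annotation platform (flexibility first)
Open source	Proprietary (one-time license purchase, about $490; check the purchase page for current pricing)	Open source (paid enterprise edition)
Active learning	Core feature	Via ML Backend
Annotation style	Mostly binary decisions	Many styles
NLP support	Very strong (native spaCy integration)	Good
CV support	Basic	Strong
Deployment	Local, from the command line	Web service
Team collaboration	Limited (multiple sessions)	Built in
Customization	Python recipes	XML templates
Data storage	SQLite (local)	Database + files
Learning curve	Low (CLI driven)	Medium
Best suited for	NLP, fast iteration, research	General use, large teams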

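Installation and configuration

# Prodigy is commercial software and requires a license
# After purchase you receive the install command and a license key

# Install with pip (XXXX-XXXX stands for your license key)
pip install prodigy -f https://XXXX-XXXX@download.prodi.gy

# Verify the installation
prodigy --version
prodigy stats

# By default, data is stored in the ~/.prodigy/ directory
# prodigy.json is the configuration file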

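Configuration file

// ~/.prodigy/prodigy.json (file path shown for reference only; JSON does not allow comments, so leave this line out of the real file)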
{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/path/to/data"
    }
  },
  "host": "localhost",
  "port": 8080,
  "api_keys": {},
  "show_hierarchical_labels": true,
  "custom_theme": {
    "cardMaxWidth": 800
  }
}

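Quick start: a minimal example in 5 minutes

Text classification (sentiment analysis)

# Prepare the data (texts.jsonl, one JSON object per line)
echo '{"text": "这个产品非常好用！"}' > texts.jsonl
echo '{"text": "质量太差了，退货"}' >> texts.jsonl
echo '{"text": "还行吧，一般般"}' >> texts.jsonl

# Start annotating (textcat.manual needs no model; the labels mean positive, negative, neutral)
# Add --exclusive if each text should get exactly one label
prodigy textcat.manual sentiment texts.jsonl --label 正面,负面,中性

# Open http://localhost:8080
# Save in the web app (save button or Ctrl+S), then press Ctrl+C in the terminal to stop the server

# View annotation statistics
prodigy stats sentiment

# Export the annotations
prodigy db-out sentiment > annotations.jsonl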
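
Once `annotations.jsonl` exists, a short script can summarize it. The sketch below assumes that each exported record carries an `answer` field (`accept`, `reject` or `ignore`) and, for multi-label `textcat.manual` tasks, an `accept` list with the selected labels; the sample records are made up and only mimic that shape, so check them against your own export.

```python
import json
from collections import Counter


def summarize(lines):
    """Count answers and selected labels in exported annotation records."""
    answers = Counter()
    labels = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        answers[record.get("answer", "unknown")] += 1
        # Only count labels on examples the annotator accepted
        if record.get("answer") == "accept":
            labels.update(record.get("accept", []))
    return answers, labels


# Made-up records in the assumed export shape
sample_lines = [
    json.dumps({"text": "这个产品非常好用！", "accept": ["正面"], "answer": "accept"}, ensure_ascii=False),
    json.dumps({"text": "质量太差了，退货", "accept": ["负面"], "answer": "accept"}, ensure_ascii=False),
    json.dumps({"text": "还行吧，一般般", "accept": [], "answer": "ignore"}, ensure_ascii=False),
]

answer_counts, label_counts = summarize(sample_lines)
print(dict(answer_counts))
print(dict(label_counts))
```

To run it on a real export, pass an open file instead of the sample list, for example `summarize(open("annotations.jsonl", encoding="utf-8"))`.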

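NER annotation (with model pre-annotation)

# Pre-annotate with a Chinese spaCy pipeline
# The labels must be ones the model actually predicts; zh_core_web_sm uses OntoNotes-style labels such as PERSON, GPE, ORG
prodigy ner.correct my_ner zh_core_web_sm texts.jsonl --label PERSON,GPE,ORG

# Or annotate fully by hand with your own label scheme (person, location, organization, time)
prodigy ner.manual my_ner blank:zh texts.jsonl --label 人名,地点,组织,时间

# Open http://localhost:8080
# Pick a label, highlight the text, then accept or reject the example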

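Advanced usage

Scenario 1: Model-assisted active learning for NER

# Run ner.teach with a spaCy pipeline (the active learning loop)
# The labels must exist in the model, so use its own label names
prodigy ner.teach my_ner zh_core_web_sm texts.jsonl --label PERSON,GPE,ORG

# The model picks the examples it is least sure about
# You only accept (green check) or reject (red cross)

# train-curve does not annotate anything: it trains on increasing portions of the data
# to show whether collecting more annotations is likely to help
prodigy train-curve --ner my_ner_data --lang zh --gpu-id 0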

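Scenario 2: Training a spaCy model

# After annotating, train a spaCy pipeline directly from the dataset
prodigy train ./output --ner my_ner_data --lang zh

# Train with a held-out evaluation split
prodigy train ./output --ner my_ner_data --eval-split 0.2 --lang zh

# Start from a transformer-based pipeline (requires the zh_core_web_trf package)
prodigy train ./output --ner my_ner_data --lang zh --gpu-id 0 \
    --base-model zh_core_web_trf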

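Scenario 3: Custom recipes

# my_recipe.py
import prodigy
from prodigy.components.loaders import JSONL  # loader API from Prodigy v1.11 and earlier
from prodigy.util import split_string
import spacy

@prodigy.recipe(
    "custom.classify",
    dataset=("Dataset name", "positional", None, str),
    source=("Path to the source JSONL file", "positional", None, str),
    label=("Comma-separated labels", "option", "l", split_string),
)
def custom_classify(dataset, source, label=None):
    """Custom classification workflow."""
    # Default labels: positive, negative, neutral
    labels = label or ["正面", "负面", "中性"]
    nlp = spacy.blank("zh")

    stream = JSONL(source)

    # Preprocessing: attach metadata, tokens and answer options to each task
    def add_metadata(stream):
        for eg in stream:
            eg["meta"] = {
                "length": len(eg["text"]),
                "source": "custom",
            }
            # Tokenize the text and store the tokens on the task
            doc = nlp(eg["text"])
            eg["tokens"] = [
                {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
                for t in doc
            ]
            # The choice interface reads its options from the task itself
            eg["options"] = [{"id": name, "text": name} for name in labels]
            yield eg

    return {
        "view_id": "choice",             # choosing among several labels needs the choice UI
        "dataset": dataset,              # dataset to save annotations to
        "stream": add_metadata(stream),  # stream of tasks
        "config": {
            "choice_style": "single",    # exactly one label per text
            # "instructions" expects a path to an HTML or text file, not an inline string
        },
    }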

Run it:

prodigy custom.classify my_data texts.jsonl -l 正面,负面,中性 -F my_recipe.py

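Scenario 4: Image annotation

# Image recipes load a directory of images by default; ./images is a placeholder path
# To read image URLs from a JSONL file such as images.jsonl instead, pass --loader jsonl

# Image classification: accept or reject one label per image with the generic mark recipe
prodigy mark my_images ./images --loader images --label 猫 --view-id classification

# Object detection (draw bounding boxes); the labels mean cat, dog, bird
prodigy image.manual my_objects ./images --label 猫,狗,鸟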

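Scenario 5: Proofreading and error tagging

# correct_recipe.py
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "text.correct",
    dataset=("Dataset name", "positional", None, str),
    source=("Path to the source JSONL file", "positional", None, str),
)
def text_correct(dataset, source):
    """Tag each text with the kinds of errors it contains."""
    stream = JSONL(source)

    def add_options(stream):
        for eg in stream:
            # Options: correct, has typos, grammar error, incomplete content
            eg["options"] = [
                {"id": "correct", "text": "正确"},
                {"id": "typo", "text": "有错别字"},
                {"id": "grammar", "text": "语法错误"},
                {"id": "incomplete", "text": "内容不完整"},
            ]
            yield eg

    return {
        "view_id": "choice",
        "dataset": dataset,
        "stream": add_options(stream),
        "config": {
            "choice_style": "multiple",  # several options may be selected
        },
    }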

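Scenario 6: Relation extraction

# 1. Annotate the entities first (person, organization, drug, disease)
prodigy ner.manual my_ents blank:zh texts.jsonl --label 人名,组织,药物,疾病

# 2. Then annotate relations (works at, treats, located in)
#    Loading from dataset:my_ents reuses the entities from step 1
prodigy rel.manual my_rels blank:zh dataset:my_ents \
    --label 就职于,治疗,位于 \
    --span-label 人名,组织,药物,疾病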

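Scenario 7: Batch annotation and data management

# Merge several datasets (input datasets first, then the output dataset)
prodigy db-merge dataset1,dataset2,dataset3 merged_data

# List all datasets
prodigy stats -l

# Show statistics for one dataset
prodigy stats my_ner_data

# Delete a dataset
prodigy drop my_temp_data

# Export to spaCy's DocBin format
prodigy data-to-spacy ./corpus --ner my_ner_data --eval-split 0.2

# Import annotated data
prodigy db-in my_data annotations.jsonl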

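Scenario 8: LLM-assisted annotation

# llm_recipe.py
import prodigy
from prodigy.components.loaders import JSONL
import openai

@prodigy.recipe(
    "llm.classify",
    dataset=("Dataset name", "positional", None, str),
    source=("Path to the source JSONL file", "positional", None, str),
    label=("Comma-separated labels", "option", "l", str),
)
def llm_classify(dataset, source, label="正面,负面,中性"):
    """Pre-annotate with an LLM, then review by hand."""
    labels = label.split(",")
    label_list = ", ".join(labels)
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

    stream = JSONL(source)

    def add_llm_suggestions(stream):
        for eg in stream:
            # Ask the LLM for a prediction
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": (
                        f"Classify the following text as one of: {label_list}. "
                        f"Reply with the label only.\n{eg['text']}"
                    ),
                }],
                max_tokens=10,
            )
            prediction = response.choices[0].message.content.strip()

            # Fall back to the first label if the reply is not a known label;
            # the raw reply stays in meta so the reviewer can see it
            eg["label"] = prediction if prediction in labels else labels[0]
            eg["meta"] = {"llm_prediction": prediction}
            yield eg

    return {
        "view_id": "classification",
        "dataset": dataset,
        "stream": add_llm_suggestions(stream),
        "config": {"labels": labels},
    }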
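
LLM replies often carry extra punctuation or wording, so an exact `in labels` check can miss valid answers. The helper below is one possible way to normalize a reply before assigning it to a task; `normalize_prediction` is a hypothetical name and its matching rules are an assumption, not part of Prodigy or the OpenAI client.

```python
from typing import List, Optional


def normalize_prediction(raw: str, labels: List[str]) -> Optional[str]:
    """Map a raw LLM reply to one of `labels`, or None if it is ambiguous."""
    # Remove whitespace plus common quotes and sentence-final punctuation
    cleaned = raw.strip().strip("\"'“”。.!！:： ")
    if cleaned in labels:
        return cleaned
    # Otherwise accept the reply only if exactly one label occurs in it
    matches = [name for name in labels if name in cleaned]
    if len(matches) == 1:
        return matches[0]
    return None


labels = ["正面", "负面", "中性"]
print(normalize_prediction(" 正面。", labels))
print(normalize_prediction("类别：负面", labels))
print(normalize_prediction("正面或中性", labels))
```

Returning `None` for ambiguous replies lets a recipe skip the pre-annotation or flag the task, instead of silently defaulting to the first label.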

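FAQ and troubleshooting

Problem 1: The port is already in use

# Recipes have no port flag; set the port with an environment variable
PRODIGY_PORT=9090 prodigy textcat.manual my_data texts.jsonl --label A,B

# Or change it in the configuration file
# ~/.prodigy/prodigy.json: {"port": 9090}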

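Problem 2: Chinese word segmentation is inaccurate

# Install a Chinese spaCy pipeline
pip install spacy
python -m spacy download zh_core_web_sm

# Use the Chinese pipeline (with the label names the model predicts)
prodigy ner.correct my_ner zh_core_web_sm texts.jsonl --label PERSON,GPE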

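Problem 3: Data format issues

# Prodigy expects JSONL (one JSON object per line)
# Make sure the file is encoded as UTF-8

# Convert from CSV (assumes the text is in a column named "content")
python -c "
import csv, json
with open('data.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(json.dumps({'text': row['content']}, ensure_ascii=False))
" > data.jsonl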

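Problem 4: How to resume annotation

Prodigy saves annotations to the SQLite database. If you quit in the middle of a session:

# Restart with the same dataset name and the same source file to continue;
# examples that are already in the dataset are skipped
prodigy ner.manual my_ner blank:zh texts.jsonl --label 人名,地点

# Check how many examples are annotated
prodigy stats my_ner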
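
Resuming works because already-annotated examples can be recognized and filtered out of the stream. The sketch below shows the general idea with a plain SHA-1 hash of the text; it is a simplified illustration, and the helper names `text_hash` and `skip_seen` are made up. Prodigy computes its own input and task hashes internally.

```python
import hashlib
from typing import Iterable, Iterator, Set


def text_hash(example: dict) -> str:
    """Stable fingerprint of an example's text (illustrative only)."""
    return hashlib.sha1(example["text"].encode("utf-8")).hexdigest()


def skip_seen(stream: Iterable[dict], seen: Set[str]) -> Iterator[dict]:
    """Yield only examples whose fingerprint is not in `seen`."""
    for example in stream:
        if text_hash(example) not in seen:
            yield example


# Pretend the first text was annotated before the session was interrupted
already_annotated = [{"text": "这个产品非常好用！"}]
seen_hashes = {text_hash(eg) for eg in already_annotated}

source = [
    {"text": "这个产品非常好用！"},
    {"text": "质量太差了，退货"},
    {"text": "还行吧，一般般"},
]

remaining = list(skip_seen(source, seen_hashes))
print(len(remaining))
```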

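Problem 5: Several people annotating at the same time

# Start the server once
prodigy textcat.manual my_data texts.jsonl --label A,B

# Each annotator opens the app with their own session name in the URL
# http://localhost:8080/?session=annotator1
# http://localhost:8080/?session=annotator2

# All annotations go to my_data, and each named session is also tracked separately (for example my_data-annotator1)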
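
With several annotators, a quick agreement check on the exported data helps spot unclear guidelines. The sketch below assumes each exported record has `_input_hash`, `_session_id` and `answer` fields; the function `simple_agreement` and the sample records are made up, and the metric is plain percent agreement rather than a chance-corrected statistic.

```python
from collections import defaultdict


def simple_agreement(records):
    """Share of multiply-annotated inputs on which all sessions gave the same answer."""
    by_input = defaultdict(dict)
    for record in records:
        by_input[record["_input_hash"]][record["_session_id"]] = record["answer"]
    # Only inputs seen by at least two sessions can agree or disagree
    shared = [answers for answers in by_input.values() if len(answers) >= 2]
    if not shared:
        return None
    agreed = sum(1 for answers in shared if len(set(answers.values())) == 1)
    return agreed / len(shared)


# Made-up records in the assumed export shape
records = [
    {"_input_hash": 1, "_session_id": "my_data-annotator1", "answer": "accept"},
    {"_input_hash": 1, "_session_id": "my_data-annotator2", "answer": "accept"},
    {"_input_hash": 2, "_session_id": "my_data-annotator1", "answer": "accept"},
    {"_input_hash": 2, "_session_id": "my_data-annotator2", "answer": "reject"},
    {"_input_hash": 3, "_session_id": "my_data-annotator1", "answer": "accept"},
]

print(simple_agreement(records))
```

Note that annotators only see the same examples if the feed is configured to overlap between sessions; otherwise there is nothing to compare.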

Problem 6: Exporting to common training formats

# Export to the spaCy v3 training format
prodigy data-to-spacy ./corpus --ner my_ner --eval-split 0.2

# Export the raw JSONL
prodigy db-out my_data > output.jsonl

# Convert to other formats with Python
python -c "
import json
with open('output.jsonl', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]
    # convert to the format you need
"
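
As one concrete case of such a conversion, the sketch below turns exported NER records into `(text, {"entities": [(start, end, label)]})` tuples. It assumes each record has `text`, `answer` and a `spans` list with character offsets `start` and `end` plus a `label`; the function name `to_entity_tuples` and the sample record are made up for illustration.

```python
import json


def to_entity_tuples(lines):
    """Convert accepted NER records to (text, {"entities": [...]}) tuples."""
    examples = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue  # keep only examples the annotator accepted
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples


# Made-up record in the assumed export shape
sample = json.dumps(
    {
        "text": "马云在杭州创立了阿里巴巴",
        "spans": [
            {"start": 0, "end": 2, "label": "PERSON"},
            {"start": 3, "end": 5, "label": "GPE"},
            {"start": 8, "end": 12, "label": "ORG"},
        ],
        "answer": "accept",
    },
    ensure_ascii=False,
)

converted = to_entity_tuples([sample])
text, annotations = converted[0]
for start, end, label in annotations["entities"]:
    print(text[start:end], label)
```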

References

Official documentation: https://prodi.gy/docs
Buy Prodigy: https://prodi.gy/buy
Recipe reference: https://prodi.gy/docs/recipes
spaCy documentation: https://spacy.io/
Explosion blog: https://explosion.ai/blog
Prodigy Support Forum: https://support.prodi.gy/
spaCy Universe (ecosystem): https://spacy.io/universe
Prodigy tutorials on YouTube: search for "Prodigy annotation tutorial"