Document OCR with DocTR

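Why learn DocTR

DocTR (Document Text Recognition) is an end-to-end document OCR library developed by Mindee. It chains two deep learning models, one for text detection and one for text recognition, to extract text from document images. Compared with traditional OCR engines such as Tesseract, it tends to cope better with complex layouts, rotated text, and low-quality images. That makes it a strong candidate for applications that extract text from scans and photos, such as invoice processing, ID card recognition, and document digitization.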

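Core concepts

Concept	Plain-language meaning	Purpose
Text Detection	Finding text	Locates text in the image and returns bounding boxes
Text Recognition	Reading text	Converts each detected text crop into a string
OCR Predictor	End-to-end predictor	Pipeline that chains detection and recognition
Document	Document object	Structured representation of the OCR result
Page/Block/Line/Word	Hierarchy	Page → block → line → word
Geometry	Geometric information	Relative coordinates of each element (a rotated polygon when pages are not assumed straight)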
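
Geometry values are expressed relative to the page size, so drawing or cropping on the original image requires scaling them back to pixels. The sketch below assumes the straight-box format `((xmin, ymin), (xmax, ymax))` and page dimensions given as `(height, width)`; `to_pixel_box` is a hypothetical helper written for illustration, not part of DocTR.

```python
from typing import Tuple

RelBox = Tuple[Tuple[float, float], Tuple[float, float]]


def to_pixel_box(geometry: RelBox, dimensions: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Scale a relative ((xmin, ymin), (xmax, ymax)) box to integer pixel coordinates."""
    height, width = dimensions  # assumed order: (height, width)
    (xmin, ymin), (xmax, ymax) = geometry
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


# A box covering the top-left quarter of a page that is 1000 px high and 800 px wide
print(to_pixel_box(((0.0, 0.0), (0.5, 0.5)), (1000, 800)))  # (0, 0, 400, 500)
```

The returned tuple is ordered `(left, top, right, bottom)`, which matches what most image-cropping functions expect.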

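Installation and setup

Installation

# PyTorch backend
pip install "python-doctr[torch]"

# TensorFlow backend (older releases only; recent releases have dropped TensorFlow support, check the release notes for your version)
pip install "python-doctr[tf]"

# With optional extras
pip install "python-doctr[torch,viz]"  # visualization support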

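Verifying the installation

from doctr.models import ocr_predictor

# The first call downloads the pretrained weights, so it needs network access
model = ocr_predictor(pretrained=True)
print("DocTR installed successfully")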

Quick start

Basic OCR

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load the model
model = ocr_predictor(pretrained=True)

# Load the document
doc = DocumentFile.from_pdf("document.pdf")
# Or from a single image
# doc = DocumentFile.from_images("scan.jpg")
# Or from several images at once
# doc = DocumentFile.from_images(["page1.jpg", "page2.jpg"])

# Run OCR
result = model(doc)

# Print the recognized text
print(result.render())

Structured results

import json

# Walk the result hierarchy: page → block → line → word
for page in result.pages:
    print(f"Page dimensions: {page.dimensions}")  # (height, width)
    for block in page.blocks:
        for line in block.lines:
            line_text = " ".join(word.value for word in line.words)
            print(f"  Line: {line_text}")
            for word in line.words:
                print(f"    Word: {word.value} (confidence: {word.confidence:.2f})")

# Export as a nested dictionary
json_output = result.export()

# Write the dictionary to a JSON file; the explicit encoding keeps non-ASCII text intact
with open("ocr_result.json", "w", encoding="utf-8") as f:
    json.dump(json_output, f, ensure_ascii=False, indent=2)
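
Once the result is exported, a common follow-up is flagging words the model was unsure about so that a person can review them. The sketch below works on the exported dictionary only and assumes the nested `pages` → `blocks` → `lines` → `words` layout, with `value` and `confidence` keys on each word; `iter_words` and `low_confidence_words` are hypothetical helper names, and the 0.6 threshold is an arbitrary example.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_words(export: Dict[str, Any]) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (page_index, word_dict) for every word in an exported result."""
    for page_idx, page in enumerate(export.get("pages", [])):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    yield page_idx, word


def low_confidence_words(
    export: Dict[str, Any], threshold: float = 0.5
) -> List[Tuple[int, str, float]]:
    """Return (page_index, text, confidence) for words scoring below the threshold."""
    return [
        (page_idx, word["value"], word["confidence"])
        for page_idx, word in iter_words(export)
        if word["confidence"] < threshold
    ]


# Hand-built stand-in for result.export(), reduced to the keys used above
sample = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Total", "confidence": 0.98},
                                {"value": "lnvoice", "confidence": 0.41},
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

print(low_confidence_words(sample, threshold=0.6))  # [(0, 'lnvoice', 0.41)]
```

With a real result, the call would be `low_confidence_words(result.export(), threshold=0.6)`.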

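Visualization

import matplotlib.pyplot as plt
from doctr.utils.visualization import visualize_page

# Overlay the predictions on each page. visualize_page expects the page image
# (not its dimensions) as the second argument; `doc` is the list of page arrays loaded earlier.
for page, image in zip(result.pages, doc):
    visualize_page(page.export(), image, interactive=True)
    plt.show()

# Shortcut: display every page with its predictions overlaid
result.show()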

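Advanced usage

Custom model combinations

from doctr.models import ocr_predictor, detection_predictor, recognition_predictor

# Choose the detection and recognition architectures by name.
# ocr_predictor takes architecture names (or model instances), not predictor objects.
model = ocr_predictor(
    det_arch="db_resnet50",
    reco_arch="crnn_vgg16_bn",
    pretrained=True,
)

# Each stage can also be built and run independently
det_model = detection_predictor("db_resnet50", pretrained=True)
rec_model = recognition_predictor("crnn_vgg16_bn", pretrained=True)

# Available detection models include:
# db_resnet50, db_mobilenet_v3_large, linknet_resnet18, ...

# Available recognition models include:
# crnn_vgg16_bn, crnn_mobilenet_v3_small, sar_resnet31, master, ...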

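Chinese OCR

from doctr.models import ocr_predictor

# Detection models are largely language-agnostic, but a recognition model can only
# output characters from the vocabulary it was trained on.
# Caveat: the stock pretrained recognition weights are generally trained on a
# Latin-script vocabulary, so this call will not produce Chinese characters by itself.
# For Chinese text, use a recognition model trained or fine-tuned with a Chinese vocabulary.
model = ocr_predictor(
    det_arch='db_resnet50',
    reco_arch='crnn_vgg16_bn',  # swap in a Chinese-capable recognition model
    pretrained=True,
)

# assume_straight_pages=False handles rotated pages; it does not add language support
model = ocr_predictor(pretrained=True, assume_straight_pages=False)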
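
Because a recognition model can only emit characters from its vocabulary, it helps to check a few sample strings against the vocabulary before relying on a model for Chinese documents. The sketch below is a plain-Python check; `uncovered_chars` is a hypothetical helper, and `latin_vocab` is a shortened stand-in rather than an actual DocTR vocabulary (DocTR keeps its predefined vocabularies as strings, which you could pass in instead if your version exposes them).

```python
from typing import List


def uncovered_chars(text: str, vocab: str) -> List[str]:
    """Return distinct non-whitespace characters of `text` that are missing from `vocab`."""
    allowed = set(vocab)
    missing: List[str] = []
    for ch in text:
        if ch.isspace() or ch in allowed or ch in missing:
            continue
        missing.append(ch)  # keep order of first appearance
    return missing


# Shortened stand-in vocabulary (assumption, for illustration only)
latin_vocab = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

print(uncovered_chars("发票 No 42", latin_vocab))  # ['发', '票']
```

A non-empty result means the model cannot output those characters, no matter how clean the input image is.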

Batch processing

import json
from pathlib import Path

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)

input_dir = Path("./scans")
output_dir = Path("./ocr_results")
output_dir.mkdir(parents=True, exist_ok=True)

for file_path in sorted(input_dir.glob("*.pdf")):
    print(f"Processing: {file_path.name}")
    doc = DocumentFile.from_pdf(str(file_path))
    result = model(doc)

    # Save the plain text
    text = result.render()
    output_file = output_dir / f"{file_path.stem}.txt"
    output_file.write_text(text, encoding="utf-8")

    # Save the structured result as JSON
    json_file = output_dir / f"{file_path.stem}.json"
    json_file.write_text(
        json.dumps(result.export(), ensure_ascii=False, indent=2),
        encoding="utf-8"
    )
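
Long batch runs are often interrupted, and rerunning OCR on files that already have output wastes time. One possible approach, sketched below, is to skip any PDF whose `.txt` output already exists. `pending_pdfs` is a hypothetical helper, and it assumes the output naming used in the loop above (`<stem>.txt`).

```python
from pathlib import Path
from typing import List


def pending_pdfs(input_dir: Path, output_dir: Path) -> List[Path]:
    """List PDFs in input_dir that have no matching .txt file in output_dir yet."""
    done = {p.stem for p in output_dir.glob("*.txt")}
    return sorted(p for p in input_dir.glob("*.pdf") if p.stem not in done)
```

In the batch loop, iterating over `pending_pdfs(input_dir, output_dir)` instead of `input_dir.glob("*.pdf")` lets an interrupted run resume where it stopped.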

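GPU acceleration

import torch
from doctr.models import ocr_predictor

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# With the PyTorch backend the predictor is a torch module and can be moved like any other
model = ocr_predictor(pretrained=True).to(device)

# Tune the batch sizes
model = ocr_predictor(
    pretrained=True,
    det_bs=4,    # detection batch size
    reco_bs=128, # recognition batch size
).to(device)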

Preprocessing

import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

# Optional cleanup for low-quality images before OCR
def preprocess_image(image_path):
    img = Image.open(image_path).convert("RGB")

    # Boost contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(1.5)

    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)

    # Binarize (can help on clean scans; compare results with and without this step)
    img = img.convert("L").point(lambda x: 0 if x < 128 else 255, "1")

    # Return an RGB uint8 array, the input format the predictor expects
    return np.asarray(img.convert("RGB"))

# Usage: result = model([preprocess_image("scan.jpg")])

Deploying as an API

import tempfile
from pathlib import Path

from fastapi import FastAPI, UploadFile
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

app = FastAPI()
model = ocr_predictor(pretrained=True)  # load once at startup, not per request

@app.post("/ocr")
async def perform_ocr(file: UploadFile):
    # Keep only the extension: the client-supplied name may be missing or contain path separators
    suffix = Path(file.filename or "").suffix.lower()

    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp.flush()

        if suffix == ".pdf":
            doc = DocumentFile.from_pdf(tmp.name)
        else:
            doc = DocumentFile.from_images(tmp.name)

        result = model(doc)
        return {
            "text": result.render(),
            "details": result.export()
        }
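
The endpoint treats every non-PDF upload as an image, so an unsupported file only fails once the loader tries to decode it. A small validation step can reject such uploads earlier. The sketch below is one way to do that; `upload_kind` is a hypothetical helper, and the image extension allow-list is an assumption you would adapt to the formats you actually accept.

```python
from pathlib import PurePath
from typing import Optional

# Assumed allow-list of image extensions, for illustration
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def upload_kind(filename: Optional[str]) -> Optional[str]:
    """Classify an uploaded file name as 'pdf', 'image', or None if unsupported."""
    if not filename:
        return None
    suffix = PurePath(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    return None


print(upload_kind("Invoice.PDF"))  # pdf
print(upload_kind("scan.jpeg"))    # image
print(upload_kind("notes.txt"))    # None
```

Inside the endpoint, a `None` result could be turned into an HTTP 400 response (for example by raising FastAPI's `HTTPException`) before any temporary file is written.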

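FAQ

Q: How does it compare with Tesseract?

Tesseract: classic OCR engine; works well on clean printed text and is lightweight
DocTR: deep learning based; stronger on complex layouts and low-quality images, and supports training custom models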

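Q: How can I improve accuracy?

Make sure the image resolution is at least 300 DPI
Pass assume_straight_pages=False for rotated documents
Preprocess scans (denoising, contrast enhancement)
Choose larger models (for example db_resnet50 + sar_resnet31)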
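
The 300 DPI guideline is easy to check when the physical page size is known: divide the pixel count along one side by the length of that side in inches. The sketch below does this for an A4 page; `effective_dpi` is a hypothetical helper, and the A4 width of 8.27 in (210 mm) is the only constant it relies on.

```python
A4_WIDTH_IN = 8.27  # 210 mm / 25.4 mm per inch, rounded


def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution in dots per inch for a side of `pixels` pixels spanning `inches` inches."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


print(round(effective_dpi(2480, A4_WIDTH_IN)))  # 300
print(round(effective_dpi(1240, A4_WIDTH_IN)))  # 150
```

An A4 scan that is only 1240 pixels wide is therefore at about half the recommended resolution, and rescanning or rendering at a higher scale is likely to help more than any post-processing.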

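Q: Does it support handwriting?

To a limited extent, but results are worse than on printed text. For handwriting, a dedicated handwriting recognition model or service is recommended.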

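Q: Processing is slow. How can I speed it up?

Use GPU acceleration
Increase the batch sizes (det_bs, reco_bs)
Use lightweight models (for example db_mobilenet_v3_large)
For PDFs, process only the pages that actually need OCR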
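
Processing only selected pages works because a loaded PDF is a list of page arrays, so a subset can be passed to the predictor directly. The sketch below picks pages by 1-based page number; `select_pages` is a hypothetical helper, and the strings in the example stand in for the page arrays returned by `DocumentFile.from_pdf`.

```python
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def select_pages(pages: Sequence[T], page_numbers: Iterable[int]) -> List[T]:
    """Pick pages by 1-based page number, skipping numbers that are out of range."""
    return [pages[n - 1] for n in page_numbers if 1 <= n <= len(pages)]


# Stand-ins for page arrays
pages = ["p1", "p2", "p3", "p4"]
print(select_pages(pages, [1, 3, 9]))  # ['p1', 'p3']
```

With a real document, the call would look like `result = model(select_pages(doc, [1, 3]))`.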

References

GitHub: https://github.com/mindee/doctr
Documentation: https://mindee.github.io/doctr/
Model zoo: https://mindee.github.io/doctr/models.html
Online demo: https://demo.mindee.net/
Paper: https://arxiv.org/abs/2111.08189