Tesseract OCR Text Recognition

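In one sentence: Tesseract is an open-source OCR (optical character recognition) engine, originally developed at HP and sponsored by Google for many years, that extracts text from images and supports 100+ languages.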

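Core Concepts

Concept	Plain-language explanation
OCR	Optical character recognition: turning images into text
Tesseract	The OCR engine: the open-source recognition core
tessdata	Trained data: the recognition models for each language
PSM	Page segmentation mode: tells the engine how the text in the image is laid out
pytesseract	Python wrapper: calls Tesseract from Python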

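Installation and Setup

# Linux (Ubuntu/Debian)
sudo apt install tesseract-ocr                         # install the engine
sudo apt install tesseract-ocr-chi-sim                 # Simplified Chinese language pack
sudo apt install tesseract-ocr-chi-tra                 # Traditional Chinese language pack
sudo apt install tesseract-ocr-eng                     # English (usually installed along with the engine)

# macOS
brew install tesseract                                 # install the engine
brew install tesseract-lang                            # install all language packs

# Verify
tesseract --version                                    # show the version (5.x recommended)
tesseract --list-langs                                 # list the installed languages

# Python wrapper
pip install pytesseract Pillow                         # install the Python libraries
pip install opencv-python                              # only needed for the cv2 preprocessing example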
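
The verification step can also be scripted. The sketch below uses only the standard library; the function names are made up for illustration, and it assumes `tesseract --list-langs` prints one header line starting with "List of" followed by one language code per line.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of"):
            continue
        langs.append(line)
    return langs


def missing_languages(required: list[str], installed: list[str]) -> list[str]:
    """Return the required language codes that are not installed."""
    return [lang for lang in required if lang not in installed]


def installed_languages() -> list[str]:
    """Ask the tesseract binary for its languages; empty list if it is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Assumption: fall back to stderr in case a build prints the list there.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("installed:", langs)
    print("missing:", missing_languages(["eng", "chi_sim"], langs))
```

If `chi_sim` shows up under "missing", install the matching language pack from the commands above.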

Command-Line Usage

# Basic recognition
tesseract image.png output                             # recognize → output.txt
tesseract image.png stdout                             # recognize → print to the terminal

# Specify the language
tesseract image.png output -l chi_sim                  # Simplified Chinese
tesseract image.png output -l eng+chi_sim              # mixed English + Chinese

# Specify the output format
tesseract image.png output pdf                         # searchable PDF
tesseract image.png output hocr                        # hOCR (HTML with coordinates)
tesseract image.png output tsv                         # TSV (includes confidence)

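# Page segmentation mode (PSM)
tesseract image.png output --psm 3                     # fully automatic page segmentation (default)
tesseract image.png output --psm 6                     # assume a single uniform block of text
tesseract image.png output --psm 7                     # treat the image as a single text line
tesseract image.png output --psm 13                    # raw line: single text line, bypassing Tesseract-specific hacks

# Common PSM values
# 3 = fully automatic segmentation (default; suits full-page documents)
# 6 = single uniform text block (suits screenshots of tables/paragraphs)
# 7 = single text line
# 8 = single word
# 10 = single character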
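
To show how the options above combine, here is a small sketch that assembles the argument list for one call. `build_tesseract_command` is a hypothetical helper, not part of Tesseract or pytesseract; it only builds the list and runs nothing.

```python
from typing import Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str = "stdout",
    langs: Sequence[str] = ("eng",),
    psm: int = 3,
    fmt: Optional[str] = None,
) -> list[str]:
    """Assemble `tesseract <img> <output> -l <langs> --psm <N> [format]`."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    cmd = [
        "tesseract", image, output_base,
        "-l", "+".join(langs),   # several languages are joined with '+'
        "--psm", str(psm),
    ]
    if fmt:
        cmd.append(fmt)          # output format such as pdf, hocr, or tsv goes last
    return cmd


print(build_tesseract_command("image.png", "output", ["eng", "chi_sim"], psm=6, fmt="pdf"))
# ['tesseract', 'image.png', 'output', '-l', 'eng+chi_sim', '--psm', '6', 'pdf']
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names that contain spaces.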

Python Usage

import pytesseract                                     # import pytesseract
from PIL import Image                                  # import the Pillow imaging library

# Basic recognition
img = Image.open('document.png')                       # open the image
text = pytesseract.image_to_string(img, lang='chi_sim') # recognize Simplified Chinese
print(text)                                            # print the recognized text

# Preprocess to improve accuracy
import cv2                                             # OpenCV image processing

img = cv2.imread('document.png')                       # read the image (BGR array)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)           # convert to grayscale
thresh = cv2.threshold(gray, 0, 255,                   # binarize with Otsu's threshold
    cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
text = pytesseract.image_to_string(                    # recognize the preprocessed image
    thresh, lang='chi_sim')

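# Get detailed results (with confidence and coordinates)
data = pytesseract.image_to_data(img,                  # get per-word data
    lang='chi_sim', output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):                # iterate over the results
    conf = float(data['conf'][i])                      # confidence; may arrive as a float or a string, -1 for non-word rows
    if conf > 60 and word.strip():                     # keep non-empty words with confidence > 60
        print(f"Text: {word}, confidence: {conf:.0f}%")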

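# Batch processing
import glob
for img_path in glob.glob('pages/*.png'):              # iterate over all images
    text = pytesseract.image_to_string(
        Image.open(img_path), lang='chi_sim')
    txt_path = img_path.replace('.png', '.txt')        # matching .txt path
    with open(txt_path, 'w', encoding='utf-8') as f:   # explicit UTF-8 so Chinese text saves on any platform
        f.write(text)                                  # save the recognized text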
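
The confidence filter above can be pulled into a reusable function. The sketch below works on a dictionary shaped like the `Output.DICT` result of `image_to_data`; the function name and the sample values are made up for illustration, and no Tesseract call is involved.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence, (left, top, width, height)) for reliable words."""
    words = []
    for i, text in enumerate(data["text"]):
        text = str(text).strip()
        try:
            conf = float(data["conf"][i])   # tolerate ints, floats, and numeric strings
        except (TypeError, ValueError):
            continue
        if not text or conf < min_conf:     # drop layout rows and low-confidence words
            continue
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        words.append((text, conf, box))
    return words


# Made-up sample in the same shape as the image_to_data dictionary.
sample = {
    "text": ["", "Hello", "wor1d", "  "],
    "conf": ["-1", 96.4, "41.2", 95],
    "left": [0, 10, 80, 150],
    "top": [0, 12, 12, 12],
    "width": [200, 60, 62, 5],
    "height": [40, 18, 18, 18],
}
print(confident_words(sample))
# [('Hello', 96.4, (10, 12, 60, 18))]
```

The bounding boxes make it possible to draw the kept words back onto the image or to group them into lines by their `top` coordinate.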

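Common Errors

Error	Cause	Fix
`TesseractNotFoundError`	Python cannot find the tesseract executable	Install tesseract, or set `pytesseract.pytesseract.tesseract_cmd`
`Failed loading language 'chi_sim'`	The Chinese language pack is not installed	`sudo apt install tesseract-ocr-chi-sim`
Low recognition accuracy	Poor image quality	Preprocess: grayscale + binarization + denoising
Garbled output	Wrong language selected	Specify the correct language with `-l chi_sim`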
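
For the `TesseractNotFoundError` row, one possible way to locate the executable before setting `tesseract_cmd` is sketched below. `resolve_tesseract_cmd` and the candidate paths are assumptions for illustration; the real install location depends on the machine.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates=()):
    """Return a path to the tesseract executable, or None if none is found."""
    found = shutil.which("tesseract")       # first choice: whatever is on PATH
    if found:
        return found
    for path in candidates:                 # otherwise try the listed locations in order
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


# Hypothetical install locations; adjust for the machine at hand.
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
]

if __name__ == "__main__":
    cmd = resolve_tesseract_cmd(CANDIDATES)
    if cmd is None:
        print("tesseract not found; install it first")
    else:
        print("use:", cmd)
        # With pytesseract installed, the path can then be applied like this:
        # import pytesseract
        # pytesseract.pytesseract.tesseract_cmd = cmd
```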

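Cheat Sheet

# Command line
tesseract <img> <output> [-l lang] [--psm N]           # basic usage
tesseract image.png stdout -l chi_sim+eng              # mixed Chinese + English
tesseract image.png out pdf                            # output a PDF
tesseract --list-langs                                 # list languages

# Tips for improving accuracy
# 1. Image resolution >= 300 DPI
# 2. Preprocess: grayscale → binarize → denoise
# 3. Choose the right PSM mode
# 4. Use the correct language pack
# 5. For tables, split into cells first, then recognize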

# Alternatives
# EasyOCR → Python-native, often high accuracy, supports GPU
# PaddleOCR → from Baidu, often regarded as strong on Chinese
# Cloud services → Google Vision API / Baidu OCR / Tencent OCR
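
Tips 1 and 2 from the cheat sheet can be combined into one preprocessing function. This is a minimal sketch, assuming OpenCV (`opencv-python`) is installed and that the scan's DPI is known; `upscale_factor` and `preprocess` are hypothetical names, and the 3x3 median filter is just one possible denoising choice.

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Scale factor needed to bring a scan up to target_dpi (never below 1.0)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def preprocess(path, current_dpi=300):
    """Grayscale -> upscale -> binarize -> denoise; returns a 2-D uint8 array."""
    import cv2  # imported here so upscale_factor works without OpenCV installed

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)       # load directly as grayscale
    if gray is None:                                    # imread returns None on failure
        raise FileNotFoundError(path)
    factor = upscale_factor(current_dpi)
    if factor > 1.0:                                    # enlarge low-resolution scans
        gray = cv2.resize(gray, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_CUBIC)
    _, binary = cv2.threshold(gray, 0, 255,             # Otsu picks the threshold itself
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.medianBlur(binary, 3)                    # remove isolated specks


print(upscale_factor(150))  # 2.0
```

The returned array can be passed straight to `pytesseract.image_to_string`, in the same way as the `thresh` image in the Python section.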