
Document Parsing with Unstructured

Why Learn Unstructured

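Unstructured is an open-source document preprocessing library that extracts clean text, tables, and metadata from unstructured documents (PDF, Word, PowerPoint, images, HTML, email, and more). It is a key link in any RAG system: without good document parsing, vector retrieval has nothing reliable to build on. Unstructured exposes a single API across 20+ file formats and is widely used as the document-loader backend in frameworks such as LangChain and LlamaIndex.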


Core Concepts

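Concept      Plain meaning    Purpose
Partition    Parsing          Split a document into structured elements
Element      Element          The smallest parsed unit (title, paragraph, table, etc.)
Chunking     Chunking         Combine elements into text chunks suited to embedding
Connector    Connector        Fetch documents from various data sources
Staging      Staging          Convert parsed results into a specific output format
Cleaning     Cleaning         Remove noise (headers, footers, extra whitespace, etc.)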

Installation and Setup

Basic installation

pip install unstructured

# Full install (support for every document format)
pip install "unstructured[all-docs]"

# Install only the extras you need
pip install "unstructured[pdf]"       # PDF support
pip install "unstructured[docx]"      # Word support
pip install "unstructured[pptx]"      # PowerPoint support
pip install "unstructured[xlsx]"      # Excel support
pip install "unstructured[md]"        # Markdown support

System dependencies

# Ubuntu/Debian
sudo apt-get install -y \
  libmagic-dev poppler-utils tesseract-ocr \
  libreoffice pandoc

# macOS
brew install libmagic poppler tesseract libreoffice pandoc

Using Docker (avoids dependency issues)

docker pull quay.io/unstructured-io/unstructured:latest

docker run -v $PWD/docs:/docs \
  quay.io/unstructured-io/unstructured:latest \
  --input-dir /docs --output-dir /docs/output

Quick Start

Parsing a document

from unstructured.partition.auto import partition

# Detect the file format automatically
elements = partition(filename="document.pdf")

# Inspect the parsed elements
for element in elements:
    print(f"[{type(element).__name__}] {str(element)[:100]}")
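Before going further it often helps to see how many elements of each type a document produced. The sketch below counts elements by class name. `summarize_elements` is a hypothetical helper, not part of Unstructured, and the two stand-in classes exist only so the example runs without the library; in practice you would pass the list returned by `partition`.

```python
from collections import Counter


def summarize_elements(elements):
    """Count parsed elements by class name (Title, NarrativeText, ...)."""
    return Counter(type(el).__name__ for el in elements)


# Stand-in classes so the sketch runs without unstructured installed.
class Title:
    pass


class NarrativeText:
    pass


sample = [Title(), NarrativeText(), NarrativeText()]
summary = summarize_elements(sample)
for name, count in summary.most_common():
    print(f"{name}: {count}")
```

A document dominated by one unexpected type (for example, almost everything classified as Title) is usually a sign that a different parsing strategy is worth trying.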

Format-specific partitioners

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.docx import partition_docx
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.email import partition_email

# PDF (high-resolution mode)
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",            # high resolution (layout analysis plus OCR)
    infer_table_structure=True,   # recover table structure
    languages=["chi_sim", "eng"], # OCR languages
)

# Word
elements = partition_docx(filename="report.docx")

# HTML (fetched from a URL)
elements = partition_html(url="https://example.com/article")

# PowerPoint
elements = partition_pptx(filename="slides.pptx")

# Email
elements = partition_email(filename="message.eml")

Element types

from unstructured.documents.elements import (
    Title, NarrativeText, Table, ListItem, 
    Image, Header, Footer, FigureCaption
)

for element in elements:
    if isinstance(element, Title):
        print(f"Title: {element.text}")
    elif isinstance(element, NarrativeText):
        print(f"Body text: {element.text[:50]}...")
    elif isinstance(element, Table):
        print(f"Table: {element.metadata.text_as_html}")
    elif isinstance(element, ListItem):
        print(f"List item: {element.text}")

# Metadata attached to every element
for element in elements:
    meta = element.metadata
    print(f"Page number: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Coordinates: {meta.coordinates}")
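Element metadata is most useful once it is used to regroup content. As a minimal sketch, the helper below collects element text per page using `metadata.page_number`. `text_by_page` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that only mimic the `text` and `metadata.page_number` attributes shown above.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(elements):
    """Join element text per page; elements without a page land under None."""
    pages = defaultdict(list)
    for el in elements:
        page = getattr(el.metadata, "page_number", None)
        pages[page].append(el.text)
    return {page: "\n".join(texts) for page, texts in pages.items()}


def fake_element(text, page):
    """Stand-in exposing only .text and .metadata.page_number."""
    return SimpleNamespace(text=text, metadata=SimpleNamespace(page_number=page))


sample = [
    fake_element("Introduction", 1),
    fake_element("Parsing comes first.", 1),
    fake_element("Results", 2),
]
pages = text_by_page(sample)
for page, text in pages.items():
    print(f"--- page {page} ---\n{text}")
```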

Advanced Usage

Chunking

from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Chunk by title (keeps each section together)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
    multipage_sections=True,
)

# Basic chunking (fixed size with overlap)
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"[{len(chunk.text)} chars] {chunk.text[:80]}...")
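To make the idea behind title-based chunking concrete, here is a deliberately simplified pure-Python illustration. It is not the library's algorithm: `chunk_by_heading` is a hypothetical function that takes `(kind, text)` pairs, starts a new chunk at every title, and also starts one when adding the next text would exceed `max_characters`. It does not split a single oversized text and has no overlap or small-section merging.

```python
def chunk_by_heading(items, max_characters=60):
    """Group (kind, text) pairs into chunks that break at titles and size limits."""
    chunks, current = [], []

    def flush():
        # Close the chunk being built, if any.
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for kind, text in items:
        candidate_len = len("\n".join(current + [text]))
        if kind == "Title" or candidate_len > max_characters:
            flush()
        current.append(text)
    flush()
    return chunks


items = [
    ("Title", "Intro"),
    ("NarrativeText", "Unstructured parses documents."),
    ("Title", "Setup"),
    ("NarrativeText", "Install with pip."),
    ("NarrativeText", "Then install system packages such as poppler."),
]
chunks = chunk_by_heading(items, max_characters=60)
for chunk in chunks:
    print(f"[{len(chunk)} chars] {chunk!r}")
```

The "Setup" section ends up in two chunks because its second paragraph would push the chunk past the limit, which is the same trade-off `max_characters` controls in the real chunkers.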

Advanced PDF parsing

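from io import StringIO

import pandas as pd

# Choose a parsing strategy
elements = partition_pdf(
    filename="complex.pdf",
    strategy="hi_res",              # one of: fast / auto / hi_res / ocr_only
    model_name="yolox",             # layout detection model (newer releases name this hi_res_model_name)
    infer_table_structure=True,
    extract_images_in_pdf=True,     # extract images (newer releases prefer extract_image_block_types)
    extract_image_block_to_payload=True,
    languages=["chi_sim", "eng"],
)

# Convert each extracted table to a DataFrame
for element in elements:
    if isinstance(element, Table) and element.metadata.text_as_html:
        html = element.metadata.text_as_html
        # Wrap in StringIO: recent pandas versions deprecate passing raw HTML strings
        df = pd.read_html(StringIO(html))[0]
        print(df)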
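If pulling in pandas is more than you need, the HTML stored in `text_as_html` can also be turned into plain rows with the standard library. The sketch below is a minimal illustration under simplifying assumptions: `html_table_to_rows` is a hypothetical helper, it handles only flat tables (no nested tables, no `rowspan` or `colspan`), and the sample HTML is made up for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a flat HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample_html = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```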

Batch processing

from pathlib import Path
from unstructured.partition.auto import partition

docs_dir = Path("./documents")
all_elements = []

for file_path in docs_dir.rglob("*"):
    if file_path.suffix.lower() in [".pdf", ".docx", ".txt", ".md", ".html"]:
        try:
            elements = partition(filename=str(file_path))
            # Record the full source path on each element
            for el in elements:
                el.metadata.filename = str(file_path)
            all_elements.extend(elements)
            print(f"OK: {file_path.name} ({len(elements)} elements)")
        except Exception as e:
            print(f"Error: {file_path.name}: {e}")
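The loop above mixes file discovery, parsing, and error reporting. One way to keep those concerns apart is to pass the parser in as a function and collect failures separately, as in the sketch below. `parse_tree` and `fake_parse` are hypothetical names used for illustration; in real use you would pass something like `lambda p: partition(filename=str(p))` instead of the fake parser.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md", ".html"}


def parse_tree(root, parse_fn, suffixes=SUPPORTED_SUFFIXES):
    """Apply parse_fn to every supported file under root.

    Returns (results, failures), both keyed by path relative to root.
    """
    root = Path(root)
    results, failures = {}, {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        key = path.relative_to(root).as_posix()
        try:
            results[key] = parse_fn(path)
        except Exception as exc:  # keep going; report the failure afterwards
            failures[key] = str(exc)
    return results, failures


def fake_parse(path):
    """Hypothetical parser: returns non-empty lines and rejects empty files."""
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError("empty file")
    return [line for line in text.splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "a.txt").write_text("first\nsecond\n", encoding="utf-8")
    (base / "notes.MD").write_text("", encoding="utf-8")
    (base / "skip.bin").write_text("ignored", encoding="utf-8")
    results, failures = parse_tree(base, fake_parse)

print("parsed:", results)
print("failed:", failures)
```

Keeping failures in a dictionary makes it easy to retry only the files that broke, for example with a slower but more robust strategy.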

Cleaning and post-processing

from unstructured.cleaners.core import (
    clean_extra_whitespace,
    clean_non_ascii_chars,
    clean_bullets,
    remove_punctuation,
    replace_unicode_quotes,
)

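for element in elements:
    element.apply(clean_extra_whitespace)
    element.apply(replace_unicode_quotes)
    # Caution: this strips every non-ASCII character, CJK text included.
    # Keep it only for documents that should be pure ASCII.
    element.apply(clean_non_ascii_chars)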
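Because `element.apply` accepts any callable that maps a string to a string, you can write your own cleaners when the built-in ones do not fit. The sketch below shows one for mixed Chinese and English text: it collapses whitespace runs and removes the stray spaces that extraction often leaves between CJK characters. `tidy_mixed_text` is a hypothetical function, and the character range covers only the basic CJK Unified Ideographs block, which is an assumption you may need to widen.

```python
import re

# Basic CJK Unified Ideographs only; extend the range if your text needs it.
_CJK = r"\u4e00-\u9fff"
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}]) (?=[{_CJK}])")


def tidy_mixed_text(text):
    """Collapse whitespace and drop spaces that sit between two CJK characters."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return _SPACE_BETWEEN_CJK.sub("", collapsed)


raw = "文 档  解析   with  Unstructured "
cleaned = tidy_mixed_text(raw)
print(repr(cleaned))
```

With real elements the call would look like `element.apply(tidy_mixed_text)`, in the same position as the built-in cleaners above.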

Connectors (data sources)

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

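# Batch-process a local directory
# Note: these import paths come from older releases; newer versions ship
# the ingest code as the separate unstructured-ingest package.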
runner = LocalRunner(
    processor_config=ProcessorConfig(
        output_dir="./output",
        num_processes=4,
    ),
    read_config=ReadConfig(),
    connector_config=SimpleLocalConfig(
        input_path="./documents",
        recursive=True,
    ),
)
runner.run()

# Other supported sources include S3, GCS, Azure Blob, Confluence,
# Notion, Slack, Google Drive, and SharePoint

Integration with LangChain

from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "document.pdf",
    mode="elements",  # return each element as its own Document
    strategy="hi_res",
)
docs = loader.load()
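If you would rather not depend on a loader class, the same shape of data can be built by hand. The sketch below converts elements into plain dictionaries with `page_content` and `metadata` keys, mirroring the two fields a LangChain Document carries. `to_records`, `FakeElement`, and `FakeMetadata` are hypothetical names; the stand-ins assume only that an element exposes `text` and `metadata.to_dict()`.

```python
class FakeMetadata:
    """Stand-in for element metadata; only to_dict() is assumed."""

    def __init__(self, **fields):
        self._fields = fields

    def to_dict(self):
        return dict(self._fields)


class FakeElement:
    """Stand-in for a parsed element with .text and .metadata."""

    def __init__(self, text, **meta):
        self.text = text
        self.metadata = FakeMetadata(**meta)


def to_records(elements, source):
    """Turn elements into dicts shaped like {page_content, metadata}."""
    records = []
    for index, el in enumerate(elements):
        metadata = el.metadata.to_dict()
        metadata.update({"source": source, "element_index": index})
        records.append({"page_content": el.text, "metadata": metadata})
    return records


sample = [
    FakeElement("Quarterly report", page_number=1),
    FakeElement("Revenue grew.", page_number=2),
]
records = to_records(sample, source="document.pdf")
for record in records:
    print(record)
```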

FAQ

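Q: Chinese PDFs parse poorly?

  1. Make sure tesseract-ocr and the Chinese language pack are installed (on Ubuntu/Debian, the tesseract-ocr-chi-sim package)
  2. Use strategy="hi_res" together with languages=["chi_sim"]
  3. Scanned PDFs have no text layer and need OCR: strategy="ocr_only"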

Q: Table extraction is inaccurate?

  • Set infer_table_structure=True
  • For complex tables, try the hi_res strategy
  • Consider preprocessing with a dedicated table extraction tool such as Camelot

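Q: Parsing is too slow?

  • The fast strategy is much quicker than hi_res
  • For PDFs with a plain text layer, fast is enough
  • For batch jobs, set num_processes (see the connector example) to run several processes in parallel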

Q: The output is full of noise (headers, footers, and so on)?

Apply the cleaning functions, or filter elements out before chunking:

elements = [el for el in elements 
            if not isinstance(el, (Header, Footer))]
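The same filter can be made reusable and extended with a minimum-length rule, which also drops fragments such as lone page numbers. This is a sketch under stated assumptions: `drop_noise` is a hypothetical helper that matches on class names, the default thresholds are arbitrary, and the small classes below are stand-ins for the real element types.

```python
def drop_noise(elements, noisy_types=("Header", "Footer"), min_chars=3):
    """Keep elements that are neither a noisy type nor shorter than min_chars."""
    kept = []
    for el in elements:
        if type(el).__name__ in noisy_types:
            continue
        if len(el.text.strip()) < min_chars:
            continue
        kept.append(el)
    return kept


class _Base:
    """Stand-in element carrying only a text attribute."""

    def __init__(self, text):
        self.text = text


class Header(_Base):
    pass


class Footer(_Base):
    pass


class NarrativeText(_Base):
    pass


sample = [
    Header("ACME Corp Confidential"),
    NarrativeText("Revenue grew in the third quarter."),
    NarrativeText("7"),
    Footer("Page 7 of 20"),
]
kept = drop_noise(sample)
print([el.text for el in kept])
```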


References

  • GitHub: https://github.com/Unstructured-IO/unstructured
  • Documentation: https://docs.unstructured.io/
  • Supported formats: https://docs.unstructured.io/open-source/core-functionality/partitioning
  • Hosted API service: https://unstructured.io/
  • Cookbook: https://github.com/Unstructured-IO/unstructured/tree/main/examples