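Document Parsing with Unstructured

Why Learn Unstructured

Unstructured is an open-source document preprocessing library that extracts clean text, tables, and metadata from unstructured documents such as PDF, Word, PowerPoint, images, HTML, and email. It is a key link in any RAG system: without good document parsing, vector retrieval has nothing reliable to work with. Unstructured handles more than 20 file formats through a single API, and frameworks such as LangChain and LlamaIndex ship document loaders built on it.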

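Core Concepts

Concept	Plain-language meaning	Purpose
Partition	Parsing	Splits a document into structured elements
Element	Building block	The smallest parsed unit (title, paragraph, table, and so on)
Chunking	Slicing	Combines elements into text blocks sized for embedding
Connector	Data-source adapter	Fetches documents from a data source
Staging	Output formatting	Converts parsed results into a target format
Cleaning	Noise removal	Removes noise (headers, footers, extra whitespace, and so on)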

Installation and Setup

Basic Installation

pip install unstructured

# Full install (support for every format)
pip install "unstructured[all-docs]"

# Install only the extras you need
pip install "unstructured[pdf]"       # PDF support
pip install "unstructured[docx]"      # Word support
pip install "unstructured[pptx]"      # PowerPoint support
pip install "unstructured[xlsx]"      # Excel support
pip install "unstructured[md]"        # Markdown support

System Dependencies

# Ubuntu/Debian
sudo apt-get install -y \
  libmagic-dev poppler-utils tesseract-ocr \
  libreoffice pandoc

# macOS
brew install libmagic poppler tesseract libreoffice pandoc

Using Docker (avoids dependency problems)

docker pull quay.io/unstructured-io/unstructured:latest

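# Assumes the image's default command is a shell rather than a CLI that takes
# --input-dir/--output-dir, so the parsing code is run with python3 explicitly.
docker run --rm -v "$PWD/docs:/docs" \
  quay.io/unstructured-io/unstructured:latest \
  python3 -c "from unstructured.partition.auto import partition; print(len(partition(filename='/docs/document.pdf')))"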

Quick Start

Parsing a Document

from unstructured.partition.auto import partition

# Detect the file format automatically
elements = partition(filename="document.pdf")

# Inspect the parsed elements
for element in elements:
    print(f"[{type(element).__name__}] {str(element)[:100]}")
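
A quick way to sanity-check a parse is to count the elements by type before reading any text. The sketch below is only an illustration: `summarize_elements` is a hypothetical helper, and the two stand-in classes exist so the snippet runs without the library installed. With Unstructured available, the list returned by `partition()` can be passed in directly.

```python
from collections import Counter


def summarize_elements(elements):
    """Count parsed elements by class name, e.g. Title or NarrativeText."""
    return Counter(type(el).__name__ for el in elements)


# Stand-in classes (assumption for this demo only); they mimic the class
# names of real Unstructured elements without importing the library.
class Title(str):
    pass


class NarrativeText(str):
    pass


sample = [Title("Intro"), NarrativeText("First paragraph."), NarrativeText("Second.")]
for name, count in summarize_elements(sample).most_common():
    print(f"{name}: {count}")
```

A document that comes back as mostly untyped or unexpectedly few elements is usually a hint to try a different strategy.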

Format-Specific Partitioners

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.docx import partition_docx
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.email import partition_email

# PDF (high-resolution mode)
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",           # high resolution (uses OCR and layout analysis)
    infer_table_structure=True,  # extract table structure
    languages=["chi_sim", "eng"] # OCR languages
)

# Word
elements = partition_docx(filename="report.docx")

# HTML
elements = partition_html(url="https://example.com/article")

# PPT
elements = partition_pptx(filename="slides.pptx")

# Email
elements = partition_email(filename="message.eml")

Element Types

from unstructured.documents.elements import (
    Title, NarrativeText, Table, ListItem, 
    Image, Header, Footer, FigureCaption
)

for element in elements:
    if isinstance(element, Title):
        print(f"Title: {element.text}")
    elif isinstance(element, NarrativeText):
        print(f"Body: {element.text[:50]}...")
    elif isinstance(element, Table):
        print(f"Table: {element.metadata.text_as_html}")
    elif isinstance(element, ListItem):
        print(f"List item: {element.text}")

# Metadata
for element in elements:
    meta = element.metadata
    print(f"Page: {meta.page_number}")
    print(f"File: {meta.filename}")
    print(f"Coordinates: {meta.coordinates}")
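
Page numbers in the metadata make it easy to regroup text by page, for example to cite a page in a RAG answer. The following is a minimal sketch, assuming each element exposes `.text` and `.metadata.page_number` as in the loop above. `text_by_page` and `fake_element` are hypothetical names, and the `SimpleNamespace` objects are stand-ins for real elements.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(elements):
    """Join element text per page, keyed by metadata.page_number."""
    pages = defaultdict(list)
    for el in elements:
        # page_number may be missing or None for formats that have no pages
        page = getattr(el.metadata, "page_number", None)
        pages[page].append(el.text)
    return {page: "\n".join(texts) for page, texts in pages.items()}


def fake_element(text, page):
    """Build a stand-in object with the two attributes the helper reads."""
    return SimpleNamespace(text=text, metadata=SimpleNamespace(page_number=page))


sample = [fake_element("Title", 1), fake_element("Body A", 1), fake_element("Body B", 2)]
pages = text_by_page(sample)
print(pages[1])
```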

Advanced Usage

Text Chunking

from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Chunk by title (keeps each section together)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
    combine_text_under_n_chars=200,
    multipage_sections=True,
)

# Basic chunking
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"[{len(chunk.text)} chars] {chunk.text[:80]}...")
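
To build intuition for what title-based chunking does with `max_characters`, here is a deliberately simplified pure-Python sketch. It is not the library's implementation: `chunk_by_heading` is a hypothetical function that works on `(kind, text)` pairs, starts a new chunk at every title, and also starts one when the size limit would be exceeded. Unlike the real chunker, it does not split a single item that is longer than the limit.

```python
def chunk_by_heading(items, max_characters=1500):
    """Group (kind, text) pairs into chunks.

    A new chunk starts at every "Title" item, or when appending the next
    item would push the chunk past max_characters.
    """
    chunks, current, size = [], [], 0
    for kind, text in items:
        added = len(text) + (2 if current else 0)  # 2 = len of the "\n\n" joiner
        if current and (kind == "Title" or size + added > max_characters):
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(text)
        current.append(text)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


items = [
    ("Title", "Setup"),
    ("NarrativeText", "Install the package."),
    ("Title", "Usage"),
    ("NarrativeText", "Call partition()."),
    ("NarrativeText", "Then chunk the result."),
]
chunks = chunk_by_heading(items, max_characters=30)
for chunk in chunks:
    print(repr(chunk))
```

Keeping a heading in the same chunk as the text that follows it is the main reason to prefer title-based chunking for retrieval: the heading gives the embedded text its context.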

Advanced PDF Parsing

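from unstructured.partition.pdf import partition_pdf

# Choose a strategy explicitly
elements = partition_pdf(
    filename="complex.pdf",
    strategy="hi_res",              # one of: fast / auto / hi_res / ocr_only
    hi_res_model_name="yolox",      # layout detection model (older releases: model_name)
    infer_table_structure=True,
    extract_image_block_types=["Image", "Table"],  # replaces the deprecated extract_images_in_pdf
    extract_image_block_to_payload=True,           # embed extracted images as base64 in metadata
    languages=["chi_sim", "eng"],
)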

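# Convert extracted tables to DataFrames
from io import StringIO

import pandas as pd
from unstructured.documents.elements import Table

for element in elements:
    if isinstance(element, Table) and element.metadata.text_as_html:
        # pandas 2.1+ deprecates passing literal HTML strings, so wrap the markup in StringIO
        df = pd.read_html(StringIO(element.metadata.text_as_html))[0]
        print(df)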
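
When pandas or an HTML parser backend is not available, the table markup in `text_as_html` can also be read with the standard library. The sketch below is one possible approach, not part of Unstructured: `TableRows` and `html_table_to_rows` are hypothetical names, the sample markup is made up, and merged cells (`rowspan`/`colspan`) are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Parse table markup into a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


html = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Bolt</td><td>12</td></tr></table>"
rows = html_table_to_rows(html)
print(rows)
```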

Batch Processing

from pathlib import Path
from unstructured.partition.auto import partition

docs_dir = Path("./documents")
all_elements = []

for file_path in docs_dir.rglob("*"):
    if file_path.suffix.lower() in [".pdf", ".docx", ".txt", ".md", ".html"]:
        try:
            elements = partition(filename=str(file_path))
            # Record the source path on every element
            for el in elements:
                el.metadata.filename = str(file_path)
            all_elements.extend(elements)
            print(f"OK: {file_path.name} ({len(elements)} elements)")
        except Exception as e:
            print(f"Error: {file_path.name}: {e}")
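
For larger batches it helps to separate file discovery and error bookkeeping from the parser itself, so failures can be reviewed or retried at the end. The following is a sketch under stated assumptions: `process_directory` is a hypothetical helper, and `parse` is any callable you supply that takes a path string, such as `lambda p: partition(filename=p)`.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".html"}


def process_directory(root, parse, suffixes=SUPPORTED):
    """Parse every supported file under root.

    parse: callable taking a path string and returning a list of elements.
    Returns (results, failures): {path: elements} and {path: error message}.
    """
    results, failures = {}, {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            results[str(path)] = parse(str(path))
        except Exception as exc:  # keep going; report the failure at the end
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

Passing the parser in as an argument also makes the loop easy to test with a fake parser, and easy to switch between strategies per run.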

Cleaning and Post-processing

from unstructured.cleaners.core import (
    clean_extra_whitespace,
    clean_non_ascii_chars,
    clean_bullets,
    remove_punctuation,
    replace_unicode_quotes,
)

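for element in elements:
    element.apply(clean_extra_whitespace)
    element.apply(replace_unicode_quotes)
    # Warning: clean_non_ascii_chars deletes every non-ASCII character, including
    # all Chinese text. Skip this line for non-English documents.
    element.apply(clean_non_ascii_chars)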
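
Because `element.apply` accepts any function that maps a string to a string, a custom cleaner can cover cases the built-in ones do not. The sketch below is a hypothetical example for Chinese text: `clean_cjk_spacing` removes the stray whitespace that OCR often leaves between Chinese characters while leaving spaces around Latin words alone. The character range covers only the basic CJK Unified Ideographs block, which is an assumption you may need to widen.

```python
import re

# Basic CJK Unified Ideographs block only; extend the range if needed.
_CJK = r"\u4e00-\u9fff"
_GAP_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def clean_cjk_spacing(text: str) -> str:
    """Drop whitespace between two CJK characters, then collapse other runs."""
    text = _GAP_BETWEEN_CJK.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_cjk_spacing("中 文 解析 test  ok "))

# With Unstructured installed, the cleaner can be applied like the built-ins:
# for element in elements:
#     element.apply(clean_cjk_spacing)
```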

Connectors (Data Sources)

# Legacy ingest API bundled with older unstructured releases; newer releases
# moved ingestion into the separate unstructured-ingest package.
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

# Batch-process a local directory
runner = LocalRunner(
    processor_config=ProcessorConfig(
        output_dir="./output",
        num_processes=4,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(),
    connector_config=SimpleLocalConfig(
        input_path="./documents",
        recursive=True,
    ),
)
runner.run()

# Connectors also exist for S3, GCS, Azure Blob, Confluence,
# Notion, Slack, Google Drive, SharePoint, and others

LangChain Integration

from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "document.pdf",
    mode="elements",  # return each element as a separate Document
    strategy="hi_res",
)
docs = loader.load()

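FAQ

Q: Poor results on Chinese PDFs?

Make sure tesseract-ocr and the Chinese language pack are installed (tesseract-ocr-chi-sim on Ubuntu/Debian)
Use strategy="hi_res" together with languages=["chi_sim"]
Scanned PDFs need OCR: use strategy="ocr_only"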

Q: Table extraction is inaccurate?

Set infer_table_structure=True
For complex tables, try the hi_res strategy
Consider preprocessing with a dedicated table extraction tool such as Camelot

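Q: Parsing is too slow?

The fast strategy skips layout detection and OCR, so it runs much faster than hi_res
For PDFs with simple, selectable text, fast is enough
For batch jobs, set num_processes to enable multiprocessing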
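
One way to act on the advice above is to run the cheap `fast` pass first and fall back to OCR only when it yields almost no text. The helper below is a hypothetical heuristic, not a library feature: `choose_strategy` and the threshold of 100 characters per page are assumptions to tune on your own documents.

```python
def choose_strategy(chars_extracted, page_count, min_chars_per_page=100):
    """Pick a partition strategy from the result of a cheap first pass.

    chars_extracted: total characters returned by a strategy="fast" run.
    Returns "fast" when the PDF seems to have a usable text layer,
    otherwise "ocr_only" for what is probably a scanned document.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    if chars_extracted / page_count >= min_chars_per_page:
        return "fast"
    return "ocr_only"


print(choose_strategy(chars_extracted=5000, page_count=10))
print(choose_strategy(chars_extracted=40, page_count=10))
```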

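Q: The output contains a lot of noise (headers, footers, and so on)?

Apply the cleaning functions, or filter elements by type before chunking:

from unstructured.documents.elements import Footer, Header

elements = [el for el in elements
            if not isinstance(el, (Header, Footer))]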
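
Type-based filtering only works when headers and footers were classified as such. When they come back as ordinary text, a frequency heuristic can catch them: a short line that repeats on many pages is probably page furniture. This is a sketch of that idea, with hypothetical helper names and thresholds, using `SimpleNamespace` stand-ins for elements that expose `.text` and `.metadata.page_number`.

```python
from collections import defaultdict
from types import SimpleNamespace


def repeated_texts(elements, min_pages=3, max_length=80):
    """Return short texts that appear on at least min_pages distinct pages."""
    pages_by_text = defaultdict(set)
    for el in elements:
        text = el.text.strip()
        if text and len(text) <= max_length:
            pages_by_text[text].add(el.metadata.page_number)
    return {t for t, pages in pages_by_text.items() if len(pages) >= min_pages}


def drop_repeated(elements, **kwargs):
    """Remove elements whose text was identified as repeated page furniture."""
    noise = repeated_texts(elements, **kwargs)
    return [el for el in elements if el.text.strip() not in noise]


def fake_element(text, page):
    """Build a stand-in object with the attributes the helpers read."""
    return SimpleNamespace(text=text, metadata=SimpleNamespace(page_number=page))


sample = []
for page in (1, 2, 3):
    sample.append(fake_element("ACME Confidential", page))
    sample.append(fake_element(f"Body {page}", page))

kept = drop_repeated(sample)
print([el.text for el in kept])
```

Review what the heuristic removes before trusting it: legitimately repeated short lines, such as a recurring table label, would be dropped as well.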

References

GitHub: https://github.com/Unstructured-IO/unstructured
Documentation: https://docs.unstructured.io/
Supported formats: https://docs.unstructured.io/open-source/core-functionality/partitioning
Hosted API: https://unstructured.io/
Cookbook: https://github.com/Unstructured-IO/unstructured/tree/main/examples