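Unstructured Document Parsing¶
Why Learn Unstructured¶
Unstructured is an open-source document preprocessing library that extracts clean text, tables, and metadata from unstructured documents (PDF, Word, PowerPoint, images, HTML, email, and more). It is a key step in building a RAG system: vector retrieval can only be as good as the document parsing that feeds it. Unstructured exposes a unified API for 20+ file formats and is widely used as a document-loader backend in frameworks such as LangChain and LlamaIndex.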
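Core Concepts¶
| Concept | In plain terms | Purpose |
|---|---|---|
| Partition | Partitioning (parsing) | Split a document into structured elements |
| Element | Element | The smallest parsed unit (title, paragraph, table, etc.) |
| Chunking | Splitting into chunks | Combine elements into text blocks sized for embedding |
| Connector | Data-source adapter | Fetch documents from various data sources |
| Staging | Staging | Convert parsed results into a specific target format |
| Cleaning | Cleanup | Remove noise (headers, footers, extra whitespace, etc.) |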
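To make the terms in the table concrete, here is a toy sketch in plain Python of how partitioning, cleaning, and chunking fit together. It is not the library's implementation: `ToyElement`, `toy_partition`, `toy_clean`, and `toy_chunk` are hypothetical names invented for illustration, and the rule "a line starting with `#` is a title" is an assumption of this toy only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyElement:
    """Hypothetical stand-in for a parsed element (not the library's class)."""
    category: str                      # e.g. "Title" or "NarrativeText"
    text: str
    metadata: dict = field(default_factory=dict)


def toy_partition(raw):
    """Partition: turn raw text into elements, one per non-empty line.

    Toy assumption: a line starting with '#' is a title.
    """
    elements = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            elements.append(ToyElement("Title", line.lstrip("# ")))
        else:
            elements.append(ToyElement("NarrativeText", line))
    return elements


def toy_clean(elements):
    """Cleaning: collapse runs of whitespace inside every element."""
    for el in elements:
        el.text = " ".join(el.text.split())
    return elements


def toy_chunk(elements, max_characters=80):
    """Chunking: start a new chunk at each title or when the size limit is hit."""
    chunks, current = [], ""
    for el in elements:
        starts_section = el.category == "Title"
        too_long = len(current) + len(el.text) + 1 > max_characters
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current} {el.text}".strip()
    if current:
        chunks.append(current)
    return chunks


raw = "# Intro\nUnstructured   parses documents.\n\n# Usage\nCall partition()."
chunks = toy_chunk(toy_clean(toy_partition(raw)))
print(chunks)
```

Each stage takes the previous stage's output, which is also how the real `partition`, cleaner, and chunking functions are meant to be combined.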
Installation and Configuration¶
Basic Installation¶
pip install unstructured
# Full install (support for all document formats)
pip install "unstructured[all-docs]"
# Install only the extras you need
pip install "unstructured[pdf]"   # PDF support
pip install "unstructured[docx]"  # Word support
pip install "unstructured[pptx]"  # PowerPoint support
pip install "unstructured[xlsx]"  # Excel support
pip install "unstructured[md]"    # Markdown support
System Dependencies¶
# Ubuntu/Debian
sudo apt-get install -y \
libmagic-dev poppler-utils tesseract-ocr \
libreoffice pandoc
# macOS
brew install libmagic poppler tesseract libreoffice pandoc
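After installing the packages above, it can help to confirm that the executables are actually discoverable before parsing anything. The sketch below uses only the standard library. The executable names (`tesseract`, `pdftoppm`, `pandoc`, `soffice`) are assumptions about what these packages typically place on `PATH`, and `check_system_tools` is a hypothetical helper, not part of Unstructured. libmagic is a shared library rather than an executable, so this check does not cover it.

```python
import shutil

# Assumed executable names for the system packages listed above.
REQUIRED_TOOLS = {
    "tesseract": "OCR (tesseract-ocr)",
    "pdftoppm": "PDF rendering (poppler-utils)",
    "pandoc": "format conversion (pandoc)",
    "soffice": "legacy Office formats (libreoffice)",
}


def check_system_tools(tools=REQUIRED_TOOLS):
    """Return {executable: bool} telling whether each one is found on PATH."""
    return {name: shutil.which(name) is not None for name in tools}


status = check_system_tools()
for name, found in status.items():
    label = "ok" if found else "MISSING"
    print(f"{label:8} {name}: {REQUIRED_TOOLS[name]}")
```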
Using Docker (Avoids Dependency Issues)¶
docker pull quay.io/unstructured-io/unstructured:latest
docker run -v $PWD/docs:/docs \
quay.io/unstructured-io/unstructured:latest \
--input-dir /docs --output-dir /docs/output
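Quick Start¶
Parsing a Document¶
from unstructured.partition.auto import partition

# Detect the file format automatically
elements = partition(filename="document.pdf")

# Inspect the parsed elements: type name plus the first 100 characters
for element in elements:
    print(f"[{type(element).__name__}] {str(element)[:100]}")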
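Printing every element gets noisy on long documents, so a per-type count is often a more useful first look. The following is a minimal sketch: `summarize_elements` is a hypothetical helper, and the two stand-in classes exist only so the example runs without a document. With the real library you would pass the list returned by `partition()` instead.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name, most common first."""
    return Counter(type(el).__name__ for el in elements).most_common()


# Stand-in classes for illustration only (not the library's element classes).
class Title:
    pass


class NarrativeText:
    pass


demo = [Title(), NarrativeText(), NarrativeText()]
summary = summarize_elements(demo)
for name, count in summary:
    print(f"{name:15} {count}")
```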
Format-Specific Partitioners¶
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.docx import partition_docx
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.email import partition_email

# PDF (high-resolution mode)
elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",             # high accuracy (uses OCR and layout analysis)
    infer_table_structure=True,    # extract table structure
    languages=["chi_sim", "eng"],  # OCR languages
)

# Word
elements = partition_docx(filename="report.docx")

# HTML
elements = partition_html(url="https://example.com/article")

# PowerPoint
elements = partition_pptx(filename="slides.pptx")

# Email
elements = partition_email(filename="message.eml")
Element Types¶
from unstructured.documents.elements import (
    Title, NarrativeText, Table, ListItem,
    Image, Header, Footer, FigureCaption,
)

# Branch on the element class
for element in elements:
    if isinstance(element, Title):
        print(f"Title: {element.text}")
    elif isinstance(element, NarrativeText):
        print(f"Body: {element.text[:50]}...")
    elif isinstance(element, Table):
        print(f"Table: {element.metadata.text_as_html}")
    elif isinstance(element, ListItem):
        print(f"List item: {element.text}")

# Metadata attached to every element
for element in elements:
    meta = element.metadata
    print(f"Page: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Coordinates: {meta.coordinates}")
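Because each element carries `metadata.page_number`, the flat element list can be regrouped by page, for example to show a citation's page in a RAG answer. This is a minimal sketch: `group_text_by_page` and `fake` are hypothetical helpers, and `SimpleNamespace` objects stand in for real elements so the example runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_page(elements):
    """Map page_number -> list of element texts.

    Elements without a page number are collected under the key None.
    """
    pages = defaultdict(list)
    for el in elements:
        page = getattr(el.metadata, "page_number", None)
        pages[page].append(el.text)
    return dict(pages)


def fake(text, page):
    """Build a stand-in object with the two attributes the helper reads."""
    return SimpleNamespace(text=text, metadata=SimpleNamespace(page_number=page))


demo = [fake("Intro", 1), fake("Body", 1), fake("Appendix", 2)]
pages = group_text_by_page(demo)
print(pages)
```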
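Advanced Usage¶
Text Chunking¶
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Chunk by title (keeps each section together)
chunks = chunk_by_title(
    elements,
    max_characters=1500,             # hard upper limit per chunk
    new_after_n_chars=1000,          # soft limit: start a new chunk after this length
    combine_text_under_n_chars=200,  # merge small sections below this length
    multipage_sections=True,         # allow a section to span pages
)

# Basic chunking (alternative: size-based, ignores section boundaries)
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"[{len(chunk.text)} chars] {chunk.text[:80]}...")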
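The interaction between `max_characters` and `overlap` is easier to see on a plain string. The sketch below is a simplified character-window splitter written for illustration, not the algorithm the library uses (the library works on elements and prefers natural boundaries). `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, max_characters=500, overlap=50):
    """Split text into windows of at most max_characters.

    Every window after the first begins with the last `overlap`
    characters of the previous window.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    step = max_characters - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


pieces = split_with_overlap("abcdefghijklmnopqrstuvwxyz", max_characters=10, overlap=3)
print(pieces)
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated text.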
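Advanced PDF Parsing¶
from io import StringIO

import pandas as pd
from unstructured.documents.elements import Table
from unstructured.partition.pdf import partition_pdf

# Choose the strategy and layout model explicitly
elements = partition_pdf(
    filename="complex.pdf",
    strategy="hi_res",                    # one of: fast / auto / hi_res / ocr_only
    hi_res_model_name="yolox",            # layout detection model (older releases: model_name)
    infer_table_structure=True,
    extract_images_in_pdf=True,           # extract images (newer releases prefer extract_image_block_types)
    extract_image_block_to_payload=True,  # return image data in the element payload
    languages=["chi_sim", "eng"],
)

# Convert extracted tables to DataFrames
for element in elements:
    if isinstance(element, Table):
        html = element.metadata.text_as_html
        if html:  # None when no table structure was inferred
            df = pd.read_html(StringIO(html))[0]  # needs an HTML parser such as lxml
            print(df)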
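If pandas and an HTML parser are not available, the `text_as_html` string can still be turned into rows with the standard library. This is a minimal sketch for simple tables only: `TableRows` and `html_table_to_rows` are hypothetical helpers, the sample HTML is made up, and nested tables, `rowspan`, and `colspan` are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Bolt</td><td>12</td></tr></table>"
rows = html_table_to_rows(sample)
print(rows)
```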
Batch Processing¶
from pathlib import Path
from unstructured.partition.auto import partition

docs_dir = Path("./documents")
all_elements = []

for file_path in docs_dir.rglob("*"):
    if file_path.suffix.lower() in [".pdf", ".docx", ".txt", ".md", ".html"]:
        try:
            elements = partition(filename=str(file_path))
            # Record the source path on every element
            for el in elements:
                el.metadata.filename = str(file_path)
            all_elements.extend(elements)
            print(f"OK: {file_path.name} ({len(elements)} elements)")
        except Exception as e:
            # Keep going: one bad file should not stop the batch
            print(f"Error: {file_path.name}: {e}")
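For larger batches it is often handy to separate file discovery from parsing and to collect failures instead of only printing them. The sketch below shows one way to structure that. `find_documents`, `parse_all`, and `fake_parse` are hypothetical names; `fake_parse` stands in for `partition(filename=...)` so the example runs without the library or real documents.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md", ".html"}


def find_documents(root, suffixes=SUPPORTED_SUFFIXES):
    """Return a sorted list of regular files under root with a wanted suffix."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


def parse_all(paths, parse):
    """Apply parse(str(path)) to each path; return (results, failures) dicts."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = parse(str(path))
        except Exception as exc:  # keep going and report at the end
            failures[path] = str(exc)
    return results, failures


def fake_parse(filename):
    """Stand-in parser for the demo: splits text into words, rejects empty files."""
    text = Path(filename).read_text(encoding="utf-8")
    if not text:
        raise ValueError("empty file")
    return text.split()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.TXT").write_text("hello world", encoding="utf-8")
    (root / "sub" / "b.md").write_text("", encoding="utf-8")
    (root / "skip.png").write_text("x", encoding="utf-8")
    results, failures = parse_all(find_documents(root), fake_parse)

ok_names = [p.name for p in results]
failed_names = [p.name for p in failures]
print("parsed:", ok_names, "failed:", failed_names)
```

Passing the parser in as an argument keeps the loop testable and lets you swap in a different partition call per file type.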
Cleaning and Post-Processing¶
from unstructured.cleaners.core import (
    clean_extra_whitespace,
    clean_non_ascii_chars,
    clean_bullets,
    remove_punctuation,
    replace_unicode_quotes,
)

for element in elements:
    element.apply(clean_extra_whitespace)
    element.apply(replace_unicode_quotes)
    # Caution: this removes every non-ASCII character, including Chinese text.
    # Skip it for non-English documents.
    element.apply(clean_non_ascii_chars)
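`element.apply` accepts any function that takes a string and returns a string, so a custom cleaner can be used when the built-in ones are too aggressive for CJK text. The following is a minimal sketch under that assumption. `normalize_text` is a hypothetical helper, and note that NFKC normalization also converts full-width letters and punctuation to their ASCII forms, which may or may not be what you want.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize to NFKC, drop control characters, and collapse whitespace.

    Unlike an ASCII-only filter, this keeps CJK characters intact.
    """
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters, but keep newline, tab, and space for now.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t "
    )
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


sample = "解析\u3000 结果\x00  ok"
cleaned = normalize_text(sample)
print(cleaned)
```

With real elements the call would be `element.apply(normalize_text)`.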
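Connectors (Data Sources)¶
# Note: these imports match the ingest API bundled with older unstructured releases.
# Newer releases ship ingest as the separate unstructured-ingest package with a different API.
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import LocalRunner

# Batch-process a local directory
runner = LocalRunner(
    processor_config=ProcessorConfig(
        output_dir="./output",
        num_processes=4,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(),
    connector_config=SimpleLocalConfig(
        input_path="./documents",
        recursive=True,
    ),
)
runner.run()

# Other supported sources include S3, GCS, Azure Blob, Confluence,
# Notion, Slack, Google Drive, SharePoint, and more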
LangChain Integration¶
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "document.pdf",
    mode="elements",   # return each element as a separate Document
    strategy="hi_res",
)
docs = loader.load()
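FAQ¶
Q: Poor results on Chinese PDFs?¶
- Make sure tesseract-ocr and the Chinese language pack are installed
- Use strategy="hi_res" together with languages=["chi_sim"]
- Scanned PDFs need OCR: strategy="ocr_only"
Q: Inaccurate table extraction?¶
- Set infer_table_structure=True
- For complex tables, try the hi_res strategy
- Consider preprocessing with a dedicated table-extraction tool such as Camelot
Q: Too slow?¶
- The fast strategy is much faster than hi_res
- fast is sufficient for simple text-based PDFs
- For batch jobs, enable multiprocessing with num_processes
Q: Lots of noise (headers, footers, etc.) in the output?¶
Apply the cleaning functions, or filter out unwanted elements before chunking.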
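One way to filter before chunking is to drop elements by class name. The sketch below is an illustration under stated assumptions: `drop_noise` and `NOISE_TYPES` are hypothetical names, the choice of `Header`, `Footer`, and `PageBreak` as noise is an assumption you should adjust to your documents, and the stand-in classes exist only so the example runs without the library.

```python
NOISE_TYPES = {"Header", "Footer", "PageBreak"}


def drop_noise(elements, noise_types=NOISE_TYPES, min_chars=1):
    """Keep elements whose class name is not noise and whose text is non-empty."""
    kept = []
    for el in elements:
        if type(el).__name__ in noise_types:
            continue
        if len(str(el).strip()) < min_chars:
            continue
        kept.append(el)
    return kept


# Stand-in classes for illustration only (not the library's element classes).
class _Base:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class Header(_Base):
    pass


class Footer(_Base):
    pass


class NarrativeText(_Base):
    pass


demo = [
    Header("ACME Corp"),
    NarrativeText("Real content."),
    NarrativeText("   "),
    Footer("Page 1"),
]
kept = drop_noise(demo)
kept_texts = [str(el) for el in kept]
print(kept_texts)
```

With real elements, the filtered list would then be passed to `chunk_by_title` or `chunk_elements`.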
References¶
- GitHub: https://github.com/Unstructured-IO/unstructured
- Documentation: https://docs.unstructured.io/
- Supported formats: https://docs.unstructured.io/open-source/core-functionality/partitioning
- API service: https://unstructured.io/
- Cookbook: https://github.com/Unstructured-IO/unstructured/tree/main/examples