Crawl4AI: An LLM-Friendly Framework for Intelligent Web Crawling

Why Learn Crawl4AI

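In the LLM era, high-quality data underpins AI applications. The raw HTML you get from traditional scraping tools such as Scrapy and BeautifulSoup is too noisy for an LLM: ads, navigation bars, and script tags are mixed in with the content, which wastes tokens and degrades comprehension.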

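Crawl4AI is designed specifically for LLM workloads. Its core idea is to convert web pages into clean, structured data that an LLM can consume directly. It removes noise, extracts the main content, converts it to Markdown, and can also use an LLM for structured extraction. For tasks such as RAG, knowledge base construction, data labeling, or competitive analysis, it can remove much of the data preprocessing work.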

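Typical use cases:
- Batch-crawling documentation and blogs for a RAG system
- Building an industry knowledge base
- Monitoring competitor information
- Collecting metadata for academic papers
- Serving as the web data source in an automated data pipeline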

Core Concepts

Plain-Language Explanations

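Concept	Plain-language description
AsyncWebCrawler	The main asynchronous crawler class; supports crawling multiple pages concurrently
CrawlResult	The crawl result object; holds the HTML, Markdown, extracted data, and more
Chunking Strategy	Splits long text into pieces sized for LLM processing
Extraction Strategy	A strategy for extracting structured data from a page (CSS selectors, LLM, JSON, etc.)
LLMExtractionStrategy	Uses an LLM to extract structured fields from page content
CosineStrategy	A semantic chunking strategy based on cosine similarity
Browser Config	Browser settings (headless mode, proxy, User-Agent, etc.)
Crawler Run Config	Settings for a single crawl (wait conditions, JS execution, screenshots, etc.)
Markdown Generator	Converts HTML into clean Markdown
Content Filter	Filters content, removing noise and keeping the main text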
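
To make the idea behind similarity-based chunking more concrete, here is a minimal sketch of cosine similarity over simple word-count vectors. It uses only the standard library, the function names are hypothetical, and it is not how CosineStrategy is implemented; real semantic strategies typically compare embedding vectors rather than raw word counts.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    va, vb = bag_of_words(a), bag_of_words(b)
    # Only words present in both texts contribute to the dot product
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("async web crawler", "web crawler config"))
```

Texts that share more vocabulary score closer to 1.0, so pieces of a page that score high against each other can be grouped into the same chunk.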

Architecture Overview

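User code
  │
  ▼
AsyncWebCrawler (browser management + concurrent scheduling)
  │
  ├── BrowserConfig (Chromium configuration)
  ├── CrawlerRunConfig (crawl parameters)
  │
  ▼
Page load → JS execution → content extraction → chunking → structured extraction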
  │
  ▼
CrawlResult (html / markdown / extracted_content / links / media)

Installation and Setup

System Requirements

Python 3.9+
Operating system: Linux / macOS / Windows (WSL recommended)

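Installation Steps

# Installing with pip is recommended
pip install crawl4ai

# After installation, set up the browser (downloads Chromium)
crawl4ai-setup

# If you need the LLM extraction strategy, also install the optional dependencies
# (quoted so that shells such as zsh do not try to expand the brackets)
pip install "crawl4ai[all]"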

Verifying the Installation

python -c "from crawl4ai import AsyncWebCrawler; print('Crawl4AI installed successfully')"

Docker (Optional)

docker pull unclecode/crawl4ai
docker run -p 11235:11235 unclecode/crawl4ai

Quick Start

Minimal Example: Crawl a Page and Convert It to Markdown

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.python.org/3/tutorial/index.html")

        # Inspect the clean Markdown output
        print(result.markdown[:500])

        # Inspect the extracted links
        print(f"Found {len(result.links['internal'])} internal links")

asyncio.run(main())

Crawling with Content Filtering

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Use a content filter to strip noise
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4)
    )

    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog/post",
            config=config
        )
        # fit_markdown is the filtered, condensed version
        print(result.markdown.fit_markdown)

asyncio.run(main())

Extracting Structured Data with CSS Selectors

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define the extraction schema
    schema = {
        "name": "Article list",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "date", "selector": ".date", "type": "text"},
            {"name": "summary", "selector": ".excerpt", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=config)
        data = json.loads(result.extracted_content)
        for item in data:
            print(f"Title: {item['title']}, Date: {item['date']}")

asyncio.run(main())
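
Attribute values such as `href` may come back exactly as written in the HTML, which often means relative paths. The sketch below shows one way to post-process the extracted items, assuming they decode to a list of dicts with the `link` field from the schema above. `absolutize_links` is a hypothetical helper rather than part of Crawl4AI, and the sample JSON merely stands in for `result.extracted_content`.

```python
import json
from urllib.parse import urljoin


def absolutize_links(items: list[dict], base_url: str, key: str = "link") -> list[dict]:
    """Return copies of the items with relative URLs under `key` resolved against base_url."""
    resolved = []
    for item in items:
        fixed = dict(item)  # copy so the input list is not mutated
        if fixed.get(key):
            fixed[key] = urljoin(base_url, fixed[key])
        resolved.append(fixed)
    return resolved


# Stand-in for result.extracted_content (a JSON string)
extracted_content = json.dumps([
    {"title": "First post", "link": "/blog/first-post"},
    {"title": "Second post", "link": "https://example.com/blog/second-post"},
    {"title": "No link"},
])

items = absolutize_links(json.loads(extracted_content), "https://example.com/blog")
for item in items:
    print(item["title"], "->", item.get("link"))
```

Already-absolute URLs pass through unchanged, and items without a link are left as they are.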

Advanced Usage

1. Intelligent Structured Extraction with an LLM

import asyncio
import json  # needed to decode result.extracted_content
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str
    description: str
    rating: float

async def main():
    # Note: depending on your Crawl4AI version, the provider and token may need
    # to be passed through an LLMConfig object instead; check the docs for your release.
    extraction = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        api_token="your-api-key",
        schema=Product.model_json_schema(),
        extraction_type="schema",  # extract according to the schema above
        instruction="Extract all product information from the page"
    )

    config = CrawlerRunConfig(extraction_strategy=extraction)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=config)
        products = json.loads(result.extracted_content)
        print(f"Extracted {len(products)} products")

asyncio.run(main())
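
LLM output can be malformed or missing fields, so it is usually worth validating before use. The following is a minimal standard-library sketch, assuming the output is a JSON list of records with the `Product` fields above. `parse_products` is a hypothetical helper, and the dataclass only stands in for the pydantic model so that the example has no third-party dependencies.

```python
import json
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: str
    description: str
    rating: float


def parse_products(raw_json: str) -> tuple[list[Product], list]:
    """Split decoded records into valid Product objects and rejected raw records."""
    valid, rejected = [], []
    try:
        records = json.loads(raw_json)
    except json.JSONDecodeError:
        return valid, rejected  # not JSON at all: nothing to keep
    if not isinstance(records, list):
        records = [records]
    for record in records:
        try:
            valid.append(Product(
                name=str(record["name"]),
                price=str(record["price"]),
                description=str(record.get("description", "")),
                rating=float(record["rating"]),
            ))
        except (KeyError, TypeError, ValueError, AttributeError):
            # Missing field, wrong shape, or a rating that is not a number
            rejected.append(record)
    return valid, rejected


raw = json.dumps([
    {"name": "Widget", "price": "$9.99", "description": "A widget", "rating": "4.5"},
    {"name": "Broken", "price": "$1.00", "rating": "n/a"},
    {"price": "$2.00", "rating": 3},
])
valid, rejected = parse_products(raw)
print(len(valid), "valid,", len(rejected), "rejected")
```

With pydantic available, validating each record against the model directly would serve the same purpose.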

2. Handling Dynamically Loaded Pages (JS Rendering)

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(
        headless=True,
        java_script_enabled=True
    )

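    run_config = CrawlerRunConfig(
        # Wait for a specific element to appear
        wait_for="css:.content-loaded",
        # Run custom JS
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        # Extra delay (in seconds) after the page loads, before the HTML is captured
        delay_before_return_html=2.0
    )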

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com/dynamic-page",
            config=run_config
        )
        print(result.markdown[:1000])

asyncio.run(main())

3. Concurrent Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    urls = [
        "https://docs.python.org/3/library/asyncio.html",
        "https://docs.python.org/3/library/typing.html",
        "https://docs.python.org/3/library/pathlib.html",
    ]

    # Use the CacheMode enum (not a plain string) to bypass the cache
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=config)

        for result in results:
            if result.success:
                print(f"✓ {result.url}: {len(result.markdown)} characters")
            else:
                print(f"✗ {result.url}: {result.error_message}")

asyncio.run(main())
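
`arun_many` takes care of scheduling for you. If you ever need to cap concurrency yourself around arbitrary coroutines, the general pattern looks roughly like the sketch below. `fake_fetch` is a purely hypothetical stand-in for a real crawl call, and the limit of 2 is an arbitrary example value.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Placeholder for a real crawl call; sleeps briefly and echoes the URL."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def fetch_all(urls: list[str], max_concurrency: int = 2) -> list[str]:
    """Run fake_fetch over all URLs with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # Each task waits here until a slot is free
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


pages = asyncio.run(fetch_all([f"https://example.com/page/{i}" for i in range(5)]))
print(len(pages))
```

A cap like this keeps memory and target-site load predictable when the URL list grows.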

4. Session Persistence and Multi-Step Crawling

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(session_id="my_session")

    async with AsyncWebCrawler() as crawler:
        # Step 1: log in on the login page
        login_config = CrawlerRunConfig(
            session_id="my_session",
            js_code="""
                document.querySelector('#username').value = 'user';
                document.querySelector('#password').value = 'pass';
                document.querySelector('#login-btn').click();
            """,
            wait_for="css:.dashboard"
        )
        await crawler.arun(url="https://example.com/login", config=login_config)

        # Step 2: crawl a page that requires login, reusing the same session
        result = await crawler.arun(
            url="https://example.com/dashboard/data",
            config=config
        )
        print(result.markdown[:500])

asyncio.run(main())

5. Using Crawl4AI in a RAG Pipeline

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking

async def build_knowledge_chunks(urls: list[str]):
    """Crawl several URLs and split the Markdown into chunks suitable for embedding."""
    chunker = OverlappingWindowChunking(
        window_size=500,
        overlap=50
    )

    all_chunks = []
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)

        for result in results:
            if result.success:
                # Chunk the Markdown explicitly; the result object is not
                # assumed to expose pre-computed chunks
                for chunk in chunker.chunk(str(result.markdown)):
                    all_chunks.append({
                        "text": chunk,
                        "source_url": result.url,
                        "title": (result.metadata or {}).get("title", "")
                    })

    return all_chunks

chunks = asyncio.run(build_knowledge_chunks(["https://docs.python.org/3/library/asyncio.html"]))
print(f"Built {len(chunks)} chunks")
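
To show what overlapping-window chunking does to a text, here is a small standalone sketch. It is an illustration under simple assumptions (whitespace-separated words, a fixed step of `window_size - overlap`) and not the library's own implementation; `overlapping_chunks` is a hypothetical name.

```python
def overlapping_chunks(text: str, window_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word windows; each window shares `overlap` words with the previous one."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    if not words:
        return []
    step = window_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(overlapping_chunks(sample, window_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The shared words at each boundary mean that a sentence cut by one window still appears intact in a neighboring chunk, which tends to help retrieval quality.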

6. Proxy and Anti-Scraping Configuration

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    proxy="http://user:pass@proxy-server:8080",
    headers={
        "Accept-Language": "zh-CN,zh;q=0.9",
    },
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

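FAQ

Q1: The crawl result is empty or incomplete

Cause: The page is rendered dynamically with JavaScript, and the content had not finished loading when the HTML was captured.

Fix: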

config = CrawlerRunConfig(
    wait_for="css:#main-content",  # wait for a key element
    delay_before_return_html=3.0,   # extra wait, in seconds
    js_code="window.scrollTo(0, document.body.scrollHeight);"  # trigger lazy loading
)

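Q2: How do I handle sites that require login?

Use session_id to keep session state: first run the JS that performs the login, then crawl the target page with the same session_id. See section 4 under Advanced Usage.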

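Q3: Crawling is slow

- Use arun_many to crawl concurrently
- Set cache_mode=CacheMode.ENABLED to reuse cached results
- Reduce any unnecessary delay_before_return_html
- If you do not need full JS rendering, use a lighter browser setup (for example, the text_mode or light_mode options of BrowserConfig, if your version provides them)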

Q4: The Markdown output is still noisy

Filter the content with PruningContentFilter or BM25ContentFilter:

from crawl4ai.content_filter_strategy import BM25ContentFilter

# Named bm25_filter to avoid shadowing Python's built-in filter()
bm25_filter = BM25ContentFilter(user_query="keywords for the topic you care about")
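
For intuition about query-based filtering, the sketch below scores text blocks against a query with the generic Okapi BM25 formula and shows that boilerplate blocks score zero. It is a standalone illustration, not Crawl4AI's internal code; the function names, the sample blocks, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, blocks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each text block against the query with the Okapi BM25 formula."""
    docs = [tokenize(block) for block in blocks]
    if not docs:
        return []
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Document frequency: number of blocks that contain each term
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by block length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


blocks = [
    "Subscribe to our newsletter and follow us on social media",
    "The asyncio library provides an event loop for asynchronous Python code",
    "Cookie policy and terms of service",
]
scores = bm25_scores("python asyncio event loop", blocks)
print(scores)
```

A filter built on this idea keeps only the blocks whose score exceeds some threshold, which is why a focused query matters.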

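Q5: How does it differ from Scrapy?

Dimension	Crawl4AI	Scrapy
Positioning	LLM data pipelines	General-purpose crawling framework
Output format	Markdown / structured JSON	Raw data
JS rendering	Built in (Playwright)	Requires extra setup (e.g. Splash)
LLM integration	Native support	None
Learning curve	Low	Moderate
Large-scale distributed crawling	Limited	Strong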

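Q6: How can I reduce the token cost of LLM extraction?

- Trim the content first with PruningContentFilter
- Use fit_markdown rather than the full markdown
- Choose a cheaper model (such as gpt-4o-mini)
- Where CSS selectors can do the job, skip the LLM entirely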
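
To get a feel for how much filtering can save, a back-of-the-envelope estimate like the one below can help. It assumes roughly four characters per token, which is only a rule of thumb for English text and varies by model and language; both function names are hypothetical. For real budgeting, use the tokenizer of the model you call.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def savings_ratio(raw_text: str, filtered_text: str) -> float:
    """Fraction of estimated tokens saved by sending filtered_text instead of raw_text."""
    raw = estimate_tokens(raw_text)
    if raw == 0:
        return 0.0
    return 1 - estimate_tokens(filtered_text) / raw


raw_markdown = "x" * 4000   # stand-in for the full Markdown of a page
fit_markdown = "x" * 1000   # stand-in for the filtered version
print(f"Estimated saving: {savings_ratio(raw_markdown, fit_markdown):.0%}")
```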

References

Resource	Link
GitHub repository	https://github.com/unclecode/crawl4ai
Official documentation	https://docs.crawl4ai.com
PyPI	https://pypi.org/project/crawl4ai
Example collection	https://github.com/unclecode/crawl4ai/tree/main/docs/examples
Docker Hub	https://hub.docker.com/r/unclecode/crawl4ai

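Summary: Crawl4AI fills the gap between traditional crawlers and LLM applications. It does not aim to replace Scrapy; it focuses narrowly on preparing web data for AI. If you are building an AI application that depends on web data, Crawl4AI is one of the first tools worth evaluating.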