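Embedchain Embedding Framework

Why Learn Embedchain

Embedchain (now part of the mem0 ecosystem) is a minimalist RAG framework that turns a wide range of data sources (URLs, PDFs, documents, videos, and more) into a queryable knowledge base in a few lines of code. It hides the complexity of document parsing, chunking, embedding, and retrieval so that developers can focus on application logic. For quickly building a document-based AI question-answering application, Embedchain probably needs less code than most alternatives.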

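Core Concepts

Concept	Plain-Language Explanation	Purpose
App	Application instance	A complete RAG application
Data Source	Source of input data	Accepts many input formats
Chunking	Text splitting	Automatically splits documents into small segments
Embedding	Vector representation	Converts text into vectors
Vector Store	Vector storage	Stores and retrieves vectors
LLM	Language model	Generates answers from the retrieved results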
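To make the concepts in the table concrete, here is a minimal, dependency-free sketch of the same flow: embed chunks, store them, retrieve the closest one, and build a prompt for the LLM. It is not how Embedchain is implemented. The bag-of-words `embed` function, the `ToyVectorStore` class, and the prompt format are all assumptions chosen for illustration; a real embedder uses a neural model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (real embedders use neural models)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector store: keeps (vector, chunk) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        """Return the k chunks most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = ToyVectorStore()
store.add([
    "Python was created by Guido van Rossum.",
    "Go was designed at Google.",
    "Rust focuses on memory safety.",
])
question = "Who created Python?"
top = store.search(question, k=1)
# The retrieved chunk becomes the context that the LLM answers from.
prompt = f"Context: {top[0]}\nQuestion: {question}"
```

In a real application each of these pieces is replaced by the component named in the table, and the prompt is sent to the LLM.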

Installation and Configuration

Installation

pip install embedchain

# With extra dependencies
pip install "embedchain[github,youtube,pdf]"

Basic Configuration

import os
os.environ["OPENAI_API_KEY"] = "sk-your-key"

from embedchain import App
app = App()

Using a Local Model

from embedchain import App

config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3",
            "base_url": "http://localhost:11434",
            "temperature": 0.5,
        }
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "base_url": "http://localhost:11434",
        }
    },
    "vectordb": {
        "provider": "chroma",
        "config": {
            "dir": "./chroma_db",
            "collection_name": "my-knowledge"
        }
    }
}

app = App.from_config(config=config)
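The LLM and the embedder in the configuration above point at the same Ollama server, so the URL is repeated. One way to avoid the two values drifting apart is to build the dictionary from a small function. `build_local_config` below is a hypothetical helper, not part of Embedchain; it only reproduces the dictionary shape shown above.

```python
def build_local_config(
    llm_model: str = "llama3",
    embed_model: str = "nomic-embed-text",
    base_url: str = "http://localhost:11434",
    db_dir: str = "./chroma_db",
    collection_name: str = "my-knowledge",
    temperature: float = 0.5,
) -> dict:
    """Hypothetical helper: build a config dict shaped like the example above."""
    return {
        "llm": {
            "provider": "ollama",
            "config": {"model": llm_model, "base_url": base_url, "temperature": temperature},
        },
        "embedder": {
            "provider": "ollama",
            "config": {"model": embed_model, "base_url": base_url},
        },
        "vectordb": {
            "provider": "chroma",
            "config": {"dir": db_dir, "collection_name": collection_name},
        },
    }


# Both providers now share one base_url value.
config = build_local_config(base_url="http://127.0.0.1:11434")
```

The resulting dictionary can be passed to `App.from_config(config=config)` in the same way as the hand-written one.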

Quick Start

Build a Knowledge Base in Three Lines of Code

from embedchain import App

app = App()

# Add data
app.add("https://en.wikipedia.org/wiki/Python_(programming_language)")
app.add("https://docs.python.org/3/tutorial/")

# Ask a question
answer = app.query("What are the main features of Python?")
print(answer)

Supported Data Sources

# Web page
app.add("https://example.com/article")

# PDF (local path or URL)
app.add("/path/to/document.pdf", data_type="pdf_file")
app.add("https://example.com/paper.pdf", data_type="pdf_file")

# YouTube video (extracts the transcript)
app.add("https://www.youtube.com/watch?v=xxx", data_type="youtube_video")

# Plain text
app.add("This is a piece of knowledge entered directly as text", data_type="text")

# Markdown
app.add("/path/to/readme.md", data_type="mdx")

# CSV
app.add("/path/to/data.csv", data_type="csv")

# JSON
app.add("/path/to/data.json", data_type="json")

# GitHub repository
app.add("https://github.com/user/repo", data_type="github")

# Notion
app.add("notion_page_id", data_type="notion")

# Docx
app.add("/path/to/document.docx", data_type="docx")
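When sources come from user input, it can help to pick the `data_type` value from the path or URL before calling `app.add`. The sketch below is an illustration only: `guess_data_type` and its extension table are hypothetical, the mapping simply mirrors the values used in the examples above, and the `web_page` fallback for plain URLs is an assumption about the type name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-type table mirroring the data_type values used above.
EXTENSION_TO_TYPE = {
    ".pdf": "pdf_file",
    ".md": "mdx",
    ".mdx": "mdx",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
}


def guess_data_type(source: str) -> str:
    """Hypothetical helper: choose a data_type string from a path or URL."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    if is_url and parsed.netloc.endswith("youtube.com"):
        return "youtube_video"
    # For URLs only the path part carries the file extension.
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in EXTENSION_TO_TYPE:
        return EXTENSION_TO_TYPE[suffix]
    # Assumed fallbacks: generic web page for URLs, raw text otherwise.
    return "web_page" if is_url else "text"
```

A call such as `app.add(source, data_type=guess_data_type(source))` would then cover the file-based cases; sources like GitHub or Notion still need an explicit type.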

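Chat Mode

# Single-turn question answering
answer = app.query("What is RAG?")

# Conversation with chat history
answer = app.chat("Hello, please introduce yourself")
answer = app.chat("What did you just say?")  # context is retained

# Streaming output
for chunk in app.chat("Explain quantum computing", stream=True):
    print(chunk, end="", flush=True)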
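The difference between `query` and `chat` is that chat carries earlier turns into the next prompt. The sketch below shows one simple way such memory can work. `ChatSession`, its `max_turns` limit, and the prompt layout are hypothetical and are not Embedchain's actual implementation.

```python
class ChatSession:
    """Toy chat memory (hypothetical; not Embedchain's actual implementation)."""

    def __init__(self, max_turns: int = 5) -> None:
        self.max_turns = max_turns
        self.history: list[tuple[str, str]] = []  # (question, answer) pairs

    def build_prompt(self, question: str, context: str) -> str:
        """Combine retrieved context, earlier turns, and the new question."""
        lines = [f"Context: {context}"]
        for past_question, past_answer in self.history:
            lines.append(f"User: {past_question}")
            lines.append(f"Assistant: {past_answer}")
        lines.append(f"User: {question}")
        return "\n".join(lines)

    def record(self, question: str, answer: str) -> None:
        """Store a finished turn, keeping only the most recent max_turns."""
        self.history.append((question, answer))
        self.history = self.history[-self.max_turns:]


session = ChatSession(max_turns=2)
session.record("Hello, please introduce yourself", "I am a document assistant.")
# The second prompt includes the first exchange, so a follow-up can refer to it.
prompt = session.build_prompt("What did you just say?", context="(retrieved chunks)")
```

Capping the history keeps the prompt within the model's context window at the cost of forgetting older turns.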

Advanced Usage

Custom Configuration

# config.yaml
app:
  config:
    name: "my-rag-app"
    collect_metrics: false

llm:
  provider: openai
  config:
    model: gpt-4
    temperature: 0.3
    max_tokens: 1000
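    system_prompt: |
      You are a professional technical consultant. Answer questions based on the provided context.
      If the context contains no relevant information, say "I could not find any relevant information".
      Keep answers concise and accurate.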

embedder:
  provider: openai
  config:
    model: text-embedding-3-small

chunker:
  config:
    chunk_size: 500
    chunk_overlap: 50

vectordb:
  provider: chroma
  config:
    collection_name: tech-docs
    dir: ./vector_store

# Python: load the YAML configuration above
app = App.from_config(yaml_path="config.yaml")
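The `chunk_size` and `chunk_overlap` settings are easier to reason about with a worked example. The sketch below is a plain character-based sliding window and is an assumption about the general technique, not Embedchain's chunker; real text splitters usually also try to break at paragraph or sentence boundaries. `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 1200 characters with the settings above: windows start at 0, 450, and 900.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, chunk_overlap=50)
lengths = [len(chunk) for chunk in chunks]
```

With these settings each window advances 450 characters, so a larger overlap produces more chunks, and therefore more embedding calls, for the same document.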

Metadata and Filtering

# Attach metadata when adding a source
app.add("https://docs.python.org/3/",
        metadata={"category": "python", "version": "3.12"})

app.add("https://go.dev/doc/",
        metadata={"category": "golang", "version": "1.22"})

# Filter at query time
answer = app.query(
    "How do I handle errors?",
    where={"category": "python"}  # search only the Python docs
)
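Conceptually, a `where` filter restricts retrieval to chunks whose metadata matches every given key. The sketch below shows that exact-match semantics on plain dictionaries. It is an assumption about the simplest case only; the vector store may support richer operators, and `matches` and `filter_chunks` are hypothetical names.

```python
from typing import Optional


def matches(metadata: dict, where: dict) -> bool:
    """True when every key in `where` is present in the metadata with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in where.items())


def filter_chunks(chunks: list[dict], where: Optional[dict] = None) -> list[dict]:
    """Keep only chunks whose metadata satisfies the filter; no filter keeps everything."""
    if not where:
        return list(chunks)
    return [chunk for chunk in chunks if matches(chunk["metadata"], where)]


chunks = [
    {"text": "Use try/except to handle exceptions.",
     "metadata": {"category": "python", "version": "3.12"}},
    {"text": "Check the returned error value.",
     "metadata": {"category": "golang", "version": "1.22"}},
]
# Only the Python chunk survives, so similarity search never sees the Go chunk.
python_only = filter_chunks(chunks, where={"category": "python"})
```

Filtering before ranking is what keeps an answer about Python error handling from being built on Go documentation.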

Deploying as an API

from embedchain import App
from fastapi import FastAPI

fast_app = FastAPI()
ec_app = App()

@fast_app.post("/add")
def add_source(url: str):
    ec_app.add(url)
    return {"status": "added"}

@fast_app.post("/query")
def query(question: str):
    answer = ec_app.query(question)
    return {"answer": answer}

@fast_app.post("/chat")
def chat(message: str):
    answer = ec_app.chat(message)
    return {"answer": answer}

Custom Data Loader

from embedchain.loaders.base_loader import BaseLoader

class CustomAPILoader(BaseLoader):
    def load_data(self, url):
        """Load data from a custom API."""
        import requests
        response = requests.get(url)
        data = response.json()

        documents = []
        for item in data["results"]:
            documents.append({
                "content": item["text"],
                "meta_data": {
                    "url": url,
                    "id": item["id"]
                }
            })
        return {"doc_id": url, "data": documents}

# Register the custom loader
app.add("https://api.example.com/data", 
        loader=CustomAPILoader())

Evaluating RAG Quality

from embedchain.evaluation import Evaluation

# Prepare test data
test_data = [
    {
        "question": "In what year was Python released?",
        "ground_truth": "1991",
    },
    {
        "question": "Who created Python?",
        "ground_truth": "Guido van Rossum",
    }
]

# Run the evaluation
evaluator = Evaluation(app)
results = evaluator.evaluate(test_data)
print(f"Accuracy: {results['accuracy']}")
print(f"Average relevancy: {results['relevancy']}")
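To show what an accuracy figure over question and ground-truth pairs can mean, here is a minimal, library-independent sketch. It counts an answer as correct when it contains the ground-truth string, which is a deliberately crude assumption and not how Embedchain scores answers. `containment_accuracy` and the canned answers are hypothetical; the canned answers stand in for real `app.query` calls so the example runs offline.

```python
from typing import Callable


def containment_accuracy(answer_fn: Callable[[str], str], test_data: list[dict]) -> float:
    """Fraction of cases whose answer contains the ground truth (case-insensitive)."""
    if not test_data:
        return 0.0
    hits = sum(
        1
        for case in test_data
        if case["ground_truth"].lower() in answer_fn(case["question"]).lower()
    )
    return hits / len(test_data)


# Canned answers replace a real app.query call for this offline sketch.
canned = {
    "In what year was Python released?": "Python was first released in 1991.",
    "Who created Python?": "It was created by a Dutch programmer.",
}
test_data = [
    {"question": "In what year was Python released?", "ground_truth": "1991"},
    {"question": "Who created Python?", "ground_truth": "Guido van Rossum"},
]
accuracy = containment_accuracy(lambda question: canned[question], test_data)
```

With a real application, `answer_fn` would be `app.query`. String containment misses correct paraphrases, which is why practical evaluations often add a relevancy or groundedness score.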

FAQ

Q: How do I handle Chinese documents?

Make sure to use a model that supports Chinese:

config = {
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "BAAI/bge-base-zh-v1.5"}
    }
}

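Q: Where is the vector store data persisted?

Chroma is the default vector store, and its data is stored in the ./db directory. Set the dir parameter to change the path. The data is loaded automatically when the application restarts.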

Q: Why is adding a large number of documents slow?

Add documents in batches
Reduce chunk_overlap
Use a faster embedding model (such as text-embedding-3-small)
Preprocess large files first
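The first tip does not say how batching is done, so the following is only one possible reading: group the sources into fixed-size batches, process them batch by batch, and collect failures instead of stopping at the first error. `batched` and `add_in_batches` are hypothetical helpers, and `add_fn` stands in for a call such as `app.add`. Grouping by itself does not make each call faster; it bounds the work per step and makes a long ingestion resumable.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def add_in_batches(
    add_fn: Callable[[str], object], sources: Iterable[str], size: int = 10
) -> list[tuple[str, str]]:
    """Call add_fn on every source, batch by batch; return (source, error) pairs for failures."""
    failures = []
    for batch in batched(sources, size):
        for source in batch:
            try:
                add_fn(source)
            except Exception as exc:  # keep going so one bad source does not stop the run
                failures.append((source, str(exc)))
    return failures
```

A batch boundary is a natural place to log progress or pause to respect an embedding provider's rate limit.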

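Q: How does it differ from LangChain?

Embedchain: a minimal API that completes a RAG pipeline in a few lines of code; suited to rapid prototyping
LangChain: flexible but more complex; suited to scenarios that need fine-grained control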

References

GitHub: https://github.com/mem0ai/embedchain (now part of mem0)
Documentation: https://docs.embedchain.ai/
Data sources reference: https://docs.embedchain.ai/data-sources
Configuration reference: https://docs.embedchain.ai/configuration
Examples: https://docs.embedchain.ai/examples