610 Elasticsearch Full-Text Search

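In one sentence: Elasticsearch (ES for short) is a distributed search engine built on Lucene. Its core strengths are full-text search and near-real-time data analysis, and it is widely used for e-commerce search, log analysis, and application monitoring.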

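Key Facts at a Glance

Topic	Details
Version covered	8.19 (8.x line, released in 2025; a 9.x line also exists)
Core principle	Inverted index
Scoring algorithm	BM25 (a refinement of TF-IDF)
API protocol	RESTful JSON
Ecosystem tools	Kibana (visualization), Logstash (ingestion), Beats (lightweight shippers)
Typical use cases	Full-text search, log analysis, APM monitoring, security analytics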
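
To make the "inverted index" entry concrete, here is a minimal sketch in plain Python. It is a toy, not how Lucene stores its index: the whitespace tokenizer and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents
docs = {
    1: "Elasticsearch is a distributed search engine",
    2: "Redis is an in-memory database",
    3: "Lucene powers the Elasticsearch search engine",
}
index = build_inverted_index(docs)
print(index["search"])   # IDs of documents containing "search"
print(index["is"])       # IDs of documents containing "is"
```

A lookup goes from term to documents directly, which is why searching does not require scanning every document.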

1. Installation and Configuration

1.1 Docker Installation (Recommended)

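# Single-node Docker deployment (development only).
# Flags, in order: container name; REST API port (9200); inter-node transport port (9300);
# single-node discovery; security disabled (development only); JVM heap size;
# named volume for data persistence.
# A comment after a trailing backslash breaks the line continuation, so the notes live up here.
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  -v es_data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:8.19.0

# Verify: should return a JSON document with cluster information
curl http://localhost:9200

# Start Kibana (web UI) alongside it.
# Port 5601 serves the Kibana UI; --link makes the ES container reachable as "elasticsearch".
docker run -d \
  --name kibana \
  -p 5601:5601 \
  -e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" \
  --link elasticsearch \
  docker.elastic.co/kibana/kibana:8.19.0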
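
The node usually needs some seconds after `docker run` before port 9200 answers, so the `curl` check can fail if run too early. The following is a small polling sketch using only the standard library; the helper names `http_probe` and `wait_until_ready` are made up for this example, and the URL assumes the port mapping above.

```python
import json
import time
import urllib.request

def http_probe(url="http://localhost:9200"):
    """Return the parsed cluster-info JSON, or None if the node is not answering yet."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)
    except OSError:   # connection refused, timeout, HTTP error, ...
        return None

def wait_until_ready(probe=http_probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns a result; give up after `attempts` tries."""
    for _ in range(attempts):
        info = probe()
        if info is not None:
            return info
        sleep(delay)
    raise TimeoutError("Elasticsearch did not become reachable in time")
```

Calling `wait_until_ready()` with no arguments polls the local node once per second for up to 30 attempts; `probe` and `sleep` are injectable so the loop can be exercised without a running container.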

2. Basic Usage

2.1 Index Operations

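# Create an index (roughly the counterpart of a database "table").
#   settings: number_of_shards   = how many pieces the data is split into
#             number_of_replicas = copies of each shard (0 for development)
#   mappings: field definitions, like column definitions in CREATE TABLE
#             text    = analyzed (tokenized), for full-text search; title uses the standard analyzer
#             keyword = not analyzed, for exact matching; also used for array fields such as tags
# The request body must be plain JSON without # comments, so the notes are kept out here.
curl -X PUT "localhost:9200/articles" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "content": { "type": "text" },
      "author": {
        "type": "keyword"
      },
      "publish_date": { "type": "date" },
      "views": { "type": "integer" },
      "tags": { "type": "keyword" }
    }
  }
}'

# List all indices
curl "localhost:9200/_cat/indices?v"

# Delete the index
curl -X DELETE "localhost:9200/articles"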
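
Since the request body has to be strict JSON, one way to keep explanatory notes next to the fields is to build the body in Python and serialize it. This is only a sketch of that idea; the function name `article_index_body` is hypothetical, and the fields mirror the `articles` index used in this section.

```python
import json

def article_index_body(shards=1, replicas=0):
    """Build the settings and mappings for the articles index as a plain dict."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "standard"},  # analyzed
                "content": {"type": "text"},                        # analyzed
                "author": {"type": "keyword"},                      # exact match
                "publish_date": {"type": "date"},
                "views": {"type": "integer"},
                "tags": {"type": "keyword"},                        # array of exact values
            }
        },
    }

body = json.dumps(article_index_body(), indent=2)
print(body)   # strict JSON, usable as the body of PUT /articles
```

The Python comments never reach the server, because `json.dumps` emits only the data.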

2.2 Document Operations (CRUD)

# Index a document with an explicit ID
curl -X PUT "localhost:9200/articles/_doc/1" -H 'Content-Type: application/json' -d'
{
  "title": "Elasticsearch入门教程",
  "content": "Elasticsearch是一个分布式搜索引擎，支持全文搜索和实时分析",
  "author": "张三",
  "publish_date": "2024-01-15",
  "views": 1500,
  "tags": ["搜索", "数据库", "教程"]
}'

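# Bulk insert (_bulk API; faster than one request per document).
# The body is newline-delimited JSON: an action line, then the document, and it must end with a newline.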
curl -X POST "localhost:9200/articles/_bulk" -H 'Content-Type: application/json' -d'
{"index": {"_id": "2"}}
{"title": "Redis缓存实战", "content": "Redis是内存数据库", "author": "李四", "views": 800, "tags": ["缓存"]}
{"index": {"_id": "3"}}
{"title": "Python数据分析", "content": "使用Pandas进行数据处理", "author": "王五", "views": 2000, "tags": ["Python"]}
'
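
The bulk body format is easy to get wrong by hand, so here is a minimal sketch that generates it from a dictionary. The helper name `to_bulk_body` is an assumption for this example; it covers only the `index` action shown above.

```python
import json

def to_bulk_body(docs):
    """Turn {doc_id: source} into a newline-delimited JSON string for the _bulk API."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))   # action line
        lines.append(json.dumps(source, ensure_ascii=False))        # document line
    return "\n".join(lines) + "\n"   # the trailing newline is required

payload = to_bulk_body({
    2: {"title": "Redis缓存实战", "author": "李四", "views": 800},
    3: {"title": "Python数据分析", "author": "王五", "views": 2000},
})
print(payload)
```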

# Get a document by ID
curl "localhost:9200/articles/_doc/1"

# Partially update a document
curl -X POST "localhost:9200/articles/_update/1" -H 'Content-Type: application/json' -d'
{"doc": {"views": 1600}}'

# Delete a document
curl -X DELETE "localhost:9200/articles/_doc/1"

2.3 Search

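# Full-text search (works like a search engine query).
# match: the query text is analyzed first; by default a document matches if it contains
# any of the resulting terms. With the standard analyzer, Chinese text is split into
# single characters, so this matches on individual characters (see 5.3).
curl "localhost:9200/articles/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "content": "数据库 搜索"
    }
  }
}'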
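
Hits from a `match` query are ranked by a BM25 relevance score. The sketch below computes a textbook form of BM25 over pre-tokenized documents, using the commonly cited defaults k1 = 1.2 and b = 0.75. It is meant to show the shape of the formula; Lucene's implementation differs in details, so do not expect identical numbers. The corpus is invented for illustration.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query terms (textbook BM25)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # documents containing the term
        tf = doc.count(term)                       # occurrences in this document
        if df == 0 or tf == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # rarer terms weigh more
        length_norm = 1 - b + b * len(doc) / avg_len           # longer documents are penalized
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical tokenized corpus
corpus = [
    ["redis", "is", "a", "database"],
    ["search", "engine", "and", "database"],
    ["python", "data", "analysis"],
]
scores = [bm25_score(["search", "database"], doc, corpus) for doc in corpus]
print(scores)   # the second document matches both terms and scores highest
```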

# Exact match on a keyword field.
# term: the value is compared as-is, without analysis.
curl "localhost:9200/articles/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "author": "张三" }
  }
}'

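# Boolean query: combine several conditions, similar to AND/OR/NOT in SQL.
#   must     = must match and contributes to the score (AND)
#   filter   = must match but does not affect the score, so it is faster (here: views >= 1000)
#   should   = optional; a match raises the score
#   must_not = must not match (NOT)
# sort: by views, descending. from/size: paging, 10 hits starting at offset 0.
# _source: return only the listed fields. highlight: mark matched terms in content.
curl "localhost:9200/articles/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "数据库" } }
      ],
      "filter": [
        { "range": { "views": { "gte": 1000 } } }
      ],
      "should": [
        { "match": { "tags": "教程" } }
      ],
      "must_not": [
        { "term": { "author": "测试用户" } }
      ]
    }
  },
  "sort": [{ "views": "desc" }],
  "from": 0, "size": 10,
  "_source": ["title", "author", "views"],
  "highlight": {
    "fields": { "content": {} }
  }
}'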
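
The `from`/`size` arithmetic is a common source of off-by-one errors. Below is a small sketch that builds a bool search body for a 1-based page number; the function name `paged_bool_query` and its parameters are assumptions for this example, and only the `must` and `filter` clauses are included to keep it short.

```python
def paged_bool_query(text, min_views, page=1, page_size=10):
    """Build a bool search body; `page` is 1-based."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],                 # scored
                "filter": [{"range": {"views": {"gte": min_views}}}],   # not scored
            }
        },
        "sort": [{"views": "desc"}],
        "from": (page - 1) * page_size,   # offset of the first hit on this page
        "size": page_size,
    }

body = paged_bool_query("数据库", min_views=1000, page=3)
print(body["from"], body["size"])
```

By default `from` + `size` may not exceed 10,000 (the `index.max_result_window` setting), so deep paging needs a different approach such as `search_after`.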

3. Aggregations

# Aggregations: roughly GROUP BY plus aggregate functions in SQL.
# "size": 0 suppresses the hits so that only aggregation results are returned.
#   authors_count    (the name is arbitrary): terms aggregation on author, top 10 buckets,
#                    with nested avg_views and max_views computed per author
#   views_histogram  distribution of views in buckets of width 500
#   monthly_articles number of articles per calendar month
curl "localhost:9200/articles/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "authors_count": {
      "terms": {
        "field": "author",
        "size": 10
      },
      "aggs": {
        "avg_views": {
          "avg": { "field": "views" }
        },
        "max_views": {
          "max": { "field": "views" }
        }
      }
    },
    "views_histogram": {
      "histogram": {
        "field": "views",
        "interval": 500
      }
    },
    "monthly_articles": {
      "date_histogram": {
        "field": "publish_date",
        "calendar_interval": "month"
      }
    }
  }
}'
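
To show what the terms and histogram aggregations compute, here is a plain-Python emulation over a few in-memory documents. It is a conceptual sketch only (Elasticsearch computes these per shard and merges the results); the function names and sample data are made up for illustration.

```python
from collections import defaultdict

def terms_with_stats(docs, group_field, value_field):
    """Group documents by one field and compute count, average and maximum of another."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[value_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for key, vals in groups.items()
    ]
    # Largest buckets first, ties broken by key
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))

def histogram_key(value, interval):
    """Lower bound of the fixed-width bucket that `value` falls into."""
    return (value // interval) * interval

# Hypothetical sample documents
docs = [
    {"author": "张三", "views": 1500},
    {"author": "张三", "views": 5000},
    {"author": "李四", "views": 800},
]
buckets = terms_with_stats(docs, "author", "views")
print(buckets[0])
print(histogram_key(800, 500), histogram_key(1500, 500))
```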

4. Python Client

# pip install elasticsearch
from elasticsearch import Elasticsearch  # official Python client

# Connect
es = Elasticsearch("http://localhost:9200")

# Check the connection
print(es.info())                       # prints cluster information

# Index a document
es.index(index="articles", id="1", document={
    "title": "Python入门",
    "content": "Python是一种流行的编程语言",
    "author": "张三",
    "views": 5000
})

# Make the new document searchable right away
# (otherwise it only shows up after the next periodic refresh)
es.indices.refresh(index="articles")

# Search
result = es.search(index="articles", query={
    "match": {"content": "Python"}
})
for hit in result["hits"]["hits"]:     # iterate over the hits
    print(f"Title: {hit['_source']['title']}, score: {hit['_score']}")

# Aggregate
result = es.search(index="articles", size=0, aggs={
    "top_authors": {
        "terms": {"field": "author", "size": 5}
    }
})
for bucket in result["aggregations"]["top_authors"]["buckets"]:
    print(f"{bucket['key']}: {bucket['doc_count']} article(s)")

es.close()
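
The search result is a nested structure, and it helps to see its shape without a running cluster. The sketch below pulls the total and a few fields out of a response-shaped dictionary. `sample_response` is hand-written and abbreviated for illustration, not captured output, and `summarize_hits` is a hypothetical helper.

```python
def summarize_hits(response):
    """Return (total, [(id, score, title), ...]) from a search response."""
    hits = response["hits"]
    total = hits["total"]["value"]
    rows = [(h["_id"], h["_score"], h["_source"].get("title")) for h in hits["hits"]]
    return total, rows

# Hand-written, abbreviated response for illustration only
sample_response = {
    "took": 3,
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 0.29,
        "hits": [
            {"_index": "articles", "_id": "1", "_score": 0.29,
             "_source": {"title": "Python入门", "views": 5000}},
        ],
    },
}
total, rows = summarize_hits(sample_response)
print(total, rows)
```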

5. Common Errors and Fixes

5.1 Unassigned Shards

status: "yellow" (unassigned shards)

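Cause: replica shards cannot be allocated. A replica is never placed on the same node as its primary, so a single-node cluster has nowhere to put it.
Fix: PUT /index/_settings {"number_of_replicas": 0}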
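
The health colors follow a simple rule, sketched here as a simplified model (the function name and the boolean-list representation of shard state are assumptions for this example).

```python
def cluster_status(primaries_assigned, replicas_assigned):
    """Derive a health color from per-shard assignment flags (simplified model)."""
    if not all(primaries_assigned):
        return "red"      # at least one primary shard is missing
    if not all(replicas_assigned):
        return "yellow"   # data is complete, but some replica is unassigned
    return "green"

print(cluster_status([True], [False]))   # one primary, one replica with no node to live on
print(cluster_status([True], []))        # same index after setting number_of_replicas to 0
```

With zero replicas there is nothing left to be unassigned, which is why the setting change turns a single-node cluster green.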

5.2 Insufficient Disk Space

flood stage disk watermark exceeded

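Cause: disk usage on a node crossed the flood-stage watermark, so Elasticsearch blocks writes to the indices that have shards on that node.
Fix: delete old indices (DELETE /old-index-*) or add disk space. In 8.x, wildcard deletes are rejected by default because action.destructive_requires_name is true, so either name the indices explicitly or change that setting first.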
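
Elasticsearch uses three disk watermarks, with defaults of 85% (low), 90% (high) and 95% (flood stage). The sketch below classifies a usage figure against those defaults; the function name is hypothetical and the comparison is simplified.

```python
def watermark_state(used_bytes, total_bytes, low=0.85, high=0.90, flood=0.95):
    """Classify disk usage against the default watermark thresholds."""
    used = used_bytes / total_bytes
    if used >= flood:
        return "flood_stage"   # writes to affected indices are blocked
    if used >= high:
        return "high"          # shards are moved away from this node
    if used >= low:
        return "low"           # no new shards are allocated to this node
    return "ok"

for used in (50, 86, 91, 96):
    print(used, watermark_state(used, 100))
```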

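5.3 Inaccurate Chinese Search

Cause: the default standard analyzer splits Chinese text into single characters instead of words.
Fix: install the IK analysis plugin with elasticsearch-plugin install, then use the ik_max_word or ik_smart analyzer on text fields. IK is a community plugin, so the install command takes the URL of a release built for your exact Elasticsearch version rather than the bare name analysis-ik.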
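
The effect of character-level tokenization can be seen with a rough approximation in Python. This is not the real analyzer: it keeps only Han characters and ignores Latin words, which is enough to show why unrelated documents match.

```python
def standard_like_tokens(text):
    """Approximate the standard analyzer on Chinese: one token per Han character."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

query_tokens = set(standard_like_tokens("数据库"))                   # "database"
doc_tokens = set(standard_like_tokens("使用Pandas进行数据处理"))      # about data processing
overlap = query_tokens & doc_tokens
print(overlap)   # shared characters cause a match although the word "数据库" is absent
```

A word-level tokenizer would emit "数据库" as a single term, so the document above would no longer match that query.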

6. Cheat Sheet

Operation	API
Cluster health	`GET /_cluster/health`
List indices	`GET /_cat/indices?v`
Create index	`PUT /index`
Delete index	`DELETE /index`
Index a document	`PUT /index/_doc/id`
Search	`GET /index/_search`
Aggregate	`GET /index/_search` + `aggs`
Bulk operations	`POST /_bulk`
View mapping	`GET /index/_mapping`

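7. Comparison with Similar Tools

Feature	Elasticsearch	Meilisearch	Typesense	PostgreSQL FTS
Full-text search	Very strong	Strong	Strong	Basic
Real-time analytics	Strong	Weak	Weak	Moderate
Deployment complexity	High	Low	Low	None (built in)
Resource usage	High (JVM)	Low	Low	Low
Chinese support	Plugin	Built in	Built in	Needs configuration
License	SSPL / ELv2 / AGPLv3	MIT	GPLv3	PostgreSQL License

Selection advice: choose ES for enterprise-grade search plus analytics; choose Meilisearch for simple search (it is lighter); for projects already on PostgreSQL, try the built-in FTS first.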
