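I'm trying to understand the concept of an Elasticsearch index and I don't quite get it. A few points up front: I understand how an inverted index works (mapping terms to document IDs), and I understand how document ranking based on TF-IDF works. What I don't understand is the data structure of the actual index. The Elasticsearch documentation describes the index as a "table with a mapping to the documents".

So, here comes the sharding! When you look at a typical picture of an Elasticsearch index, it is represented as follows:

[image]

What the picture doesn't show is how the actual partitioning happens and how this [table -> document] link is split across multiple shards. For example, does each shard split the table vertically, meaning that the inverted index table only contains the terms present on that shard?

Say we have 3 shards, so the first shard contains document1, the second shard only document2, and the third shard document3. Does the first shard's index then contain only the terms that occur in document1, in this case [blue, bright, butterfly, breeze, hangs]? If so, what happens when someone searches for [forget]? How does Elasticsearch "know" not to search shard 1, or does it search all shards every time? When you look at the cluster image: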

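It isn't clear what exactly is in shard1, shard2 and shard3. We go from Term -> DocumentId -> Document to the "rectangular" shards, but what exactly does a shard contain?

I'd appreciate it if someone could explain this using the pictures above.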

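Best answer

Theory

Elasticsearch is built on top of Lucene. Each shard is simply a Lucene index, and a Lucene index, simplified, is an inverted index. So every Elasticsearch index is a set of shards, that is, a set of Lucene indexes, and each shard's inverted index covers only the documents stored on that shard. When you query for documents, Elasticsearch sends a sub-query to all shards, merges the results and returns them to you. When you index a document, Elasticsearch uses this formula to work out which shard the document should be written to: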

shard = hash(routing) % number_of_primary_shards

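By default, Elasticsearch uses the document id as the routing value. If you specify the routing parameter, it is used instead of the id. You can use the routing parameter in search queries and in requests that index, delete or update documents. The hash function used by default is MurmurHash3.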
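To make the formula concrete, here is a minimal Python sketch. It is only an illustration: zlib.crc32 stands in for MurmurHash3, so the shard numbers it prints will not match what a real cluster picks, and pick_shard and routing_for are hypothetical helper names, not part of any Elasticsearch API.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards

    zlib.crc32 is a stand-in hash for this sketch; Elasticsearch uses
    MurmurHash3, so real shard numbers will differ.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def routing_for(doc_id, routing=None):
    """The document id is the routing value unless one is given explicitly."""
    return doc_id if routing is None else routing


# Default routing: the id alone decides the shard.
for doc_id in ["1", "2", "3", "4"]:
    print(doc_id, "->", pick_shard(routing_for(doc_id), 3))

# An explicit routing value replaces the id in the formula.
print(pick_shard(routing_for("1", routing="user42"), 3))
```

The same routing value always lands on the same shard, which is what lets Elasticsearch find a document again without asking every shard. It also suggests why the number of primary shards is fixed when the index is created: changing the divisor would send most routing values to a different shard.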

Example

Create an index with 3 shards

$ curl -XPUT localhost:9200/so -H 'Content-Type: application/json' -d '
{ 
    "settings" : { 
        "index" : { 
            "number_of_shards" : 3, 
            "number_of_replicas" : 0 
        } 
    } 
}'

Index a document

$ curl -XPUT localhost:9200/so/question/1 -H 'Content-Type: application/json' -d '
{ 
    "number" : 47011047, 
    "title" : "need elasticsearch index sharding explanation" 
}'

Query without routing

$ curl "localhost:9200/so/question/_search?pretty"

Response

Look at _shards.total: this is the number of shards that were queried. Also note that the document was found.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        }
      }
    ]
  }
}
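The _shards.total of 3 above reflects the fan-out described in the theory section. As a rough sketch of why that is needed, the Python below models each shard as its own small inverted index. The Shard class and search_all function are hypothetical names for illustration, and the texts of doc2 and doc3 are made up; only the doc1 terms come from the question.

```python
from collections import defaultdict


class Shard:
    """Stand-in for one Lucene index: an inverted index plus stored documents."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs on THIS shard
        self.docs = {}                    # doc id -> original text

    def index(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


def search_all(shards, term):
    """Scatter the query to every shard, then gather and merge the hits."""
    hits = []
    for shard in shards:
        hits.extend(shard.search(term))
    return sorted(hits)


shards = [Shard(), Shard(), Shard()]
shards[0].index("doc1", "blue bright butterfly breeze hangs")
shards[1].index("doc2", "best forget great")   # made-up text
shards[2].index("doc3", "forget breeze")       # made-up text

print(sorted(shards[0].postings))    # only the terms that occur in doc1
print(search_all(shards, "forget"))  # hits come from the other two shards
```

In this model no shard knows which terms the other shards hold, so a search without routing has to ask all of them. The first shard simply reports no hits for "forget".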

Query with the correct routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=1&pretty"

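Response

_shards.total is now 1, because we specified routing and Elasticsearch knows which shard to ask for the document. With the explain=true parameter I ask Elasticsearch for additional information about the query. Note hits._shard: it is set to [so][2]. This means our document is stored in shard 2 of the so index (shard numbers start at 0).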

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[so][2]",
        "_node" : "2skA6yiPSVOInMX0ZsD91Q",
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        },
        ...
}

Query with incorrect routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=2&pretty"

Response

_shards.total is again 1. But Elasticsearch returned nothing for our query, because we specified the wrong routing and Elasticsearch queried a shard that does not hold the document.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
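The three queries above can be imitated with a small in-memory model. This is a sketch under stated assumptions, not how Elasticsearch is implemented: CRC32 again stands in for MurmurHash3, each shard is a plain dict, and index, search and pick_shard are hypothetical helpers. Because the stand-in hash maps values differently from a real cluster, the "wrong" routing value is found by trial rather than hard-coded as 2.

```python
import zlib

NUM_SHARDS = 3


def pick_shard(routing):
    # Stand-in hash (CRC32); Elasticsearch uses MurmurHash3.
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


shards = [dict() for _ in range(NUM_SHARDS)]  # each dict: doc id -> document


def index(doc_id, doc, routing=None):
    # The id is the routing value unless one is passed explicitly.
    shards[pick_shard(routing or doc_id)][doc_id] = doc


def search(routing=None):
    """Match-all search; returns (number of shards queried, hits)."""
    targets = range(NUM_SHARDS) if routing is None else [pick_shard(routing)]
    hits = [doc for i in targets for doc in shards[i].values()]
    return len(targets), hits


index("1", {"number": 47011047})

home = pick_shard("1")
# Find some routing value that maps to a different shard than the document's.
wrong = next(str(n) for n in range(2, 100) if pick_shard(str(n)) != home)

print(search())               # all 3 shards queried, 1 hit
print(search(routing="1"))    # 1 shard queried, 1 hit
print(search(routing=wrong))  # 1 shard queried, 0 hits
```

The last call mirrors the empty response above: routing narrows the search to a single shard, so a routing value that differs from the one used at index time can make an existing document invisible to the query.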

Additional information

A similar question about Elasticsearch index sharding was found on Stack Overflow: https://stackoverflow.com/questions/47003336/
