elasticsearch - Word-oriented completion suggester (ElasticSearch 5.x)

Tags: elasticsearch autocomplete duplicates elasticsearch-5

ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). The most notable change is the following:

Completion suggester is document-oriented

Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.

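In short, every completion query returns all matching documents rather than just the matching words. And herein lies the problem: if the autocompleted word occurs in more than one document, it is returned once per document.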

Let's say we have this simple mapping:

{
   "my-index": {
      "mappings": {
         "users": {
            "properties": {
               "firstName": {
                  "type": "text"
               },
               "lastName": {
                  "type": "text"
               },
               "suggest": {
                  "type": "completion",
                  "analyzer": "simple"
               }
            }
         }
      }
   }
}

Along with a few test documents:

{
   "_index": "my-index",
   "_type": "users",
   "_id": "1",
   "_source": {
      "firstName": "John",
      "lastName": "Doe",
      "suggest": [
         {
            "input": [
               "John",
               "Doe"
            ]
         }
      ]
   }
},
{
   "_index": "my-index",
   "_type": "users",
   "_id": "2",
   "_source": {
      "firstName": "John",
      "lastName": "Smith",
      "suggest": [
         {
            "input": [
               "John",
               "Smith"
            ]
         }
      ]
   }
}

And a by-the-book query:

POST /my-index/_suggest?pretty
{
    "my-suggest" : {
        "text" : "joh",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Which yields the following results:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "1",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Doe",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Doe"
                       ]
                    }
                 ]
               }
            },
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "2",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Smith",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Smith"
                       ]
                    }
                 ]
               }
            }
         ]
      }
   ]
}

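Simply put, for a completion suggestion on the text "joh", two (2) documents were returned: both are Johns, and both have the same value for the text property.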

However, I would like to receive a single (1) word. Something simple like this:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
          "John"
         ]
      }
   ]
}
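
For comparison, the desired shape can be approximated on the client by collapsing the returned options on their text value. This is only a sketch of a workaround, not something the Suggester API does for you: the helper name unique_suggestions is hypothetical, and the response dictionary is a trimmed copy of the one shown earlier.

```python
from typing import Any, Dict, List


def unique_suggestions(response: Dict[str, Any], name: str) -> List[str]:
    """Collect the distinct option texts of one suggester, in first-seen order."""
    seen = set()
    words: List[str] = []
    for entry in response.get(name, []):
        for option in entry.get("options", []):
            text = option["text"]
            if text not in seen:
                seen.add(text)
                words.append(text)
    return words


# Trimmed copy of the suggest response shown earlier (_source bodies omitted).
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1", "_score": 1},
                {"text": "John", "_id": "2", "_score": 1},
            ],
        }
    ]
}

print(unique_suggestions(response, "my-suggest"))  # prints ['John']
```

The drawback is that the suggester's size limit is applied before the duplicates are collapsed, so a word shared by many documents can crowd every other word out of the response.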

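Question: how can I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.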

Is the "Completion Suggester" even appropriate for my scenario, or should I use a completely different approach?


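EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:

  1. Keeping the new index in sync.
  2. Auto-completing subsequent words would probably be global rather than narrowed down. For example, say the additional index holds the following words: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".

To overcome the second point, indexing single words is not enough, since you would also need to map every word to its documents in order to narrow down the auto-completion of subsequent words properly. With that, you end up with the same problem as when querying the original index, so the additional index no longer makes sense.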
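
The difference between global and narrowed-down completion can be sketched in plain Python. The word-to-document table below is hypothetical: it assumes the four example words come from two documents, "John Doe" (id 1) and "David Smith" (id 2), and that the query is not empty.

```python
from typing import Dict, List, Set

# Hypothetical word index: each word maps to the ids of the documents containing it.
WORD_TO_DOCS: Dict[str, Set[int]] = {
    "John": {1},
    "Doe": {1},
    "David": {2},
    "Smith": {2},
}


def complete_global(query: str) -> List[str]:
    """Complete the last (partial) word against every indexed word."""
    prefix = query.split()[-1].lower()
    return sorted(w for w in WORD_TO_DOCS if w.lower().startswith(prefix))


def complete_scoped(query: str) -> List[str]:
    """Complete the last word only within documents that contain the earlier words."""
    *complete, partial = query.split()
    candidates = set.union(*WORD_TO_DOCS.values())
    for word in complete:
        matches = [d for w, d in WORD_TO_DOCS.items() if w.lower() == word.lower()]
        candidates &= set.union(set(), *matches)
    prefix = partial.lower()
    return sorted(
        w for w, docs in WORD_TO_DOCS.items()
        if w.lower().startswith(prefix) and docs & candidates
    )


print(complete_global("John D"))  # prints ['David', 'Doe']
print(complete_scoped("John D"))  # prints ['Doe']
```

The scoped variant only works because every word carries its document ids, which is exactly the word-to-document mapping that makes a separate word-only index pointless.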

Best answer

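As hinted at in the comments, another way of achieving this without getting duplicate documents is to create a sub-field that contains edge n-grams of the field. (The mapping below does this on a dedicated autocomplete field holding the full name, rather than on firstName itself.) First, define your mapping like this: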

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}
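
To see what ends up in the autocomplete.completion sub-field, the analyzer chain can be approximated in a few lines of Python. This is a rough sketch rather than Elasticsearch's own implementation: it assumes the keyword tokenizer emits the whole value as one token and that lowercase and edge_ngram are applied in that order, as configured above.

```python
from typing import List


def completion_analyzer(text: str, min_gram: int = 1, max_gram: int = 24) -> List[str]:
    """Approximate completion_analyzer: keyword tokenizer, lowercase, edge_ngram."""
    token = text.lower()  # the keyword tokenizer keeps the whole value as one token
    longest = min(len(token), max_gram)
    # Emit every prefix of the token from min_gram up to max_gram characters.
    return [token[:n] for n in range(min_gram, longest + 1)]


print(completion_analyzer("John Doe"))
# prints ['j', 'jo', 'joh', 'john', 'john ', 'john d', 'john do', 'john doe']
```

Because every prefix of the lowercased full name is indexed as its own term, an exact term lookup on a lowercase prefix such as "john d" matches every name that starts with it.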

Then you index a few documents:

POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }

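Then you can query for john d and get one result for John Doe and another one for John Deere. The term query is not analyzed, so the prefix has to be sent in lowercase: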

POST my-index/_search
{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}

Results:

{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}
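
On the client, the suggestions are simply the bucket keys. A minimal sketch follows; the helper name suggestions_from_buckets is hypothetical, and the response dictionary is copied from the result above.

```python
from typing import Any, Dict, List


def suggestions_from_buckets(response: Dict[str, Any], agg_name: str = "suggestions") -> List[str]:
    """Return the bucket keys of a terms aggregation, in the order returned."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Copy of the aggregation response shown above.
response = {
    "aggregations": {
        "suggestions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "John Doe", "doc_count": 1},
                {"key": "John Deere", "doc_count": 1},
            ],
        }
    }
}

print(suggestions_from_buckets(response))  # prints ['John Doe', 'John Deere']
```

Each distinct autocomplete.raw value appears once no matter how many documents share it, with doc_count giving the number of matching documents. Note that a terms aggregation returns 10 buckets by default, so its size parameter may need adjusting.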

UPDATE (June 25th, 2019):

ES 7.2 introduced a new data type called search_as_you_type that supports this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
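
As a rough sketch of what that could look like, the request bodies below are built as Python dictionaries. The field name autocomplete is reused from the answer above as an assumption, and prefix_query is a hypothetical helper. The multi_match query of type bool_prefix over the field and its _2gram and _3gram sub-fields follows the pattern in the linked documentation.

```python
import json
from typing import Any, Dict

# Assumed ES 7.2+ index body (typeless mapping); "autocomplete" is an assumed field name.
index_body: Dict[str, Any] = {
    "mappings": {
        "properties": {
            "autocomplete": {"type": "search_as_you_type"}
        }
    }
}


def prefix_query(text: str, field: str = "autocomplete") -> Dict[str, Any]:
    """Build a bool_prefix multi_match query over the field and its shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


print(json.dumps(prefix_query("john d"), indent=2))
```

The hits of such a query are still documents, so if several documents share the same value, a terms aggregation like the one above may still be needed to get distinct suggestions.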

Regarding elasticsearch - Word-oriented completion suggester (ElasticSearch 5.x), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41744712/
