elasticsearch - Elasticsearch highlighter false positives

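I'm using the nGram tokenizer in ES 6.1.1 and getting some odd highlights:

- multiple adjacent character ngram highlights are not merged into one
- tra is wrongly highlighted in doc 9

The query auftrag matches docs 7 and 9 as expected, but in doc 9 betrag is incorrectly highlighted. That's a problem with the highlighter: if the problem were with the query, doc 8 would also have been returned.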

Example code

#!/usr/bin/env bash

# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from
# https://github.com/elastic/elasticsearch/issues/21000

# Delete the index if it exists

curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'

# Create a new index

curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
    "settings": {
        "analysis": {
            "analyzer": {
                "trigrams": {
                    "tokenizer": "my_ngram_tokenizer",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "my_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": "3",
                    "max_gram": "3",
                    "token_chars": [
                        "letter",
                        "digit",
                        "symbol",
                        "punctuation"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type": "text",
                    "analyzer": "trigrams",
                    "term_vector": "with_positions_offsets"
                }
            }
        }
    }
}
'
printf '\n-------------\n'

# Populate the index

curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1  # Give ES time to index

# Query

curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "text": {
                "query": "auftrag",
                "minimum_should_match": "100%"
            }
        }
    },
    "highlight": {
        "fields": {
            "text": {
                "fragment_size": 120,
                "type": "fvh"
            }
        }
    }
}
'

The hits I get are (abbreviated):

"hits" : [
      {
        "_id" : "9",
        "_source" : {
          "text" : "betrag auftragen"
        },
        "highlight" : {
          "text" : [
            "be<em>tra</em>g <em>auf</em><em>tra</em>gen"
          ]
        }
      },
      {
        "_id" : "7",
        "_source" : {
          "text" : "auftragen"
        },
        "highlight" : {
          "text" : [
            "<em>auf</em><em>tra</em>gen"
          ]
        }
      }
    ]

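I tried various workarounds, such as using the unified and fvh highlighters and setting every option that seemed relevant, but had no luck. Any hints are greatly appreciated.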

Best answer

The problem here is not the highlighting but the way you are using the nGram analyzer.

First, when you configure the mapping this way:

"mappings": {
  "my_type": {
    "properties": {
      "text": {
        "type"       : "text",
        "analyzer"   : "trigrams",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}

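you are telling Elasticsearch to use the trigrams analyzer both for the indexed text and for the search term you provide. In your case, this means:

The text from doc 9 = "betrag auftragen" is split into trigrams, so the index contains: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]

The text from doc 7 = "auftragen" is split into trigrams, so the index contains: [auf, uft, ftr, tra, rag, age, gen]

Your search term = "auftrag" is also split into trigrams, so Elasticsearch sees it as: [auf, uft, ftr, tra, rag]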
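The trigram lists above can be reproduced with a few lines of Python. This is only a rough sketch of what a character n-gram tokenizer with min_gram = max_gram = 3 followed by a lowercase filter does. The helper name `char_ngrams` is made up for illustration, and it simply splits on whitespace, which approximates the `token_chars` setting for these inputs but is not Elasticsearch's actual implementation.

```python
def char_ngrams(text, min_gram=3, max_gram=3):
    """Return lowercased character n-grams for each whitespace-separated word.

    Hypothetical helper: it only approximates the nGram tokenizer plus the
    lowercase filter for simple inputs like the ones in this question.
    """
    grams = []
    for word in text.lower().split():
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.append(word[start:start + size])
    return grams


print(char_ngrams("betrag auftragen"))  # doc 9
print(char_ngrams("auftragen"))         # doc 7
print(char_ngrams("auftrag"))           # search term
```

To inspect the tokens a real cluster produces, the `_analyze` API on the index is the authoritative check.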

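Finally, Elasticsearch matches all trigrams from the search against the trigrams in the index, which is why 'auf' and 'tra' are highlighted separately. 'uft', 'ftr' and 'rag' match as well, but they overlap 'auf' and 'tra' and are not highlighted.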
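To see why doc 8 is not returned while "tra" inside "betrag" still lights up in doc 9, here is a rough model in Python. It assumes that a match query with minimum_should_match of 100% requires every query trigram to be present in the document, and that the highlighter marks any occurrence of a matched term. The helper names are hypothetical, and the overlap handling of the real highlighters is not modeled.

```python
def trigrams(text):
    """Character trigrams per whitespace-separated word (simplified model)."""
    return [w[i:i + 3] for w in text.lower().split() for i in range(len(w) - 2)]


def matches_all(query, doc):
    """True if every query trigram occurs somewhere in the document."""
    return set(trigrams(query)) <= set(trigrams(doc))


def matched_terms_by_word(query, doc):
    """For each word in doc, the trigrams it shares with the query, in order."""
    wanted = set(trigrams(query))
    return {w: [g for g in trigrams(w) if g in wanted] for w in doc.lower().split()}


docs = {7: "auftragen", 8: "betrag", 9: "betrag auftragen"}
for doc_id, text in docs.items():
    print(doc_id, matches_all("auftrag", text), matched_terms_by_word("auftrag", text))
```

In this model doc 8 fails because it lacks auf, uft and ftr, while doc 9 passes thanks to "auftragen". The terms tra and rag then also occur in "betrag", which is where the unexpected highlight comes from.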

First, you need to tell Elasticsearch that you don't want the search term split into grams. All you need to do is add the search_analyzer property to the mapping:

"mappings": {
  "my_type": {
    "properties": {
      "text": {
        "type"           : "text",
        "analyzer"       : "trigrams",
        "search_analyzer": "standard",
        "term_vector"    : "with_positions_offsets"
      }
    }
  }
}

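Now the standard analyzer keeps each word in the search term as a single token, so in your case the only search token is "auftrag".

On its own, however, this change will not help. It will even break the search, because "auftrag" does not match any trigram in your index.

So next you need to improve the nGram tokenizer by increasing max_gram: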

"tokenizer": {
  "my_ngram_tokenizer": {
    "type": "nGram",
    "min_gram": "3",
    "max_gram": "10",
    "token_chars": [
      "letter",
      "digit",
      "symbol",
      "punctuation"
    ]
  }
}

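This way, the text in the index is split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams and 10-grams. Among the 7-grams you will find "auftrag", which is your search term.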
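A quick way to convince yourself is to enumerate the grams in Python. The sketch below uses a made-up helper, `char_ngrams`, that approximates an n-gram tokenizer with min_gram 3 and max_gram 10 applied to one word at a time. It is not Elasticsearch's implementation.

```python
def char_ngrams(word, min_gram=3, max_gram=10):
    """All substrings of `word` whose length is between min_gram and max_gram."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]


for word in "betrag auftragen".split():
    grams = char_ngrams(word)
    # Print the word, how many grams it yields, and whether the
    # whole search term "auftrag" is one of them.
    print(word, len(grams), "auftrag" in grams)
```

Under these assumptions "auftragen" yields 28 grams instead of 7 trigrams, and only it contains "auftrag" as a whole gram, so "betrag" no longer shares a term with the search.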

After these two improvements, the highlighting in your search results should look like this:

"betrag <em>auftrag</em>en"

for doc 9, and:

"<em>auftrag</em>en"

for doc 7.

That is how ngrams and highlighting work together. I know the ES documentation says:

It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.

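And that is true. For performance reasons you will need to experiment with this configuration, but I hope I have explained how it works.

Regarding elasticsearch - Elasticsearch highlighter false positives, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49096468/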
