database - Why does Elasticsearch not return documents that contain nearly equal words?

Tags: database elasticsearch search text full-text-search

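I want to use Elasticsearch to find documents that are relevant to a search term supplied by the user (the document text is in Dutch, and the words the user searches for are assumed to be in Dutch as well).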

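I also use synonyms so that search terms that are spelled differently but mean the same thing in Dutch return the same documents. I store these synonyms in a synonyms.txt file in the Elasticsearch config folder.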

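To test whether the search works, I use the word loopbaan as an example of something a user might search for. In the synonyms.txt file I have linked this word to its synonym carriere, using the following format: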

...
loopbaan, carriere
...

Now, when I analyze loopbaan with my analyzer, like this:
GET /documents/_analyze
{
    "analyzer": "test_analyzer",
    "text": "loopbaan"
}

I get the following result:
{
    "tokens": [
        {
            "token": "loopban",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "carrier",
            "start_offset": 0,
            "end_offset": 8,
            "type": "SYNONYM",
            "position": 0
        }
    ]
}

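I know that loopbaan is converted to loopban because I am using a Dutch stemmer, but loopban does NOT mean the same thing as loopbaan in Dutch, and it does NOT occur in any of the text I have indexed in my documents.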

So, when I search for loopbaan with the following query:
{
    "query": {
        "simple_query_string": {
            "query": "loopbaan",
            "fields": [
                "content^1.0"
            ],
            "analyzer": "test_analyzer",
            "flags": -1,
            "default_operator": "or",
            "analyze_wildcard": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "fuzzy_transpositions": true,
            "boost": 1
        }
    }
}

I get no results:
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

Question:
How can I get the expected results when searching for the word "loopbaan" (I know that at least 5 documents contain the word "loopbaan")?

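Note: I know that a stemmer override exists in Elasticsearch, but I want the search to be as generic as possible, and I do not want to add a word to the stemmer override every time the Dutch stemmer does a poor job. I also want a search for the plural of loopbaan (i.e. loopbanen) to return exactly the same results as a search for loopbaan. That is why I am using a stemmer.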

This is how I created the documents index:
PUT /documents
{
    "aliases": {},
    "mappings": {
        "properties": {
            "content": {
                "type": "text"
            },
            "title": {
                "type": "text"
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "test_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt",
                    "lenient": "true"
                },
                "dutch_stemmer": {
                    "type": "stemmer",
                    "language": "dutch"
                },
                "dutch_stopwords": {
                    "type": "stop",
                    "stopwords": "_dutch_"
                },
                "test_ascii_folding": {
                    "type": "asciifolding"
                }
            },
            "analyzer": {
                "test_analyzer": {
                    "filter": [
                        "lowercase",
                        "test_ascii_folding",
                        "dutch_stopwords",
                        "dutch_stemmer",
                        "test_synonyms"
                    ],
                    "tokenizer": "standard"
                }
            }
        }
    }
}

UPDATE:

2 synonym rules to reproduce the problem:
loopbaan, carriere => loopbaan, carriere
schakelen, koppelen, toggelen => schakelen, koppelen, toggelen

3 documents to reproduce the problem (the first and third examples should match loopbaan and loopbanen, because they contain carriere):
{
   "title": "Hoezo is dit goed gedaan in het onderwijs?",
   "content": "Werken is goed voor de mensen die in Nederlands wonen. Het verbetert de economie en de welzijn van de mensen. Carrière opbouwen is ook zeer belangrijk voor de specialisatie van de nederlandse mensen in onze samenleving."
},
{
   "title": "Dit slaat toch nergens op dat mensen dit kunnen doen.",
   "content": "Mensen moeten koppelen. Wat nou \"dit\" is in deze context weet ik ook niet maar ja zo kan je zien dat elke woord zomaar iets kan betekenen toch? Zou zeggen van wel maar dit heeft niks te maken met iets dus de mazzel."
},
{
   "title": "Werken moet door iedereen gedaan worden en niet alleen door paar mensen in nederland",
   "content": "Werken moet door iedereen gedaan worden en niet alleen door paar mensen in nederland. Het moet echt zo zijn dat mensen carrieres opbouwen en niet alleen thuis zitten, want dat is slecht voor gezondheid van de mensen en de economie over het algemeen."
}

Best answer

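You are indexing with one analyzer and searching with another: content is mapped as plain text, so it is indexed with the default standard analyzer (which keeps loopbaan as it is), while your query is analyzed with test_analyzer (which produces loopban and carrier). The recommended way of doing this can be found here.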
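To make the mismatch concrete, here is a small Python sketch. It does not call Elasticsearch. `standard_like_tokens` is a hypothetical, rough stand-in for the default standard analyzer (lowercase and split on non-word characters), and `QUERY_TOKENS` is copied from the `_analyze` output in the question. The sentence containing loopbaan is an invented example, not one of the asker's documents.

```python
import re


def standard_like_tokens(text: str) -> set[str]:
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


# Tokens that test_analyzer produced for "loopbaan" in the question's _analyze output.
QUERY_TOKENS = {"loopban", "carrier"}


def would_match(document_text: str, query_tokens: set[str]) -> bool:
    """A term query can only hit when the index holds one of the query tokens."""
    return bool(standard_like_tokens(document_text) & query_tokens)


# Excerpt from the third sample document: indexed as "carrieres", never as "carrier".
print(would_match("Het moet echt zo zijn dat mensen carrieres opbouwen", QUERY_TOKENS))  # False

# Invented sentence: indexed as "loopbaan", never as the stemmed "loopban".
print(would_match("Mijn loopbaan begon vroeg", QUERY_TOKENS))  # False
```

Both checks print False because the stemmed query tokens never appear among the unstemmed index tokens, which is consistent with the empty result in the question.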

There are two ways to get what you need.

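  • You can use multi-fields. You apply your custom analyzer to one sub-field, and at query time you use exactly the same analyzer that was used at index time (for the main field this is the standard analyzer, which is the default for text and can be omitted).
    {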
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "fields": {
              "stemmed": {
                "type": "text",
                "analyzer": "test_analyzer"
              }
            }
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
    
    {
      "query": {
        "simple_query_string": {
          "query": "loopbaan",
          "fields": [
            "content^1.0",
            "content.stemmed^1.0"
          ],
          "analyzer": "test_analyzer",
          "flags": -1,
          "default_operator": "or",
          "analyze_wildcard": false,
          "auto_generate_synonyms_phrase_query": true,
          "fuzzy_prefix_length": 0,
          "fuzzy_max_expansions": 50,
          "fuzzy_transpositions": true,
          "boost": 1
        }
      }
    }
    

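    This solution is heavier on your cluster, because it makes your index larger.
  • You can modify your query so that the search text is analyzed twice, and wrap both versions in the should clause of a bool (boolean) query. Basically you do this:
    Match MY_QUERY (analyzed with my custom analyzer)
    OR
    Match MY_QUERY (analyzed with the same analyzer my field used when it was indexed)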
    
    {
      "query": {
        "bool": {
          "minimum_should_match": 1, 
          "should": [
            {
              "simple_query_string": {
                "query": "loopbaan",
                "fields": [
                  "content^1.0"
                ],
                "analyzer": "test_analyzer",
                "flags": -1,
                "default_operator": "or",
                "analyze_wildcard": false,
                "auto_generate_synonyms_phrase_query": true,
                "fuzzy_prefix_length": 0,
                "fuzzy_max_expansions": 50,
                "fuzzy_transpositions": true,
                "boost": 1
              }
            },
            {
              "simple_query_string": {
                "query": "loopbaan",
                "fields": [
                  "content^1.0"
                ],
                "flags": -1,
                "default_operator": "or",
                "analyze_wildcard": false,
                "auto_generate_synonyms_phrase_query": true,
                "fuzzy_prefix_length": 0,
                "fuzzy_max_expansions": 50,
                "fuzzy_transpositions": true,
                "boost": 1
              }
            }
          ]
        }
      }
    }
    

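    I would go with the second option.

    In short, you can choose between analyzing the documents twice or analyzing the query twice. It is up to you.

    Update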
    PUT documents
    {
      "aliases": {},
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "test_analyzer_without_stemmer"
          },
          "title": {
            "type": "text"
          }
        }
      },
      "settings": {
        "analysis": {
          "filter": {
            "test_synonyms": {
              "type": "synonym",
              "synonyms": [
                "loopbaan,carriere,carrieres",
                "schakelen,koppelen,toggelen"
              ],
              "lenient": "true"
            },
            "dutch_stemmer": {
              "type": "stemmer",
              "language": "dutch"
            },
            "dutch_stopwords": {
              "type": "stop",
              "stopwords": "_dutch_"
            },
            "test_ascii_folding": {
              "type": "asciifolding"
            }
          },
          "analyzer": {
            "test_analyzer": {
              "filter": [
                "lowercase",
                "test_ascii_folding",
                "dutch_stopwords",
                "dutch_stemmer",
                "test_synonyms"
              ],
              "tokenizer": "standard"
            },
            "test_analyzer_without_stemmer": {
              "filter": [
                "lowercase",
                "test_ascii_folding",
                "dutch_stopwords",
                "test_synonyms"
              ],
              "tokenizer": "standard"
            }
          }
        }
      }
    }
    
    PUT documents/_doc/1
    {
       "title": "Hoezo is dit goed gedaan in het onderwijs?",
       "content": "Werken is goed voor de mensen die in Nederlands wonen. Het verbetert de economie en de welzijn van de mensen. Carrière opbouwen is ook zeer belangrijk voor de specialisatie van de nederlandse mensen in onze samenleving."
    }
    
    PUT documents/_doc/2
    {
       "title": "Dit slaat toch nergens op dat mensen dit kunnen doen.",
       "content": "Mensen moeten koppelen. Wat nou \"dit\" is in deze context weet ik ook niet maar ja zo kan je zien dat elke woord zomaar iets kan betekenen toch? Zou zeggen van wel maar dit heeft niks te maken met iets dus de mazzel."
    }
    
    PUT documents/_doc/3
    {
       "title": "Werken moet door iedereen gedaan worden en niet alleen door paar mensen in nederland",
       "content": "Werken moet door iedereen gedaan worden en niet alleen door paar mensen in nederland. Het moet echt zo zijn dat mensen carrieres opbouwen en niet alleen thuis zitten, want dat is slecht voor gezondheid van de mensen en de economie over het algemeen."
    }
    
    GET documents/_search
    {
      "query": {
        "bool": {
          "minimum_should_match": 1, 
          "should": [
            {
              "simple_query_string": {
                "query": "loopbaan",
                "fields": [
                  "content"
                ],
                "analyzer": "test_analyzer",
                "flags": -1,
                "default_operator": "or",
                "analyze_wildcard": false,
                "auto_generate_synonyms_phrase_query": true,
                "fuzzy_prefix_length": 0,
                "fuzzy_max_expansions": 50,
                "fuzzy_transpositions": true,
                "boost": 1
              }
            },
            {
              "simple_query_string": {
                "query": "loopbaan",
                "fields": [
                  "content^1.0"
                ],
                "default_operator": "or",
                "flags": -1,
                "analyze_wildcard": false,
                "auto_generate_synonyms_phrase_query": true,
                "fuzzy_prefix_length": 0,
                "fuzzy_max_expansions": 50,
                "fuzzy_transpositions": true,
                "boost": 1
              }
            }
          ]
        }
      }
    }
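As a rough check of why the updated setup can return documents 1 and 3, the Python sketch below imitates test_analyzer_without_stemmer under simplifying assumptions: stop words are not removed, synonyms are modeled as plain group expansion, and ASCII folding is approximated with `unicodedata`. All function names are hypothetical and the document texts are short excerpts from the three sample documents. It models only the second should clause, which uses the field's own analyzer on both sides.

```python
import re
import unicodedata

# The two synonym groups from the updated index settings.
SYNONYM_GROUPS = [
    {"loopbaan", "carriere", "carrieres"},
    {"schakelen", "koppelen", "toggelen"},
]

# Short excerpts from the three sample documents, keyed by document id.
DOCS = {
    1: "Carrière opbouwen is ook zeer belangrijk",
    2: "Mensen moeten koppelen.",
    3: "mensen carrieres opbouwen en niet alleen thuis zitten",
}


def fold_ascii(text: str) -> str:
    """Approximate asciifolding: strip accents, so "è" becomes "e"."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def analyze_without_stemmer(text: str) -> set[str]:
    """Lowercase, fold, tokenize, then add every member of any synonym group that is hit."""
    tokens = set(re.findall(r"\w+", fold_ascii(text.lower())))
    expanded = set(tokens)
    for group in SYNONYM_GROUPS:
        if tokens & group:
            expanded |= group
    return expanded


def matches(query: str, document_text: str) -> bool:
    """Same analyzer on both sides: match when the token sets overlap."""
    return bool(analyze_without_stemmer(query) & analyze_without_stemmer(document_text))


for doc_id, text in DOCS.items():
    print(doc_id, matches("loopbaan", text))
```

In this simplified model, documents 1 and 3 share a token with the query through the first synonym group, while document 2 only contains terms from the second group.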
    

    This Q&A, "database - Why does Elasticsearch not return documents that contain nearly equal words?", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/61736602/
