elasticsearch - Assign a higher score to matches that contain the search query at an earlier position in Elasticsearch

Tags: elasticsearch n-gram relevance booleanquery

This question is similar to another question that Val answered.

I have an index containing 3 documents.

    {
        "firstname": "Anne",
        "lastname": "Borg"
    }

    {
        "firstname": "Leanne",
        "lastname": "Ray"
    }

    {
        "firstname": "Anne",
        "middlename": "M",
        "lastname": "Stone"
    }

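When I search for "Ann", I want Elasticsearch to return all 3 of these documents, because each of them matches the term "Ann" to some degree. However, I want Leanne Ray to get a lower score (relevance ranking), because the search term "Ann" appears at a later position in that document than in the other two.

Here are my index settings...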
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "token_chars": [
                        "letter",
                        "digit",
                        "custom"
                    ],
                    "custom_token_chars": "'-",
                    "min_gram": "1",
                    "type": "ngram",
                    "max_gram": "2"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}
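
To see what this analyzer feeds into scoring, here is a minimal Python sketch that approximates `my_analyzer` (an ngram tokenizer with `min_gram` 1 and `max_gram` 2, followed by lowercasing). The helper names `ngram_tokens` and `query_term_counts` are hypothetical, and the real Lucene tokenizer may differ in details. It only illustrates that the emitted tokens are short character grams and that a count of matching grams says nothing about how early in the name the match occurs.

```python
from collections import Counter


def ngram_tokens(text, min_gram=1, max_gram=2, extra_chars="'-"):
    """Approximate an ngram tokenizer (letters, digits, "'" and "-") plus lowercase."""
    # Split the text into chunks of allowed characters; anything else is a separator.
    chunks, current = [], []
    for ch in text:
        if ch.isalnum() or ch in extra_chars:
            current.append(ch)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))

    # Emit every gram of length min_gram..max_gram at every start position.
    tokens = []
    for chunk in chunks:
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    tokens.append(chunk[start:start + size].lower())
    return tokens


def query_term_counts(query, document):
    """Count how often each distinct query gram occurs among the document's grams."""
    doc_counts = Counter(ngram_tokens(document))
    return {tok: doc_counts[tok] for tok in sorted(set(ngram_tokens(query)))}


for name in ("Anne Borg", "Leanne Ray", "Anne M Stone"):
    print(name, query_term_counts("Ann", name))
```

In this approximation "Leanne Ray" contains the gram `a` twice and "Anne M Stone" contains `n` three times, while "Anne Borg" has no extra occurrences. Under a frequency-based similarity such as the default BM25, that is one plausible reason for Anne Borg ending up last, although the actual scores also depend on field length and index statistics.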

The following query returns the expected documents, but it scores Leanne Ray higher than Anne Borg.
{
    "query": {
        "bool": {
            "must": {
                "query_string": {
                    "query": "Ann",
                    "fields": ["full_name"]
                }
            },
            "should": {
                "match": {
                    "full_name": "Ann"
                }
            }
        }
    }
}

The results are as follows...
"hits": [
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "2",
            "_score": 6.6333585,
            "_source": {
                "firstname": "Anne",
                "middlename": "M",
                "lastname": "Stone"
            }
        },
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "1",
            "_score": 6.142234,
            "_source": {
                "firstname": "Leanne",
                "lastname": "Ray"
            }
        },
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "3",
            "_score": 6.079495,
            "_source": {
                "firstname": "Anne",
                "lastname": "Borg"
            }
        }
]

Using the ngram token filter together with the ngram tokenizer seems to solve this problem...
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "ngram"
                    ],
                    "tokenizer": "ngram"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}

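The same query now returns the expected results with the desired relative scores. Why does this work? Note that above I was using the ngram tokenizer with the lowercase filter; the only difference here is that I am using the ngram filter instead of the lowercase filter.

Here are the results. Note that Leanne Ray now scores lower than both Anne Borg and Anne M Stone.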
"hits": [
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "3",
        "_score": 4.953257,
        "_source": {
            "firstname": "Anne",
            "lastname": "Borg"
        }
    },
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.87168,
        "_source": {
            "firstname": "Anne",
            "middlename": "M",
            "lastname": "Stone"
        }
    },
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0364896,
        "_source": {
            "firstname": "Leanne",
            "lastname": "Ray"
        }
    }
]

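As an aside, when the index also contains other documents, this query returns a lot of false positives as well. That is not a big problem, because the false positives score very low relative to the ideal hits, but it is still not ideal. For example, if I add {firstname: Gideon, lastname: Grossma} to the index, the query above also returns that document in the result set, although with a much lower score than the documents containing the string "Ann".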
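
A small Python sketch can illustrate where such a false positive comes from. It assumes 1- and 2-character grams with case preserved (the second analyzer has no lowercase filter) and treats each name field as a separate value. The helper names `char_ngrams` and `shared_grams` are hypothetical. The extra ngram token filter is ignored here, since it only re-splits grams that are already at most two characters long and so should not change which distinct grams exist.

```python
def char_ngrams(text, min_gram=1, max_gram=2):
    """Return the set of distinct character n-grams of `text`, case preserved."""
    grams = set()
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.add(text[start:start + size])
    return grams


def shared_grams(query, values):
    """Return the grams of `query` that also occur in any of the field values."""
    doc_grams = set()
    for value in values:
        doc_grams |= char_ngrams(value)
    return char_ngrams(query) & doc_grams


print(sorted(shared_grams("Ann", ["Gideon", "Grossma"])))
print(sorted(shared_grams("Ann", ["Anne", "Borg"])))
```

In this sketch the only gram that "Gideon Grossma" shares with "Ann" is the single character `n`, whereas "Anne Borg" shares all four distinct grams. Because `query_string` and `match` combine terms with OR by default, one shared gram is enough for a document to be returned, which is consistent with the low but non-zero score described above.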

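Best answer

The answer is the same as in the linked thread. Since you are ngramming all of the indexed data, searching for Ann works the same way as searching for Anne. You get exactly the same response (see below), only with different scores: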

"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Jr-DHIBhYuDqANwSeiw",
    "_score" : 4.8442974,
    "_source" : {
      "firstname" : "Anne",
      "lastname" : "Borg"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5pr-DHIBhYuDqANwSeiw",
    "_score" : 4.828779,
    "_source" : {
      "firstname" : "Anne",
      "middlename" : "M",
      "lastname" : "Stone"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Zr-DHIBhYuDqANwSeiw",
    "_score" : 0.12874341,
    "_source" : {
      "firstname" : "Leanne",
      "lastname" : "Ray"
    }
  }
]

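UPDATE

Here is a modified query that can be used to check for partial terms (i.e. ann vs. anne). Again, casing makes no difference here, because the analyzer lowercases everything before indexing.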
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "ann",
          "fields": [
            "full_name"
          ]
        }
      },
      "should": [
        {
          "match_phrase_prefix": {
            "firstname": {
              "query": "ann",
              "boost": "10"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "lastname": {
              "query": "ann",
              "boost": "10"
            }
          }
        }
      ]
    }
  }
}
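
For reuse with other search terms, the query above could be assembled programmatically. The following Python sketch builds the same structure as a plain dictionary; the function name `build_prefix_boost_query` and its parameters are hypothetical, and sending the body to a cluster is left out. The `boost` is written as a number here, whereas the query above passes it as the string "10". The sketch assumes `firstname` and `lastname` keep their default analyzer, as in the mappings shown in the question, so that a phrase-prefix match on "ann" favors names that start with it.

```python
import json


def build_prefix_boost_query(term, ngram_field="full_name",
                             prefix_fields=("firstname", "lastname"), boost=10):
    """Build a bool query: ngram match is required, prefix matches add a boost."""
    return {
        "query": {
            "bool": {
                # Required clause: the term must match the ngram-analyzed field.
                "must": {
                    "query_string": {"query": term, "fields": [ngram_field]}
                },
                # Optional clauses: boost documents whose name starts with the term.
                "should": [
                    {"match_phrase_prefix": {field: {"query": term, "boost": boost}}}
                    for field in prefix_fields
                ],
            }
        }
    }


body = build_prefix_boost_query("ann")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the request body of a search against the index.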

Regarding "elasticsearch - Assign a higher score to matches that contain the search query at an earlier position in Elasticsearch", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61768534/
