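I need to find phrases in documents, searching both the title and the content. The title is more important than the content, so I would like results in this order: documents that match in both title and content first, then documents that match only in the title, then documents that match only in the content.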
This seems like a pretty basic thing.
So I created an index and data like this:
PUT /test_index
PUT /test_index/article/3263
{
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "abc"
}
PUT /test_index/article/1005
{
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
}
PUT /test_index/article/677
{
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
}
PUT /test_index/article/666
{
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
}
I run a query like this:
GET /test_index/_search
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "Lösungen",
"fields": ["pagetitle^2", "searchable_content"]
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
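But the results are not what I expected. Documents that match only in the title come first, and only after them the documents that match in both title and content, as shown below: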
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.5753642,
"hits": [
{
"_index": "test_index",
"_type": "article",
"_id": "3263",
"_score": 0.5753642,
"_source": {
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "abc"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "1005",
"_score": 0.36464313,
"_source": {
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
},
"highlight": {
"searchable_content": [
"test! <em>Lösungen</em> test?"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "677",
"_score": 0.36464313,
"_source": {
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test!"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "666",
"_score": 0.2876821,
"_source": {
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc"
]
}
}
]
}
}
What I then tried was to tune the field boosts a bit more. In the case above, it seems to work to set a boost on both fields and use the most_fields type, like this:
GET /test_index/_search
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "Lösungen",
"fields": ["pagetitle^3", "searchable_content^2"],
"type": "most_fields"
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
This gives the expected results for this data set.
However, if I add 2 more records:
PUT /test_index/article/999
{
"id": 999,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc double match Lösungen"
}
PUT /test_index/article/1006
{
"id": 1006,
"pagetitle": "Lösungen and Lösungen",
"searchable_content": "test sample"
}
it no longer works, because the results now look like this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 2.2315955,
"hits": [
{
"_index": "test_index",
"_type": "article",
"_id": "1006",
"_score": 2.2315955,
"_source": {
"id": 1006,
"pagetitle": "Lösungen and Lösungen",
"searchable_content": "test sample"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em> and <em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "666",
"_score": 1.219939,
"_source": {
"id": 666,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "1005",
"_score": 0.86785066,
"_source": {
"id": 1005,
"pagetitle": "Lösungen",
"searchable_content": "test! Lösungen test?"
},
"highlight": {
"searchable_content": [
"test! <em>Lösungen</em> test?"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "677",
"_score": 0.86785066,
"_source": {
"id": 677,
"pagetitle": "Lösungen",
"searchable_content": "test Lösungen test!"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test!"
],
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "3263",
"_score": 0.8630463,
"_source": {
"id": 3263,
"pagetitle": "Lösungen",
"searchable_content": "abc"
},
"highlight": {
"pagetitle": [
"<em>Lösungen</em>"
]
}
},
{
"_index": "test_index",
"_type": "article",
"_id": "999",
"_score": 0.7876096,
"_source": {
"id": 999,
"pagetitle": "abc",
"searchable_content": "test Lösungen test abc double match Lösungen"
},
"highlight": {
"searchable_content": [
"test <em>Lösungen</em> test abc double match <em>Lösungen</em>"
]
}
}
]
}
}
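So, as you can see, a document that matches only in the content (666) is now ranked higher than documents that match in both title and content (1005 and 677).
Can you explain what I am doing wrong and how to fix it?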
Best answer
Try a constant score query like this:
GET test_index/_search
{
"query": {
"bool": {
"should": [
{
"constant_score": {
"query": {
"match": {
"pagetitle": {
"query": "Lösungen"
}
}
},
"boost": 2
}
},
{
"constant_score": {
"query": {
"match": {
"searchable_content": "Lösungen"
}
}
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
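According to the docs, constant score: "...wraps another query and simply returns a constant score equal to the query boost for every document in the filter." (ref) The link from @davide will help you understand why a document that matches only in searchable_content can still score higher. Since you want to ignore term frequency and IDF across fields, you can use a constant score for the match on each field.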
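To make the effect of the constant scores concrete, here is a minimal Python sketch of how the query above would rank the six documents from the question. It is not Elasticsearch code: the `matches` helper is a crude stand-in for the analyzer, and the sketch assumes the bool query simply sums the scores of the matching should clauses.

```python
# Documents from the question: id -> (pagetitle, searchable_content)
DOCS = {
    3263: ("Lösungen", "abc"),
    1005: ("Lösungen", "test! Lösungen test?"),
    677: ("Lösungen", "test Lösungen test!"),
    666: ("abc", "test Lösungen test abc"),
    999: ("abc", "test Lösungen test abc double match Lösungen"),
    1006: ("Lösungen and Lösungen", "test sample"),
}

TITLE_BOOST = 2.0    # "boost": 2 on the pagetitle clause
CONTENT_BOOST = 1.0  # default boost on the searchable_content clause


def matches(text, term):
    # Assumption: lowercasing, whitespace splitting and punctuation
    # stripping approximate what the analyzer does for this data.
    tokens = [t.strip("!?.,").lower() for t in text.split()]
    return term.lower() in tokens


def constant_score(doc, term):
    # Each matching clause contributes exactly its boost, no matter
    # how often the term occurs or how long the field is.
    title, content = doc
    score = 0.0
    if matches(title, term):
        score += TITLE_BOOST
    if matches(content, term):
        score += CONTENT_BOOST
    return score


scores = {doc_id: constant_score(doc, "Lösungen") for doc_id, doc in DOCS.items()}
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, score)
```

Documents fall into three tiers: 3 for a match in both fields, 2 for a title-only match and 1 for a content-only match. Within a tier all documents tie, so 1006 (two title hits) and 3263 (one title hit) get the same score.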
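Edit:
By the rules listed in the original question, the query above works fine. However, based on the OP's comments, we also need to rank results by how often the search term occurs. So term frequency and inverse document frequency clearly do matter, but perhaps we care less about field length here (if we only want to rank results by the number of occurrences). In that case, I suggest setting up the index like this: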
POST test_index_v1
{
"mappings": {
"article": {
"properties": {
"id": {
"type": "long"
},
"pagetitle": {
"type": "string",
"norms": {
"enabled": false
}
},
"searchable_content": {
"type": "string",
"norms": {
"enabled": false
}
}
}
}
}
}
Note: In version 5 and above, type: string is replaced by type: text. The link mentioned by @davide describes what disabling norms does.
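As a rough illustration of what disabling norms changes, the Python sketch below evaluates the term-frequency part of the BM25 formula with and without the field-length term. It assumes the default BM25 parameters (k1 = 1.2, b = 0.75) and models disabled norms by dropping the length term; the average field length of 4 terms is a made-up value, and real norms are stored with reduced precision.

```python
K1 = 1.2   # BM25 defaults (assumed to be unchanged on this index)
B = 0.75


def tf_factor(freq, field_len=None, avg_field_len=None):
    """BM25 term-frequency factor.

    When no lengths are passed, the length term is dropped, which is
    how this sketch models a field with norms disabled.
    """
    if field_len is None or avg_field_len is None:
        length_norm = 1.0
    else:
        length_norm = 1.0 - B + B * field_len / avg_field_len
    return freq * (K1 + 1) / (freq + K1 * length_norm)


# searchable_content of doc 666: 4 terms, 1 occurrence of the search term.
# searchable_content of doc 999: 7 terms, 2 occurrences.
# avg_field_len=4 is a hypothetical value chosen for illustration.
with_norms_666 = tf_factor(1, field_len=4, avg_field_len=4)
with_norms_999 = tf_factor(2, field_len=7, avg_field_len=4)
no_norms_666 = tf_factor(1)
no_norms_999 = tf_factor(2)

print("with norms:   ", with_norms_666, with_norms_999)
print("without norms:", no_norms_666, no_norms_999)
```

With norms, the longer field of doc 999 eats up most of the gain from the second occurrence. Without norms, only the number of occurrences counts, which is the behaviour asked for here.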
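Secondly, since the query runs over a small number of documents and the index presumably has more than one shard, it is better to run it with search_type=dfs_query_then_fetch, because the shard-local IDF values can differ a lot. (Read this.)
Thirdly, the last query should additionally take the TF-IDF weights into account. As written, it ranks documents exactly the same regardless of whether the search term occurs 2 or 3 times in the same field.
We can add a bool-should block whose score is added to the score of the constant_score blocks, like this: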
GET test_index_v1/_search?search_type=dfs_query_then_fetch
{
"query": {
"bool": {
"should": [
{
"constant_score": {
"query": {
"match": {
"pagetitle": {
"query": "Lösungen"
}
}
},
"boost": 2
}
},
{
"constant_score": {
"query": {
"match": {
"searchable_content": "Lösungen"
}
}
}
},
{
"bool": {
"should": [
{
"match": {
"pagetitle": {
"query": "Lösungen",
"boost": 2
}
}
},
{
"match": {
"searchable_content": "Lösungen"
}
}
]
}
}
]
}
},
"highlight": {
"fields": {
"pagetitle": {},
"searchable_content": {}
}
}
}
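The reason for running this with search_type=dfs_query_then_fetch can be illustrated with the scores from the very first response in the question. The Python sketch below assumes BM25 scoring and a shard placement inferred from the numbers (doc 3263 alone on its shard, docs 1005 and 677 together on another); neither is stated in the question. For a one-term title whose length equals the shard average, the BM25 term-frequency factor is 1, so the score is just boost times IDF.

```python
import math


def bm25_idf(doc_count, docs_with_term):
    """BM25 inverse document frequency for one term in one field."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


TITLE_BOOST = 2  # pagetitle^2 in the first query of the question

# Shard-local statistics (assumed placement, see the text above).
alone = TITLE_BOOST * bm25_idf(doc_count=1, docs_with_term=1)
shared = TITLE_BOOST * bm25_idf(doc_count=2, docs_with_term=2)

# Global statistics for the first data set: 4 documents, 3 of which
# have the search term in pagetitle. Every title match sees this IDF.
global_score = TITLE_BOOST * bm25_idf(doc_count=4, docs_with_term=3)

print("3263, shard-local:    ", alone)
print("1005/677, shard-local:", shared)
print("any title match, dfs: ", global_score)
```

The two shard-local values agree with the reported scores 0.5753642 (doc 3263) and 0.36464313 (docs 1005 and 677) up to float precision. That is consistent with the title-only document outranking the others purely because of where it was stored, which is the distortion that global term statistics remove.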
Regarding "elasticsearch - query with multi_match does not return the expected order", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46213773/