elasticsearch - Query with multi_match does not return results in the expected order

Tags: elasticsearch

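I need to find a phrase in documents, searching both the title and the content. The title is more important than the content, so I would like to get results in this order:

  • first, documents that match in both title and content
  • then documents that match only in the title
  • then documents that match only in the content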

This seems like a pretty basic thing.

So I created the index and data like this:
    PUT /test_index
    
    PUT /test_index/article/3263
    {
      "id": 3263,
      "pagetitle": "Lösungen",
      "searchable_content": "abc"
    }
    
    
    PUT /test_index/article/1005
    {
      "id": 1005,
      "pagetitle": "Lösungen",
      "searchable_content": "test! Lösungen test?"
    }
    
    PUT /test_index/article/677
    {
      "id": 677,
      "pagetitle": "Lösungen",
      "searchable_content": "test Lösungen test!"
    }
    
    PUT /test_index/article/666
    {
      "id": 666,
      "pagetitle": "abc",
      "searchable_content": "test Lösungen test abc"
    }
    

I run a query like this:
    GET /test_index/_search
    {
        "query": {
            "bool": {
                "must": [{
                        "multi_match": {
                            "query": "Lösungen",
                            "fields": ["pagetitle^2", "searchable_content"]
                        }
                    }
                ]
            }
        },
        "highlight": {
            "fields": {
                "pagetitle": {},
                "searchable_content": {}
            }
        }
    }
    

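But the results are not what I expect. I get a document that matches only in the title before the documents that match in both title and content, as shown here: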
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0.5753642,
        "hits": [
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "3263",
            "_score": 0.5753642,
            "_source": {
              "id": 3263,
              "pagetitle": "Lösungen",
              "searchable_content": "abc"
            },
            "highlight": {
              "pagetitle": [
                "<em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "1005",
            "_score": 0.36464313,
            "_source": {
              "id": 1005,
              "pagetitle": "Lösungen",
              "searchable_content": "test! Lösungen test?"
            },
            "highlight": {
              "searchable_content": [
                "test! <em>Lösungen</em> test?"
              ],
              "pagetitle": [
                "<em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "677",
            "_score": 0.36464313,
            "_source": {
              "id": 677,
              "pagetitle": "Lösungen",
              "searchable_content": "test Lösungen test!"
            },
            "highlight": {
              "searchable_content": [
                "test <em>Lösungen</em> test!"
              ],
              "pagetitle": [
                "<em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "666",
            "_score": 0.2876821,
            "_source": {
              "id": 666,
              "pagetitle": "abc",
              "searchable_content": "test Lösungen test abc"
            },
            "highlight": {
              "searchable_content": [
                "test <em>Lösungen</em> test abc"
              ]
            }
          }
        ]
      }
    }
    

I then tried to control the ranking further by boosting fields. For the case above, setting a boost on both fields and using the type most_fields seemed to work:
    GET /test_index/_search
    {
        "query": {
            "bool": {
                "must": [{
                        "multi_match": {
                            "query": "Lösungen",
                            "fields": ["pagetitle^3", "searchable_content^2"],
                            "type": "most_fields"                       
                        }
                    }
                ]
            }
        },
        "highlight": {
            "fields": {
                "pagetitle": {},
                "searchable_content": {}
            }
        }
    }
    

This gives the expected results for this data set.

But if I add 2 extra records:
    PUT /test_index/article/999
    {
      "id": 999,
      "pagetitle": "abc",
      "searchable_content": "test Lösungen test abc double match Lösungen"
    }
    
    
    PUT /test_index/article/1006
    {
      "id": 1006,
      "pagetitle": "Lösungen and Lösungen",
      "searchable_content": "test sample"
    }
    

it no longer works, because the results now look like this:
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 6,
        "max_score": 2.2315955,
        "hits": [
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "1006",
            "_score": 2.2315955,
            "_source": {
              "id": 1006,
              "pagetitle": "Lösungen and Lösungen",
              "searchable_content": "test sample"
            },
            "highlight": {
              "pagetitle": [
                "<em>Lösungen</em> and <em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "666",
            "_score": 1.219939,
            "_source": {
              "id": 666,
              "pagetitle": "abc",
              "searchable_content": "test Lösungen test abc"
            },
            "highlight": {
              "searchable_content": [
                "test <em>Lösungen</em> test abc"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "1005",
            "_score": 0.86785066,
            "_source": {
              "id": 1005,
              "pagetitle": "Lösungen",
              "searchable_content": "test! Lösungen test?"
            },
            "highlight": {
              "searchable_content": [
                "test! <em>Lösungen</em> test?"
              ],
              "pagetitle": [
                "<em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "677",
            "_score": 0.86785066,
            "_source": {
              "id": 677,
              "pagetitle": "Lösungen",
              "searchable_content": "test Lösungen test!"
            },
            "highlight": {
              "searchable_content": [
                "test <em>Lösungen</em> test!"
              ],
              "pagetitle": [
                "<em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "3263",
            "_score": 0.8630463,
            "_source": {
              "id": 3263,
              "pagetitle": "Lösungen",
              "searchable_content": "abc"
            },
            "highlight": {
              "pagetitle": [
                "<em>Lösungen</em>"
              ]
            }
          },
          {
            "_index": "test_index",
            "_type": "article",
            "_id": "999",
            "_score": 0.7876096,
            "_source": {
              "id": 999,
              "pagetitle": "abc",
              "searchable_content": "test Lösungen test abc double match Lösungen"
            },
            "highlight": {
              "searchable_content": [
                "test <em>Lösungen</em> test abc double match <em>Lösungen</em>"
              ]
            }
          }
        ]
      }
    }
    

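So, as you can see, a document that matches only in the content (666) now ranks above documents that match in both title and content (1005 and 677).

Could you explain what I am doing wrong and how to fix it?

Best answer

Try constant_score, like this: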

    GET test_index/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "constant_score": {
                "query": {
                  "match": {
                    "pagetitle": {
                      "query": "Lösungen"
                    }
                  }
                },
                "boost": 2
              }
            },
            {
              "constant_score": {
                "query": {
                  "match": {
                    "searchable_content": "Lösungen"
                  }
                }
              }
            }
          ]
        }
      },
      "highlight": {
        "fields": {
          "pagetitle": {},
          "searchable_content": {}
        }
      }
    }
    
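The documentation describes constant_score as follows: "... wraps another query and simply returns a constant score equal to the query boost for every document in the filter." (ref)
The link from @davide will help you understand why a document can score higher even when it matches only on searchable_content. Since you want to ignore term frequency and IDF across fields, you can use a constant score for the match on each field.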
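To make the effect of the constant scores concrete, here is a minimal Python sketch that imitates the query above on the six documents from the question. It is not Elasticsearch code: `field_matches`, `constant_score` and `rank` are hypothetical helpers, the tokenizer is a crude stand-in for the real analyzer, and it assumes that the scores of the matching `should` clauses are simply added together.

```python
import re

# The six documents from the question, keyed by id.
DOCS = {
    3263: {"pagetitle": "Lösungen", "searchable_content": "abc"},
    1005: {"pagetitle": "Lösungen", "searchable_content": "test! Lösungen test?"},
    677: {"pagetitle": "Lösungen", "searchable_content": "test Lösungen test!"},
    666: {"pagetitle": "abc", "searchable_content": "test Lösungen test abc"},
    999: {"pagetitle": "abc",
          "searchable_content": "test Lösungen test abc double match Lösungen"},
    1006: {"pagetitle": "Lösungen and Lösungen", "searchable_content": "test sample"},
}


def field_matches(text, term):
    """Crude stand-in for a match query: lowercase, split on non-word characters."""
    return term.lower() in re.findall(r"\w+", text.lower())


def constant_score(doc, term, title_boost=2.0, content_boost=1.0):
    """Add one fixed value per matching field; TF, IDF and field length are ignored."""
    score = 0.0
    if field_matches(doc["pagetitle"], term):
        score += title_boost
    if field_matches(doc["searchable_content"], term):
        score += content_boost
    return score


def rank(docs, term):
    """Return (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, constant_score(doc, term)) for doc_id, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank(DOCS, "Lösungen"):
        print(doc_id, score)
```

Under these assumptions every document lands in one of three tiers: 3 for a match in both fields, 2 for a title-only match, 1 for a content-only match. The number of occurrences inside a field makes no difference, which is exactly what the rules in the question ask for.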
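Edit:
The query above works for the rules listed in the original question. However, based on the OP's comments, the results also need to be ranked by how often the search term occurs. So term frequency and inverse document frequency clearly do matter, but perhaps field length matters less here (if we only want to rank by the number of occurrences). In that case, I suggest setting up the index like this: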
    PUT test_index_v1
    {
      "mappings": {
          "article": {
            "properties": {
              "id": {
                "type": "long"
              },
              "pagetitle": {
                "type": "string",
                "norms": {
                  "enabled": false
                }
              },
              "searchable_content": {
                "type": "string",
                "norms": {
                  "enabled": false
                }
              }
            }
          }
       }
    }
    
Note: in version 5 and later, type: string is replaced by type: text.
The link mentioned by @davide describes what disabling norms does.
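As a rough illustration of what disabling norms changes, here is a sketch of the term-frequency part of BM25 with its default parameters k1 = 1.2 and b = 0.75. It assumes BM25 is the similarity in use (the default from Elasticsearch 5.0 on); older versions default to classic TF-IDF, where the formula differs but norms likewise carry the field-length factor. `term_frequency_part` is a hypothetical helper, and it works with exact field lengths, whereas Lucene stores an approximated length.

```python
K1 = 1.2   # BM25 term-frequency saturation (Elasticsearch default)
B = 0.75   # BM25 field-length normalization strength (Elasticsearch default)


def term_frequency_part(tf, field_len, avg_len, norms_enabled=True):
    """Term-frequency component of BM25 for one term in one field.

    With norms disabled there is no stored field length, so the length
    factor drops out and only the number of occurrences (tf) matters.
    """
    if norms_enabled:
        length_factor = 1 - B + B * field_len / avg_len
    else:
        length_factor = 1.0
    return tf * (K1 + 1) / (tf + K1 * length_factor)


if __name__ == "__main__":
    # Same single occurrence, short field vs. long field.
    print(term_frequency_part(1, 4, 4), term_frequency_part(1, 8, 4))
    # With norms disabled, field length no longer changes the result.
    print(term_frequency_part(1, 4, 4, norms_enabled=False),
          term_frequency_part(1, 8, 4, norms_enabled=False))
```

With norms enabled, a single occurrence in a longer-than-average field scores lower than the same occurrence in an average-length field. With norms disabled, the two are equal and only additional occurrences raise the score.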
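Second, since the query runs against a small number of documents, and assuming the index has more than one shard, it is better to run the query with search_type=dfs_query_then_fetch, because the local IDF of each shard will differ a lot. (read this)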
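Third, add to the last query the part we still need, namely the TF-IDF weight. As written, the last query ranks documents exactly the same whether the search term occurs 2 or 3 times in the same field.
We can add a bool-should block whose score is added to the score of the constant_score blocks, like this: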
    GET test_index_v1/_search?search_type=dfs_query_then_fetch
    {
      "query": {
        "bool": {
          "should": [
            {
              "constant_score": {
                "query": {
                  "match": {
                    "pagetitle": {
                      "query": "Lösungen"
                    }
                  }
                },
                "boost": 2
              }
            },
            {
              "constant_score": {
                "query": {
                  "match": {
                    "searchable_content": "Lösungen"
                  }
                }
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "match": {
                      "pagetitle": {
                        "query": "Lösungen",
                        "boost": 2
                      }
                    }
                  },
                  {
                    "match": {
                      "searchable_content": "Lösungen"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "highlight": {
        "fields": {
          "pagetitle": {},
          "searchable_content": {}
        }
      }
    }
    

This question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/46213773/
