node.js - match_phrase not working with the synonym token filter (genre expansion) in Elasticsearch

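Update:

After reading Richa's explanation and the recommended blog post, the problem appears to be solved, but I need more testing to confirm.

First, the synonym format should be changed as Richa suggested:

["green => khaki,green", "pet => cat,pet"]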

Then I had to specify both the index-time analyzer (analyzer) and search_analyzer in the index mapping:

  "mappings": {
    "properties": {
      "phone_case": {
        "type": "text",
        "norms": false,
        "analyzer": "standard",
        "search_analyzer": "lowercaseWhiteSpaceAnalyzer"
      }
    }
  }

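After adding these two properties to the mapping, I no longer need to pass analyzer in the query.

With these changes, genre expansion seems to work as expected in both "term" and "match_phrase" queries.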

Elasticsearch 7.2

Synonym data:
[ "khaki => khaki,green", "cat => cat,pet"]
Index settings:

{
    settings: {
        "analysis": {
            "char_filter": {
                "same_word": {
                    "type": "mapping",
                    "mappings": ["-=>", "&=>and"]
                },
            },
            "filter": {
                "my_stopwords": {
                    "type": "stop",
                    "stopwords": STOPWORD_FILE
                },
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [ "khaki => khaki,green", "cat => cat,pet"],
                    "tokenizer": "whitespace"
                },
            },
            "analyzer": {
                "lowercaseWhiteSpaceAnalyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip", "same_word"],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_stopwords",
                        "my_synonym",
                    ]
                },
            }
        }
    }
}

Mapping of the phone_case field:

"phone_case":{"type":"text","norms":false,"analyzer":"lowercaseWhiteSpaceAnalyzer"}

Sample documents:

 [
  {
      id: "1",
      phone_case: "khaki,brushed and polished",
  },
  {
      id: "2",
      phone_case: "green,brushed",
  },
  {
      id: "3",
      phone_case: "black,matte"
  }
]

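The "phone_case" field is a text field.

When I search for khaki, I want to find only documents containing khaki, excluding any that contain green. On the other hand, when I search for green, I want to get documents containing green or khaki. This is what genre expansion is supposed to do.

A term-level query works for this purpose: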

{
  "sort": [
    {
      "updated": {
        "order": "desc"
      }
    }
  ],
  "size": 10,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "phone_case": "khaki"
        }
      }
    }
  }
}

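It returns only the document containing khaki.

But with match_phrase, it returns documents containing khaki or green. This is not what I expect. I want to get documents containing khaki, not green: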

{
  "sort": [
    {
      "updated": {
        "order": "desc"
      }
    }
  ],
  "size": 10,
  "from": 0,
  "query": {
    "match_phrase": {
      "phone_case": "khaki"
    }
  }
}

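Can anyone tell me what is wrong? Why can't the match query exclude results containing "green"? I want to let users search the text field for words in exact order, but neither match nor match_phrase works with genre-expansion synonyms.

Best answer

According to the Elastic documentation, when synonyms are defined as a => b,c, the rule is interpreted as follows: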

# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS.  These types of mappings
# ignore the expand parameter in the schema.

So in your case, with "khaki => khaki,green", the word khaki is replaced with khaki and green. You can verify this with the analyze API:

GET stack-57703209/_analyze
{
  "text": "khaki",
  "analyzer": "lowercaseWhiteSpaceAnalyzer"
}

This returns two tokens, khaki and green.

{
  "tokens" : [
    {
      "token" : "khaki",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "green",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

If you analyze green

GET stack-57703209/_analyze
{
  "text": "green",
  "analyzer": "lowercaseWhiteSpaceAnalyzer"
}

you get only one token, green.

{
  "tokens" : [
    {
      "token" : "green",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
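
To make the direction of the rule concrete, here is a minimal sketch in plain Node.js that mimics how an explicit mapping behaves. It is only an illustration under simplifying assumptions: parseSynonymRules and applySynonyms are hypothetical helper names, only single-token left-hand sides are handled, and this is not how Lucene implements the filter internally.

```javascript
// Simplified model of explicit synonym rules of the form "lhs => rhs1,rhs2".
// Hypothetical helpers for illustration; not Lucene's actual synonym filter.
function parseSynonymRules(rules) {
  const table = new Map();
  for (const rule of rules) {
    const [lhs, rhs] = rule.split("=>").map((part) => part.trim());
    table.set(lhs, rhs.split(",").map((term) => term.trim()));
  }
  return table;
}

// Replace each token with the alternatives on the right-hand side of its rule,
// or keep the token unchanged when no rule has it on the left-hand side.
function applySynonyms(tokens, table) {
  return tokens.flatMap((token) => (table.has(token) ? table.get(token) : [token]));
}

const table = parseSynonymRules(["khaki => khaki,green", "cat => cat,pet"]);

console.log(applySynonyms(["khaki"], table)); // [ 'khaki', 'green' ]
console.log(applySynonyms(["green"], table)); // [ 'green' ]
```

Only the token on the left of "=>" is rewritten, which is why khaki expands and green does not.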

According to your question you want the opposite, so ideally the synonym rule should look like

"green => khaki,green"
 not "khaki => khaki,green"

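Apart from this, you are also applying this analyzer at index time. So when your documents are indexed, the word khaki is replaced with khaki and green, as we saw above with the analyze API.

When you run a Term Query, it searches for the exact term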

{
  "sort": [
    {
      "updated": {
        "order": "desc"
      }
    }
  ],
  "size": 10,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "phone_case": "khaki"
        }
      }
    }
  }
}

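If you search for khaki, you get only the first document, because a Term Query does not apply any search analyzer and matches the exact term, so it looks for the token khaki. The second document, phone_case: "green,brushed", has no khaki token (you can check this with the analyze API), so it is not returned.

However, a Match Query by default applies the same analyzer that was used at index time, in your case lowercaseWhiteSpaceAnalyzer, so the search text khaki is itself expanded to khaki and green. That is why both documents are returned.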
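
The difference between the two queries can be reproduced with a small in-memory model. This is a rough sketch, not Elasticsearch itself: analyze, termQuery and matchQuery are hypothetical names, the tokenizer is approximated by splitting on non-alphanumeric characters, and the stop word and char filters are left out because STOPWORD_FILE is not shown in the question.

```javascript
// Rules from the question: only khaki and cat appear on the left-hand side.
const SYNONYMS = new Map([
  ["khaki", ["khaki", "green"]],
  ["cat", ["cat", "pet"]],
]);

// Rough stand-in for lowercaseWhiteSpaceAnalyzer (assumption: stop words and
// char filters omitted): lowercase, split, then expand synonyms.
function analyze(text) {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return tokens.flatMap((t) => (SYNONYMS.has(t) ? SYNONYMS.get(t) : [t]));
}

const docs = [
  { id: "1", phone_case: "khaki,brushed and polished" },
  { id: "2", phone_case: "green,brushed" },
  { id: "3", phone_case: "black,matte" },
];

// Index time: every document is analyzed, so doc 1 stores khaki AND green.
const index = docs.map((doc) => ({ id: doc.id, tokens: new Set(analyze(doc.phone_case)) }));

// term query: the search text is used as-is, without analysis.
function termQuery(term) {
  return index.filter((d) => d.tokens.has(term)).map((d) => d.id);
}

// match / one-word match_phrase: the search text is analyzed too, and any
// token produced for it may match.
function matchQuery(text) {
  const queryTokens = analyze(text);
  return index.filter((d) => queryTokens.some((t) => d.tokens.has(t))).map((d) => d.id);
}

console.log(termQuery("khaki"));  // [ '1' ]
console.log(matchQuery("khaki")); // [ '1', '2' ]
```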

So, for your requirement, you need to apply the synonyms at search time rather than at index time. You can change the index settings to

{
  "settings": {
    "analysis": {
      "char_filter": {
        "same_word": {
          "type": "mapping",
          "mappings": [
            "-=>",
            "&=>and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": "a, an"
        },
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "green => khaki,green",      //NOTE THIS
            "cat => cat,pet"
          ],
          "tokenizer": "whitespace"
        }
      },
      "analyzer": {
        "lowercaseWhiteSpaceAnalyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "same_word"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        },
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_synonym"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phone_case": {
        "type": "text",
        "norms": false,
        "analyzer": "lowercaseWhiteSpaceAnalyzer"
      }
    }
  }
}

Then specify the search analyzer in the query instead, like

{
    "query": {
        "match_phrase": {
            "phone_case" : {
                "query" : "green",
                "analyzer" : "synonym_analyzer"  // NOTE THIS

            }
        }
    }
}
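
As a sanity check of the idea, here is a minimal in-memory sketch of the corrected setup: documents are indexed without synonyms, and the rule "green => khaki,green" is applied only to the search text. The function names are hypothetical, the tokenizer is a crude approximation, and lowercasing is applied on both sides for simplicity, so treat it as a model of the behaviour rather than of Elasticsearch internals.

```javascript
// Search-time rules only: green expands to khaki and green, khaki is untouched.
const SEARCH_SYNONYMS = new Map([
  ["green", ["khaki", "green"]],
  ["cat", ["cat", "pet"]],
]);

// Crude tokenizer (assumption): lowercase and split on non-alphanumerics.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Index-time analysis: no synonym expansion at all.
function analyzeForIndex(text) {
  return tokenize(text);
}

// Search-time analysis: expand synonyms in the query text only.
function analyzeForSearch(text) {
  return tokenize(text).flatMap((t) => (SEARCH_SYNONYMS.has(t) ? SEARCH_SYNONYMS.get(t) : [t]));
}

const docs = [
  { id: "1", phone_case: "khaki,brushed and polished" },
  { id: "2", phone_case: "green,brushed" },
  { id: "3", phone_case: "black,matte" },
];

const index = docs.map((doc) => ({ id: doc.id, tokens: new Set(analyzeForIndex(doc.phone_case)) }));

// One-word search: a document matches if it contains any expanded query token.
function search(text) {
  const queryTokens = analyzeForSearch(text);
  return index.filter((d) => queryTokens.some((t) => d.tokens.has(t))).map((d) => d.id);
}

console.log(search("khaki")); // [ '1' ]
console.log(search("green")); // [ '1', '2' ]
```

Searching khaki now finds only the khaki document, while searching green finds both, which is the genre expansion described in the question.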

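This blog post explains it in more detail.
Hope this helps!

This Q&A about "node.js - match_phrase not working with the synonym token filter (genre expansion) in Elasticsearch" is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/57703209/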
