elasticsearch - How to check the tokens generated by different tokenizers in Elasticsearch

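I have been using different types of tokenizers for testing and demonstration purposes. I need to check how a particular text field is tokenized by each tokenizer, and I also need to see the tokens that are generated.

How can I do that?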

Best answer

You can use the _analyze endpoint for this purpose.

For example, using the standard analyzer, you can analyze the text "this is a test" like this:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'

This produces the following tokens:

{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
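When comparing several tokenizers, it can help to reduce each response to just its token strings. The following is a minimal sketch, assuming the response has the same shape as the output above; the function name token_strings and the embedded sample are illustrative and not part of Elasticsearch.

```python
import json


def token_strings(response_text):
    """Return the token strings from an _analyze response, ordered by position."""
    body = json.loads(response_text)
    # Sort by position so the list follows the order of the original text.
    tokens = sorted(body["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


# Sample response copied from the standard analyzer output shown above.
sample = """
{
  "tokens": [
    {"token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 1},
    {"token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 2},
    {"token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 3},
    {"token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 4}
  ]
}
"""

print(token_strings(sample))  # ['this', 'is', 'a', 'test']
```

Saving the output of each curl call and passing it through a helper like this makes the differences between analyzers easy to see at a glance.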

Of course, you can use any of the existing analyzers. You can also specify a tokenizer with the tokenizer parameter, token filters with the token_filters parameter, and character filters with the char_filters parameter. For example, analyzing the HTML snippet THIS is a <b>TEST</b> with the keyword tokenizer, the lowercase token filter, and the html_strip character filter produces a single lowercase token without the HTML markup:

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'

{
  "tokens" : [ {
    "token" : "this is a test",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 1
  } ]
}
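To try several tokenizers against the same text, the request URLs can be generated rather than typed by hand. This is a minimal sketch, assuming the query-string form of _analyze used in this answer and the default localhost:9200 address; the helper name analyze_url is hypothetical, and newer Elasticsearch releases may expect these settings in a JSON request body instead.

```python
from urllib.parse import urlencode


def analyze_url(host="localhost:9200", **params):
    """Build an _analyze URL from keyword arguments such as tokenizer=...

    Hypothetical helper: it only assembles the query string used in the
    curl examples and does not contact the server.
    """
    return f"http://{host}/_analyze?{urlencode(params)}"


# Same parameters as the curl example above.
print(analyze_url(tokenizer="keyword",
                  token_filters="lowercase",
                  char_filters="html_strip"))

# One URL per built-in tokenizer to compare.
for name in ("standard", "whitespace", "keyword"):
    print(analyze_url(tokenizer=name))
```

Each printed URL can be passed to curl with -d and the text to analyze, exactly as in the commands above.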

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/30930428/
