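elasticsearch - How to tokenize Roman numeral terms in Elasticsearch?

Tags: elasticsearch lucene tokenize elasticsearch-analyzers

When I create a tokenizer by registering token characters as shown below, I cannot register the Roman numeral "Ⅹ". (Tested on ES 6.7 and ES 5.6.)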

      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 14,
          "token_chars": [
            "Ⅹ"
          ]
        }
      }

The error log looks like this:

{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node02][192.168.115.x:9300][indices:admin/create]"}],"type":"illegal_argument_exception","reason":"Unknown token type: 'ⅹ', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]"},"status":400}



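How can I tokenize Roman numerals as terms?

Best answer

The error message clearly states that your Roman numeral Ⅹ is not a valid token type. It also lists the valid token types: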

must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]
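To see where Ⅹ falls among those classes, here is a minimal Python sketch that looks up the character's Unicode general category. The CATEGORY_NAMES mapping is an assumption on my part: it pairs a few Unicode category codes with the spellings used in the error message, and it is not taken from the Elasticsearch source.

```python
import unicodedata

# The character from the question's token_chars entry (U+2169).
roman_ten = "\u2169"

# Assumption: these names mirror Unicode general categories as they are
# spelled in the error message; only the codes needed here are mapped.
CATEGORY_NAMES = {
    "Lu": "uppercase_letter",
    "Ll": "lowercase_letter",
    "Nd": "decimal_digit_number",
    "Nl": "letter_number",
    "No": "other_number",
}


def describe(ch: str) -> str:
    """Return the error-message-style class name for one character."""
    code = unicodedata.category(ch)
    return CATEGORY_NAMES.get(code, code)


print(unicodedata.name(roman_ten))  # official Unicode name of the character
print(describe(roman_ten))          # class of the Roman numeral
print(describe("X"))                # class of the plain Latin capital X

# The error message quotes the lowercase form of the character (U+2179).
print(roman_ten.lower() == "\u2179")
```

The point is only that token_chars expects the name of a character class, not a literal character, so the entry has to be one of the names in the list.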



The problem is in your syntax. The official ES documentation for the edge n-gram tokenizer, https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html, explains what token_chars means:

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).
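As a rough illustration of that description, the following Python sketch imitates the behavior: it splits the input on characters outside the chosen class and emits the leading n-grams of each remaining run. This is a simplified simulation, not Lucene's implementation; the edge_ngrams helper is hypothetical, and Python's str.isalpha only stands in for the letter class.

```python
import itertools


def edge_ngrams(text, keep, min_gram=1, max_gram=14):
    """Split text on characters rejected by `keep`, then emit edge n-grams."""
    grams = []
    # groupby yields alternating runs of kept and rejected characters.
    for is_kept, group in itertools.groupby(text, key=keep):
        if not is_kept:
            continue  # rejected characters act as separators
        word = "".join(group)
        # Prefixes of length min_gram up to max_gram (or the word length).
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:n])
    return grams


# str.isalpha is a stand-in for the "letter" class (an assumption).
print(edge_ngrams("Quick Fox", str.isalpha))

# An empty token_chars list means "keep all characters".
print(edge_ngrams("a b", lambda ch: True, max_gram=3))
```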



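Below that description, the documentation lists the valid values, such as letter and digit, and the same page includes examples that use token_chars with valid values.

Replacing Ⅹ with letter in your analyzer settings should resolve the error.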
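A minimal sketch of the corrected fragment, built in Python so the token_chars entries can be checked against the list from the error message before the settings are sent. The build_tokenizer helper is hypothetical, and the check is only a local convenience, not part of Elasticsearch.

```python
import json

# Class names copied from the "must be one of [...]" list in the error message.
VALID_TOKEN_CHARS = set("""
symbol private_use paragraph_separator start_punctuation unassigned
enclosing_mark connector_punctuation letter_number other_number math_symbol
lowercase_letter space_separator surrogate initial_quote_punctuation
decimal_digit_number digit other_punctuation dash_punctuation currency_symbol
non_spacing_mark format modifier_letter control uppercase_letter other_symbol
end_punctuation modifier_symbol other_letter line_separator titlecase_letter
letter punctuation combining_spacing_mark final_quote_punctuation whitespace
""".split())


def build_tokenizer(token_chars):
    """Return the tokenizer fragment, rejecting unknown class names."""
    unknown = [c for c in token_chars if c not in VALID_TOKEN_CHARS]
    if unknown:
        raise ValueError(f"Unknown token_chars entries: {unknown}")
    return {
        "tokenizer": {
            "autocomplete": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 14,
                "token_chars": list(token_chars),
            }
        }
    }


# The fix suggested in the answer: a class name instead of a literal character.
settings = build_tokenizer(["letter"])
print(json.dumps(settings, indent=2))
```

If Ⅹ itself still ends up split out of the tokens with letter alone, letter_number from the same list may be worth trying, since the Unicode general category of Ⅹ is Nl. That is an untested assumption here, so check the result against your own index.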

Regarding "elasticsearch - How to tokenize Roman numeral terms in Elasticsearch?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60201909/
