Regexp "starts with" not working in Elasticsearch 6.*

Tags: regex elasticsearch lucene

I'm having trouble understanding how regular expressions work in Elasticsearch. I have documents that represent property units:

{
    "Unit" :
    {
         "DailyAvailablity" : 
         "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    }
}

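The DailyAvailability field encodes the property's availability day by day for the next two years, starting from today. 'A' means available, 'U' means unavailable, 'I' means available for check-in, and 'O' means available for check-out. How can I write a regexp filter to get all units that are available on given dates?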

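I tried to find a substring of 'A's with a specific length and offset in the DailyAvailability field. For example, to find units that are available for 7 days, starting 7 days from today: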

{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
        }
      ]
    }
  }
}

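This query returns, for example, units whose DailyAvailability starts with "UUUUUUUUUUUUUUUUUUUIAA" but which contain a suitable sequence somewhere inside the field. How do I anchor the regexp to the whole source string? The ES docs say that Lucene regular expressions should be anchored by default.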

P.S. I have tried '^.{7}a{7}.*$'. It returns an empty set.

Best answer

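It looks like you are using the text datatype to store Unit.DailyAvailability (which is also the default for strings if you are using dynamic mapping). You should consider using the keyword datatype instead.

Let me explain in a bit more detail.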

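Why does my regex match something in the middle of a text field?

What happens with the text datatype is that the data gets analyzed for full-text search. It undergoes some transformations, such as lowercasing and splitting into tokens.

Let's try the Analyze API against your input: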

POST _analyze
{
  "text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}

The response is:

{
  "tokens": [
    {
      "token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 255,
      "end_offset": 510,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 510,
      "end_offset": 732,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

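As you can see, Elasticsearch has split your input into three tokens and lowercased them. This may look unexpected, but it makes sense once you consider that the analyzer is trying to facilitate searching for words in human language, and no words are that long. (The standard tokenizer splits any token longer than its max_token_length, 255 characters by default, which is why the offsets above break at 255 and 510.)

That is why the regexp query ".{7}a{7}.*" now matches: one of the tokens actually starts with a long run of a's, and this is the expected behavior of the regexp query.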

...Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.
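To make this concrete, here is a minimal Python sketch that imitates what happens to a long single-word value. It is only a rough stand-in for the analyzer (lowercase, then cut into 255-character chunks), and Python's `re.fullmatch` is used as an approximation of Lucene's whole-term matching. The function names and the sample value are made up for illustration.

```python
import re

# Default max_token_length of the standard tokenizer.
MAX_TOKEN_LENGTH = 255


def simulate_standard_analysis(value):
    """Rough imitation of analysis for one unbroken alphanumeric word:
    lowercase it, then cut it into chunks of at most 255 characters."""
    lowered = value.lower()
    return [
        lowered[i:i + MAX_TOKEN_LENGTH]
        for i in range(0, len(lowered), MAX_TOKEN_LENGTH)
    ]


def regexp_matches_any_term(pattern, terms):
    """A regexp query hits the document if any single term matches fully."""
    return any(re.fullmatch(pattern, term) for term in terms)


# Hypothetical value: unavailable for 255 days, then 20 available days.
value = "U" * 255 + "A" * 20 + "U" * 5

terms = simulate_standard_analysis(value)

# text field: the second term starts with a run of a's, so the query hits.
text_field_hit = regexp_matches_any_term(".{7}a{7}.*", terms)

# keyword field: the only term is the original string, so there is no hit.
keyword_field_hit = regexp_matches_any_term(".{7}A{7}.*", [value])

print(len(terms), text_field_hit, keyword_field_hit)
```

The value is unavailable on days 7 to 13, yet the simulated text field still matches, because the pattern is checked against the second chunk rather than against the whole string.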

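How do I make the regexp query consider the entire string?

It's simple: don't apply an analyzer. The keyword type stores the string you provide as is.

With a mapping like this: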

PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

you will be able to run a query like this, which matches the document from the post:

POST my_regexes/doc/_search
{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*"  }
        }
      ]
    }
  }
}

Note that the query becomes case-sensitive, because the field is not analyzed.

This regexp will no longer return any results: ".{12}a{7}.*"

This one will: ".{12}A{7}.*"
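If you need such patterns for arbitrary dates, you could generate them. Below is a small Python sketch, assuming the keyword mapping above; `availability_pattern` and `build_regexp_filter` are hypothetical helper names, the sample string is a shortened value modelled on the one in the post, and `re.fullmatch` only approximates how Lucene matches a whole term.

```python
import re


def availability_pattern(offset_days, nights):
    """Pattern for 'available for `nights` days, starting `offset_days`
    days from today'. Uses an uppercase 'A' because keyword fields are
    not lowercased."""
    return f".{{{offset_days}}}A{{{nights}}}.*"


def build_regexp_filter(field, offset_days, nights):
    """Request body with the same shape as the query in the post."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"regexp": {field: availability_pattern(offset_days, nights)}}
                ]
            }
        }
    }


# Shortened, hypothetical availability string: 12 days, then 17 'A' days.
sample = "UIAO" + "U" * 7 + "I" + "A" * 17 + "O" + "U" * 4

pattern = availability_pattern(12, 7)
matches_uppercase = re.fullmatch(pattern, sample) is not None
matches_lowercase = re.fullmatch(".{12}a{7}.*", sample) is not None
matches_offset_7 = re.fullmatch(availability_pattern(7, 7), sample) is not None

body = build_regexp_filter("Unit.DailyAvailablity", 12, 7)
print(pattern, matches_uppercase, matches_lowercase, matches_offset_7)
```

The resulting `body` can be serialized with `json.dumps` and sent to the `_search` endpoint with whatever client you use.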

What about anchoring?

Regular expressions are anchored:

Lucene’s patterns are always anchored. The pattern provided must match the entire string.

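The anchoring most likely appeared to be broken because the value was split into several tokens in the analyzed text field, and the regexp was anchored to each token rather than to the original string. As for the '^' and '$' in your second attempt: as far as I know, Lucene's regexp syntax has no anchor operators, so they do not act as anchors there, which would explain the empty result.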
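The practical consequence of "always anchored" is that you have to add `.*` yourself wherever you do not care about position. Here is a short Python sketch of that semantics, using `re.fullmatch` as a stand-in for Lucene's whole-term matching; the helper name and the short availability string are hypothetical.

```python
import re


def lucene_style_match(pattern, term):
    """Approximates Lucene's behavior: the pattern must cover the whole term."""
    return re.fullmatch(pattern, term) is not None


# Hypothetical short value: 5 days before a run of 7 available days.
term = "UUUUI" + "A" * 7 + "OUU"

# Anchored at the start: fails, the term does not begin with seven A's.
starts_with_run = lucene_style_match("A{7}.*", term)

# Explicit leading and trailing '.*': matches a run anywhere in the term.
run_anywhere = lucene_style_match(".*A{7}.*", term)

# Exact offset: matches only if the run begins after exactly 5 characters.
run_at_offset_5 = lucene_style_match(".{5}A{7}.*", term)

# For contrast, an unanchored search finds the run without any '.*'.
unanchored_search = re.search("A{7}", term) is not None

print(starts_with_run, run_anywhere, run_at_offset_5, unanchored_search)
```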

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50717706/
