elasticsearch - Elasticsearch 6.8 match_phrase search with an N-gram tokenizer does not work well

Tags: elasticsearch tokenize n-gram match-phrase

I am using the Elasticsearch N-gram tokenizer together with match_phrase to do fuzzy matching.
My index and test data are as follows:

DELETE /m8
PUT m8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 3,
          "custom_token_chars":"_."
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "table": {
      "properties": {
        "dataSourceId": {
          "type": "long"
        },
        "dataSourceType": {
          "type": "integer"
        },
        "dbName": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}


PUT /m8/table/1
{
  "dataSourceId":1,
  "dataSourceType":2,
  "dbName":"rm.rf"
}

PUT /m8/table/2
{
  "dataSourceId":1,
  "dataSourceType":2,
  "dbName":"rm_rf"
}
PUT /m8/table/3
{
  "dataSourceId":1,
  "dataSourceType":2,
  "dbName":"rmrf"
}
Checking with _analyze:
POST m8/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "rm.rf"
}
The _analyze result:
{
  "tokens" : [
    {
      "token" : "r",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rm",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "rm.",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "m",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "m.",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "m.r",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : ".",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : ".r",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : ".rf",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "r",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "rf",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "f",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 11
    }
  ]
}
When I search for "rm", nothing is found:
GET /m8/table/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "dbName": "rm"
          }
        }
      ]
    }
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
But ".rf" can be found:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.7260926,
    "hits" : [
      {
        "_index" : "m8",
        "_type" : "table",
        "_id" : "1",
        "_score" : 1.7260926,
        "_source" : {
          "dataSourceId" : 1,
          "dataSourceType" : 2,
          "dbName" : "rm.rf"
        }
      }
    ]
  }
}
My question:
Why is "rm" not found, even though _analyze shows it is produced as a token?

Best Answer

  • my_analyzer is also used at search time:
    "mappings": {
      "dbName": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer"  // <==== if you don't provide a search_analyzer, what you defined in "analyzer" is used at search time as well
      }
    }
  • A match_phrase query matches a phrase by taking the positions of the analyzed text into account. For example, searching for "Kal ho" matches documents whose analyzed text contains "Kal" at some position X and "ho" at position X + 1.
  • When you search for "rm" (#1), the query text is analyzed with my_analyzer too, so it becomes the n-grams "r", "rm", "m" at positions 0, 1, 2, and the phrase search requires those tokens at consecutive positions in the index. In the indexed "rm.rf" they sit at positions 0, 1 and 3 ("rm." occupies position 2), so the phrase cannot match. ".rf" happens to work because its query-side n-grams (".", ".r", ".rf", "r", "rf", "f") line up with the consecutive positions 6 through 11 of the indexed text. You can confirm the query-side tokens with _analyze; see the sketch after this list.
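
    To see the query-side tokens, run the search string through the same analyzer (this reuses the _analyze call from the question; nothing is assumed beyond the m8 index defined above):

    POST m8/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "rm"
    }

    This should return "r" at position 0, "rm" at position 1 and "m" at position 2; the indexed "rm.rf" has "m" at position 3, hence no phrase match.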

  • Solution:
  • Use the standard analyzer with a simple match query:
    GET /m8/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "dbName": {
                  "query": "rm",
                  "analyzer": "standard" // <=========
                }
              }
            }
          ]
        }
      }
    }
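
    This works because the standard analyzer leaves "rm" as a single token, which matches the indexed gram "rm" directly and without any position constraint. A quick check with the same _analyze API:

    POST m8/_analyze
    {
      "analyzer": "standard",
      "text": "rm"
    }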
    
    Or define the search analyzer in the mapping and use a match query (not match_phrase):
    "mappings": {
      "dbName": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard" // <==========
      }
    }

  • Follow-up question: why use a match_phrase query with an n-gram tokenizer in the first place?

    Regarding "elasticsearch - Elasticsearch 6.8 match_phrase search with an N-gram tokenizer does not work well", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64277914/
