elasticsearch - Searching with a comma-separated string in an Elasticsearch query and index

Tags: elasticsearch

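I am trying to search using a string that contains multiple comma-separated strings. [A term may not match the whole field value; it can be a partial value, and each passed term should appear in the text.]
Note: I have also tried n-grams, but they do not return the right data.
(Example: the search term "data science" returns everything matching "data", "science", and "data science".)
Document in ES: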

{
                "_index": "questions_dev",
                "_type": "_doc",
                "_id": "188",
                "_score": 6.6311107,
                "_source": {
                    "questionId": 188,
                    "questionText": "What other social media platforms do you use on your own time?",
                    "domainId": 2,
                    "subdomainId": 25,
                    "type": "TEXT",
                    "difficulty": 1,
                    "time": 600,
                    "domain": "Domain Specific",
                    "subdomain": "Social Media Specialist",
                    "skill": ["social media"]
                }
            }
What I have done so far:
Index:
{
     "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "custom_tokenizer",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "pattern",
                    "pattern": ","
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "questionId": {
                "type": "long"
            },
            "questionText": {
                "type": "text"
            },
            "domain": {
                "type": "text"
            },
            "subdomain": {
                "type": "text"
            },
            "type":{
                "type": "keyword"
            },
            "difficulty":{
                 "type": "keyword"
            },
            "totaltime":{
                 "type": "keyword"
            },
            "domainId":{
                 "type": "keyword"
            },
            "subdomainId":{
                 "type": "keyword"
            }
        }
    }
}
Query:
{
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "fields": ["questionText","skill"],
                        "query": "social media"
                    }
                }
            ]
        }
    }
}
Output:
{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}
Expected output:
{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 6.6311107,
        "hits": [
            {
                "_index": "questions_development",
                "_type": "_doc",
                "_id": "188",
                "_score": 6.6311107,
                "_source": {
                    "questionId": 188,
                    "questionText": "What other social media platforms do you use on your own time?",
                    "domainId": 2,
                    "subdomainId": 25,
                    "type": "TEXT",
                    "difficulty": 1,
                    "time": 600,
                    "domain": "Domain Specific",
                    "subdomain": "Social Media Specialist",
                    "skill": []
                }
            }
        ]
    }
}
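Goal:
Search with a string and return all documents that contain that string.
Example:
If I search with "social media", the document above should be returned.
(In my case, it is not returned.)
The search should also support a comma-separated search mechanism.
That is, I can pass "social media, own time" and expect the output to include documents whose questionText contains any one of these strings.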

Best answer

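You are indexing data such as social media, own time, which contains whitespace between the , and own time. As a result, your previous mapping generates the following tokens: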

{
"tokens": [
    {
        "token": " social media",  <-- note the preceding whitespace here
        "start_offset": 0,
        "end_offset": 12,
        "type": "word",
        "position": 0
    },
    {
        "token": " own time",   <-- note the preceding whitespace here
        "start_offset": 13,
        "end_offset": 22,
        "type": "word",
        "position": 1
    }
]
}
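So when your search query uses "query": "social media" (without the whitespace), it returns no search results. But if you query with "query": " social media" (with the leading whitespace), the results are there.
To remove the leading and trailing whitespace from each token in the stream, you can use the Trim token filter.
Adding a working example with index data, mapping, and search query.
Index mapping: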
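The effect of the comma pattern and the trim filter can be approximated outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch's own implementation: the `analyze` function and its `trim` parameter are hypothetical stand-ins that split on the comma pattern, lowercase each piece, and optionally strip surrounding whitespace.

```python
import re


def analyze(text, pattern=",", trim=False):
    """Rough stand-in for a pattern tokenizer plus lowercase (and optionally trim)."""
    # Split on the pattern and drop empty pieces.
    tokens = [piece for piece in re.split(pattern, text) if piece]
    # Lowercase filter.
    tokens = [token.lower() for token in tokens]
    # Trim filter: strip leading and trailing whitespace from every token.
    if trim:
        tokens = [token.strip() for token in tokens]
    return tokens


without_trim = analyze("social media, own time")
with_trim = analyze("social media, own time", trim=True)

# Without trim, the token after the comma keeps its leading space (' own time').
print(without_trim)
# With trim, that token becomes 'own time' and can match an untrimmed-free query.
print(with_trim)
```

In this sketch the query text "own time" only equals an indexed token once trimming is applied, which is the mismatch described above.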
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "custom_tokenizer",
          "filter": [
            "lowercase",
            "trim"                            <-- note this
          ]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "pattern",
          "pattern": ",",
          "filter": [
            "trim"                            <-- note this
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "questionText": {
        "type": "text"
      }
    }
  }
}
Index data:
{ "questionText": "social media" }
{ "questionText": "social media, own time" }
Search query:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "fields": [
              "questionText"
            ],
            "query": "own time"    <-- no whitespace included at the beginning
          }
        }
      ]
    }
  }
}
Search result:
"hits": [
  {
    "_index": "my-index",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.60996956,
    "_source": {
      "questionText": "social media, own time"
    }
  }
]
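The result above can be reproduced with a toy inverted index. This is a hedged sketch only: `analyze` and `search` are hypothetical helpers that imitate the comma pattern, lowercase, and trim steps, the document IDs "1" and "2" are assumed, and no scoring is modelled.

```python
import re
from collections import defaultdict


def analyze(text, trim):
    # Split on commas, drop empty pieces, lowercase, and optionally trim.
    tokens = [piece.lower() for piece in re.split(",", text) if piece]
    return [token.strip() for token in tokens] if trim else tokens


def search(docs, query, trim):
    # Build a token -> document IDs index using the same analysis for every document.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, trim):
            index[token].add(doc_id)
    # Analyze the query the same way and collect documents sharing any token.
    hits = set()
    for token in analyze(query, trim):
        hits |= index.get(token, set())
    return sorted(hits)


docs = {"1": "social media", "2": "social media, own time"}

hits_with_trim = search(docs, "own time", trim=True)
hits_without_trim = search(docs, "own time", trim=False)

print(hits_with_trim)     # only the second document contains the token "own time"
print(hits_without_trim)  # no match: the indexed token is " own time"
```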
Update 1:
Index settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
Index data:
{
    "questionText": "What other platforms do you use on your ?"
}
{
    "questionText": "What other social time platforms do you use on your?"
}
{
    "questionText": "What other social media platforms do you use on your?"
}
{
    "questionText": "What other platforms do you use on your own time?"
}
Search query:
{
"query": {
    "bool": {
        "should": [
            {
                "multi_match": {
                    "fields": "questionText",
                    "query": "social media, own time"
                }
            }
        ]
    }
    }
}
Search result:
"hits": [
  {
    "_index": "my-index3",
    "_type": "_doc",
    "_id": "1",
    "_score": 2.5628972,
    "_source": {
      "questionText": "What other social media platforms do you use on your own time?"
    }
  },
  {
    "_index": "my-index3",
    "_type": "_doc",
    "_id": "2",
    "_score": 1.3862944,
    "_source": {
      "questionText": "What other social media platforms do you use on your?"
    }
  },
  {
    "_index": "my-index3",
    "_type": "_doc",
    "_id": "3",
    "_score": 1.3862944,
    "_source": {
      "questionText": "What other platforms do you use on your own time?"
    }
  }
]
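A different way to approach the goal stated in the question (match documents whose questionText contains any one of the comma-separated strings) is to split the input on the client and send one phrase clause per term. This is not the method used in the answer above; it is a minimal sketch that assumes questionText is a text field analyzed word by word (for example with the standard analyzer), and `build_any_phrase_query` is a hypothetical helper name.

```python
import json


def build_any_phrase_query(raw_terms, field="questionText"):
    """Build a bool query that matches documents containing any of the phrases."""
    # Split on commas and discard surrounding whitespace and empty entries.
    phrases = [part.strip() for part in raw_terms.split(",") if part.strip()]
    return {
        "query": {
            "bool": {
                # One match_phrase clause per comma-separated phrase.
                "should": [{"match_phrase": {field: phrase}} for phrase in phrases],
                # At least one phrase has to match.
                "minimum_should_match": 1,
            }
        }
    }


body = build_any_phrase_query("social media, own time")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request. Documents matching more phrases would be expected to score higher, similar to the ordering in the result above.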

For more on elasticsearch - searching with a comma-separated string in an Elasticsearch query and index, see the similar question found on Stack Overflow: https://stackoverflow.com/questions/63056802/
