elasticsearch - Searching with comma-separated strings in Elasticsearch queries and indexes

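I am trying to search using a string that contains multiple comma-separated strings. [It may not match the entire field value; it can be a partial value, as long as the passed item appears in the text.]
Note: I also tried n-gram, but it does not return the correct data.
(For example, the search term "data science" returns everything containing "data", "science", or "data science".)
Document in ES: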

{
                "_index": "questions_dev",
                "_type": "_doc",
                "_id": "188",
                "_score": 6.6311107,
                "_source": {
                    "questionId": 188,
                    "questionText": "What other social media platforms do you use on your own time?",
                    "domainId": 2,
                    "subdomainId": 25,
                    "type": "TEXT",
                    "difficulty": 1,
                    "time": 600,
                    "domain": "Domain Specific",
                    "subdomain": "Social Media Specialist",
                    "skill": ["social media"]
                }
            }

What I have done so far:
Index:

{
     "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "custom_tokenizer",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "custom_tokenizer": {
                    "type": "pattern",
                    "pattern": ","
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "questionId": {
                "type": "long"
            },
            "questionText": {
                "type": "text"
            },
            "domain": {
                "type": "text"
            },
            "subdomain": {
                "type": "text"
            },
            "type":{
                "type": "keyword"
            },
            "difficulty":{
                 "type": "keyword"
            },
            "totaltime":{
                 "type": "keyword"
            },
            "domainId":{
                 "type": "keyword"
            },
            "subdomainId":{
                 "type": "keyword"
            }
        }
    }
}

Query:

{
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "fields": ["questionText","skill"],
                        "query": "social media"
                    }
                }
            ]
        }
    }
}

Output:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

Expected output:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 6.6311107,
        "hits": [
            {
                "_index": "questions_development",
                "_type": "_doc",
                "_id": "188",
                "_score": 6.6311107,
                "_source": {
                    "questionId": 188,
                    "questionText": "What other social media platforms do you use on your own time?",
                    "domainId": 2,
                    "subdomainId": 25,
                    "type": "TEXT",
                    "difficulty": 1,
                    "time": 600,
                    "domain": "Domain Specific",
                    "subdomain": "Social Media Specialist",
                    "skill": []
                }
            }
        ]
    }
}

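Goal:
Search with a string and get back every document that contains that string.
Example:
If I search with "social media", the document above should be returned.
(In my case, it is not returned.)
The search should also support a comma-separated search mechanism.
That is, I can pass "social media, own time" and expect the output to contain documents whose questionTexts include any one of these strings.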

Best answer

The data you are indexing, social media, own time, contains a space between , and own time. Your previous mapping splits only on the comma, so the second token keeps that leading space:

{
  "tokens": [
    {
      "token": "social media",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 0
    },
    {
      "token": " own time",   <-- note the preceding whitespace here
      "start_offset": 13,
      "end_offset": 22,
      "type": "word",
      "position": 1
    }
  ]
}

So when the search query uses "query": "own time" (no leading space), it does not match the indexed token and no results are returned. If you instead query with "query": " own time" (with the leading space), the result is found.
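
The effect of splitting only on the comma can be reproduced outside Elasticsearch. The following is a minimal Python sketch, assuming the pattern tokenizer behaves like a regex split that drops empty pieces, followed by lowercasing. The function name pattern_analyze is hypothetical, and this only approximates what Elasticsearch does internally.

```python
import re

def pattern_analyze(text, pattern=","):
    """Approximate a pattern tokenizer followed by a lowercase filter."""
    # Split on the pattern, skip empty pieces, lowercase what remains.
    return [piece.lower() for piece in re.split(pattern, text) if piece]

indexed = pattern_analyze("social media, own time")
print(indexed)  # ['social media', ' own time']

# The query string goes through the same analysis before it is compared.
print(pattern_analyze("own time")[0] in indexed)   # False: no leading space
print(pattern_analyze(" own time")[0] in indexed)  # True: leading space kept
```

The comparison is exact per token, which is why a single leading space is enough to prevent a match.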
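To remove leading and trailing whitespace from each token in the stream, you can use the Trim Token filter.
Adding a working example with index data, mapping, and search query.
Index mapping: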

{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "custom_tokenizer",
          "filter": [
            "lowercase",
            "trim"                            <-- note this
          ]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "questionText": {
        "type": "text"
      }
    }
  }
}
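
To see what the trim filter changes, the analysis chain in this mapping can be approximated in a few lines of Python. This is a sketch under the assumption that the chain is a regex split on the comma, then lowercase, then stripping whitespace from both ends of each token. The name analyze_with_trim is hypothetical and not part of any Elasticsearch API.

```python
import re

def analyze_with_trim(text, pattern=","):
    """Approximate: pattern tokenizer -> lowercase filter -> trim filter."""
    pieces = [piece for piece in re.split(pattern, text) if piece]
    # lower() stands in for the lowercase filter, strip() for the trim filter.
    return [piece.lower().strip() for piece in pieces]

indexed = analyze_with_trim("social media, own time")
print(indexed)  # ['social media', 'own time']

# A query without a leading space now matches the indexed token.
print(analyze_with_trim("own time")[0] in indexed)  # True
```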

Index data:

{ "questionText": "social media" }
{ "questionText": "social media, own time" }

Search query:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "fields": [
              "questionText"
            ],
            "query": "own time"    <-- no leading whitespace
          }
        }
      ]
    }
  }
}

Search results:

"hits": [
  {
    "_index": "my-index",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.60996956,
    "_source": {
      "questionText": "social media, own time"
    }
  }
]

Update 1:
Index settings:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

Index data:

{
    "questionText": "What other platforms do you use on your ?"
}
{
    "questionText": "What other social time platforms do you use on your?"
}
{
    "questionText": "What other social media platforms do you use on your?"
}
{
    "questionText": "What other platforms do you use on your own time?"
}

Search query:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "fields": "questionText",
            "query": "social media, own time"
          }
        }
      ]
    }
  }
}

Search results:

"hits": [
  {
    "_index": "my-index3",
    "_type": "_doc",
    "_id": "1",
    "_score": 2.5628972,
    "_source": {
      "questionText": "What other social media platforms do you use on your own time?"
    }
  },
  {
    "_index": "my-index3",
    "_type": "_doc",
    "_id": "2",
    "_score": 1.3862944,
    "_source": {
      "questionText": "What other social media platforms do you use on your?"
    }
  },
  {
    "_index": "my-index3",
    "_type": "_doc",
    "_id": "3",
    "_score": 1.3862944,
    "_source": {
      "questionText": "What other platforms do you use on your own time?"
    }
  }
]
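
If each comma-separated item should match as a whole phrase rather than word by word, one option (not part of the original answer) is to split the string in the client and send one phrase clause per item. The sketch below only builds the request body as a Python dict. It assumes questionText is analyzed word by word, for example by the standard analyzer, and the helper name build_phrase_query is hypothetical.

```python
import json

def build_phrase_query(csv_terms, fields):
    """Build a bool/should query with one phrase clause per comma-separated item."""
    # Split on commas and discard surrounding whitespace and empty items.
    terms = [t.strip() for t in csv_terms.split(",") if t.strip()]
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {"query": term, "fields": fields, "type": "phrase"}}
                    for term in terms
                ],
                # A document qualifies if at least one phrase matches.
                "minimum_should_match": 1,
            }
        }
    }

body = build_phrase_query("social media, own time", ["questionText"])
print(json.dumps(body, indent=2))
```

The printed body can be sent to the _search endpoint with any client. Documents matching more phrases score higher because each should clause contributes to the score.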

Regarding elasticsearch - Searching with comma-separated strings in Elasticsearch queries and indexes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63056802/
