I am trying to implement an Elasticsearch pattern_capture token filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I am (still) on Elasticsearch 2.4, but the documentation does not differ much from current ES versions.
I followed the example from the documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html
Here are my tests and their results:
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '{
  "analyzer": "process_number_analyzer",
  "text": "EDR-00002"
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '{
  "analyzer": "standard",
  "tokenizer": "standard",
  "filter": ["process_number_filter"],
  "text": "EDR-00002"
}'
which return:
{"acknowledged":true}
{
  "tokens": [{
    "token": "EDR",
    "start_offset": 0,
    "end_offset": 3,
    "type": "word",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "word",
    "position": 1
  }]
}
{
  "tokens": [{
    "token": "edr",
    "start_offset": 0,
    "end_offset": 3,
    "type": "<ALPHANUM>",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "<NUM>",
    "position": 1
  }]
}
And just to make sure my regular expression is correct:
>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")
>>> m.groups()
('EDR-00004', '00004', '4')
Accepted answer
I hate answering my own question, but I found the answer, and maybe it can help people in the future.
My problem was the default tokenizer, which split the text before passing it to my filter. By adding my own tokenizer that overrides the default split pattern "\W+" with "[^\\w-]+", my filter received the whole word and produced the correct tokens.
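The fix hinges entirely on which characters the pattern tokenizer splits on. A minimal sketch using Python's `re` module (the same regex semantics, independent of Elasticsearch internals) shows the difference between the default split pattern `\W+` and the override `[^\w-]+`:

```python
import re

text = "EDR-00002"

# Default pattern tokenizer: \W+ splits on any non-word character,
# so the hyphen breaks the identifier apart before the filter runs.
print(re.split(r"\W+", text))      # ['EDR', '00002']

# Override: [^\w-]+ treats '-' as a word character, so the whole
# identifier survives as one token for pattern_capture to match.
print(re.split(r"[^\w-]+", text))  # ['EDR-00002']
```

Since the capture pattern `([A-Za-z]+-([0]+([0-9]+)))` needs to see the hyphenated identifier intact, only the second split behavior lets the filter fire.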
These are my custom settings now:
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "tokenizer": {
        "process_number_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w-]+"
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "process_number_tokenizer",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
which results in the following:
{
  "tokens": [
    {
      "token": "EDR-00002",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "00002",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
On the topic of "elasticsearch - cannot get the pattern_capture token filter to work", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42623381/