algorithm - Matching when a paragraph contains sentences from an Elasticsearch index

Tags: algorithm elasticsearch search text full-text-search

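I am using Elasticsearch to build a program that finds every place in a text where the Bible is quoted or a verse is mentioned.
I have indexed all the verses of the Bible in Elasticsearch, one document per verse.
When I search by typing part of a verse, I get the correct result (even when I make typos).
How can I scan a text to find every place that quotes a verse (even partially), so that each quote can be attributed to its source verse? The matching should also be fault tolerant (using the fuzziness parameter or, I suppose, synonyms).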

A sample of my index:

{"index":{"_index":"test","_type":"","_id":1}}
{"fields":{"year":3560,"book":"1","chapter":1,"section":1,"text":"others words consectetur adipiscing and others words"},"id":"test1","type":"add"}
{"index":{"_index":"test","_type":"","_id":2}}
{"fields":{"year":3560,"book":"2","chapter":3,"section":2,"text":"others words a sagittis nisl quam and others words"},"id":"test2","type":"add"}
{"index":{"_index":"test","_type":"","_id":3}}
{"fields":{"year":3560,"book":"3","chapter":1,"section":5,"text":"others words Aliquam ultrices auctor pharetra and others words"},"id":"test3","type":"add"}
{"index":{"_index":"test","_type":"","_id":4}}
{"fields":{"year":3560,"book":"4","chapter":2,"section":4,"text":"others words Proin ut vestibulum and others words"},"id":"test4","type":"add"}
{"index":{"_index":"test","_type":"","_id":5}}
{"fields":{"year":3560,"book":"5","chapter":1,"section":5,"text":"others words Aenean pretium tincidunt aliquet and others words"},"id":"test5","type":"add"}
{"index":{"_index":"test","_type":"","_id":6}}
{"fields":{"year":3560,"book":"6","chapter":2,"section":1,"text":"others words In vitae sagittis and others words"},"id":"test6","type":"add"}
{"index":{"_index":"test","_type":"","_id":7}}
{"fields":{"year":3560,"book":"7","chapter":7,"section":7,"text":"others words ligula laoreet pharetra and others words"},"id":"test7","type":"add"}
{"index":{"_index":"test","_type":"","_id":8}}
{"fields":{"year":3560,"book":"8","chapter":1,"section":4,"text":"others words luctus eros a pretium and others words"},"id":"test8","type":"add"}
{"index":{"_index":"test","_type":"","_id":9}}
{"fields":{"year":3560,"book":"9","chapter":1,"section":7,"text":"others words ullamcorper eu id quam and others words"},"id":"test9","type":"add"}
{"index":{"_index":"test","_type":"","_id":10}}
{"fields":{"year":3560,"book":"10","chapter":5,"section":4,"text":"others words Nullam ac enim ac lacus hendrerit and others words"},"id":"test10","type":"add"}

I need to find every passage in the following paragraph that occurs in the index, so that I can recover its source:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla rhoncus, nulla vitae porta euismod, purus nisl faucibus nunc, a sagittis nisl quam id arcu. Sed sit amet arcu sed dui auctor bibendum. Proin ut vestibulum sem, id rutrum felis. Phasellus sagittis justo sit amet justo consequat, id scelerisque eros cursus. Quisque dapibus finibus euismod. Proin dui urna, auctor ut gravida quis, fringilla quis velit. Donec sed pulvinar leo. Sed pulvinar pharetra arcu nec egestas. Mauris non dapibus diam. Pellentesque quis pellentesque libero. Aliquam ultrices auctor pharetra. Cras ullamcorper, odio sit amet aliquam convallis, magna nibh gravida nunc, sit amet volutpat elit purus eget lectus. Pellentesque eu est a risus euismod consequat. Duis id erat porttitor, sodales justo non, aliquet ex. Etiam tincidunt neque ut nisi commodo auctor. Sed congue urna ac tellus scelerisque hendrerit. Mauris lobortis sed dui ut varius. Proin ac luctus felis. In vitae sagittis erat, nec luctus sapien. Aenean pretium tincidunt aliquet. Morbi at enim vel ligula laoreet pharetra. Sed dignissim luctus eros a pretium. Vestibulum molestie molestie nisi, vitae scelerisque nibh bibendum nec. Donec laoreet sapien sed vehicula dictum. Nullam ac enim ac lacus hendrerit tempor et vitae neque. Quisque at leo pretium, efficitur augue vitae, congue eros. Maecenas volutpat ante nec scelerisque vestibulum. Donec tristique orci erat, nec imperdiet nulla commodo ut. Nam non odio vel quam cursus ullamcorper eu id quam. Duis volutpat, nisl eu interdum mattis, augue ipsum mollis leo, eget efficitur orci augue eget leo. Integer feugiat facilisis dolor ut vehicula. Maecenas quis feugiat massa. Curabitur feugiat odio eget ligula tincidunt sodales. Donec feugiat dapibus lectus, non maximus dui rhoncus vitae. Phasellus eget massa faucibus, tristique nibh sed, aliquet metus.



I am not sure I have been clear enough, so do not hesitate to ask if you need more detail.

I think this problem is handled by the Aho-Corasick algorithm, but I do not know how to integrate it with Elasticsearch.

Thanks!

Best answer

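If I understand your question correctly, what you are looking for is the ability to run

"some partial verses" : query

and get the source documents back from Elasticsearch, with the response showing where the searched verse text matched (this is what highlighting does).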

Here is the simplest query that achieves this:
GET <index_name>/_search
{
  "query": {
    "match": {
      "message": "partial verse"
    }
  },
  "highlight": {
    "fields": {
      "message": {}
    }
  }
}

In response, you will get something like this:
"hits" : [
      {
        "_index" : "testSample",
        "_type" : "_doc",
        "_id" : "TkdvGXAB5bHyIJQ-QRow",
        "_score" : 0.2876821,
        "_source" : {
          "bookName" : "bible",
          "message" : "this is a good book"
        },
        "highlight" : {
          "message" : [
            "<em>this</em> is a good book"
          ]
        }
      }
    ]

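The response is self-explanatory: each part of the result comes back in its own section, with the full document under _source and the matched fragments under highlight.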
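
To apply the same idea to a whole paragraph with some typo tolerance, one option is to split the paragraph into sentences and send one highlighted match query per sentence through the multi search endpoint (_msearch). The sketch below only builds the request body; it does not contact a cluster. The helper names (split_sentences, build_sentence_query, build_msearch_body) are hypothetical, the sentence splitter is deliberately naive, and the field name "fields.text" is an assumption based on the sample documents in the question. The fuzziness option is a standard parameter of the match query.

```python
import json
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def build_sentence_query(sentence, field="message", fuzziness="AUTO", size=3):
    """Search body for one sentence: fuzzy match query plus highlighting."""
    return {
        "size": size,  # keep only the few best-scoring verses per sentence
        "query": {
            "match": {
                field: {
                    "query": sentence,
                    "fuzziness": fuzziness,  # tolerate small spelling errors
                }
            }
        },
        "highlight": {"fields": {field: {}}},
    }


def build_msearch_body(paragraph, index, field="message"):
    """NDJSON body for _msearch: one header line and one query line per sentence."""
    lines = []
    for sentence in split_sentences(paragraph):
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps(build_sentence_query(sentence, field)))
    # The multi search API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = "Proin ut vestibulum sem, id rutrum felis. In vitae sagittis erat."
    print(build_msearch_body(text, index="test", field="fields.text"))
```

The printed body can be sent to POST /_msearch with the Content-Type application/x-ndjson. The responses come back in the same order as the sentences, so each sentence can be attributed to its top hit. Because a plain match query returns any verse that shares a single word with the sentence, a minimum _score threshold (to be tuned on real data) is probably needed before accepting a hit as a genuine quotation.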

Regarding "algorithm - Matching when a paragraph contains sentences from an Elasticsearch index", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60105781/
