elasticsearch - Elasticsearch postings highlighter returns too many sentences

Tags: elasticsearch lucene

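I have a problem with the postings highlighter. According to the documentation:
"...the postings highlighter ... outputs sentences regardless of their length."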

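So by setting "number_of_fragments" : 1 I should get back only one sentence. That is what happens about 90% of the time, but sometimes I get a very long text that is clearly more than one sentence. For example (the highlighted words are "river" and "polluted"):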

It is a collegiate body with an advisory and deliberative of the Integrated Water Resources Management - working on Unit Water Resources Management 10, built by the state, municipalities and civil society, equally. [ 2 ] This committee took the initiative of civil society and currently includes 34 municipalities, 18 were located in Sorocaba River basin and 16 situated in the sub-basin of the upper Middle Tietê. [ 3 ] It has been a very polluted river due to industrial activities, mining, sewage without treatment, etc.

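There are three sentences here, and the first two do not even contain a highlighted word.
I think there is a bug that makes the postings highlighter ignore a "." when it is followed by "[". I have noticed that this is the case in all of the bad highlighting results.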

Is this a known bug, or am I missing something?
Thanks.

Best answer

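I'm not sure I would consider this a bug, per se. Sentence boundaries are not as simple as splitting on periods (you would not want to break inside "3.14" or "Mr. Smith"), and they are often ambiguous.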
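To illustrate why splitting on periods is not enough, here is a minimal Java sketch. The class name NaiveSplitDemo and the helper naiveSplit are hypothetical names made up for this example; they are not part of Lucene or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows what goes wrong with period-only splitting.
class NaiveSplitDemo {

    // Splits on every '.' and trims the pieces; empty pieces are dropped.
    static List<String> naiveSplit(String text) {
        List<String> parts = new ArrayList<>();
        for (String piece : text.split("\\.")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // The abbreviation and the decimal number are both cut apart.
        System.out.println(naiveSplit("Mr. Smith measured 3.14 metres."));
        // prints [Mr, Smith measured 3, 14 metres]
    }
}
```

A single sentence comes back as three pieces, which is why real sentence detectors use more elaborate rules, and why those rules sometimes decline to break where a human reader would.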
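PostingsHighlighter uses java.text.BreakIterator to detect where to split sentences. I had thought BreakIterator's behavior was based on UAX #29, but the behavior seen here does not quite line up with that specification.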

So it may well be a bug in java.text.BreakIterator, or it may simply be how its algorithm works.
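One way to check this outside Elasticsearch is to run the text through java.text.BreakIterator directly. The following is only a sketch: the class name SentenceBreakDemo and the helper splitSentences are hypothetical, the sample is a shortened version of the text from the question, and Locale.ENGLISH is an assumption (the locale the highlighter actually uses may differ).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: prints the segments the JDK sentence iterator finds.
class SentenceBreakDemo {

    // Returns the segments between consecutive sentence boundaries.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Shortened sample based on the text in the question.
        String text = "It was built by the state, municipalities and civil society, equally. "
                + "[ 2 ] This committee currently includes 34 municipalities. "
                + "[ 3 ] It has been a very polluted river.";
        List<String> sentences = splitSentences(text, Locale.ENGLISH);
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println(i + ": " + sentences.get(i));
        }
    }
}
```

If the whole sample is printed as a single segment, the iterator is not breaking at a "." that is followed by "[", which would match what the highlighter returns. The result may depend on the JDK version and locale.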

Source: Stack Overflow, https://stackoverflow.com/questions/36306189/
