java - Full-text search like Google

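I want to implement full-text search in my offline (Android) application to search a list of user-generated notes.

I would like it to behave just like Google (since most people are already used to querying Google).

My initial requirements are:

Fast: comparable to Google, or as fast as possible, with 100,000 documents of 20,000 words each.
Searching for two words should only return documents that contain both words (not just one word), unless the OR operator is used.
Case insensitive (aka normalization): if I have the word 'Hello' and I search for 'hello', it should match.
Diacritical mark insensitive: if I have the word 'así', a search for 'asi' should match. In Spanish, many people incorrectly omit diacritical marks or place them wrongly.
Stop word elimination: to avoid a huge index, meaningless words like 'and', 'the' or 'for' should not be indexed at all.
Dictionary substitution (aka stemming): similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.
Phrase search: if I have the text 'Hello world!', a search for '"world hello"' should not match it, but a search for '"hello world"' should.
If no field is specified, search all fields (in multi-field documents), not just the default field.
Auto-completion of search results while typing, to suggest popular searches (like Google Suggest).

How can I configure a full-text search engine to behave as much like Google as possible?

(I am mostly interested in open source, Java, and in particular Lucene.)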

Best answer

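I think Lucene can meet your requirements. You should also consider Solr, which has similar functionality and is much easier to set up.

I will discuss each requirement separately, using Lucene. I believe Solr has similar mechanisms.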

Fast: like Google or as fast as possible, having 100000 documents with 200 hundred words each.

This is a reasonable index size for both Lucene and Solr, allowing retrieval in a few tens of milliseconds per query.

Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)

You can do this in Lucene with a BooleanQuery that uses MUST as the default for its clauses.
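To make the "both words must match" semantics concrete, here is a minimal plain-Java sketch of an inverted index that intersects the document sets of the query terms. It illustrates the idea only and is not how Lucene implements BooleanQuery; the class `AndQuerySketch` and its methods `add` and `searchAll` are hypothetical names chosen for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index illustrating AND semantics (not Lucene's implementation). */
final class AndQuerySketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes a document: lowercases it and splits on anything that is not a letter or digit. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents that contain every one of the given terms. */
    Set<Integer> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its documents
            } else {
                result.retainAll(docs);         // later terms: keep only the intersection
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        AndQuerySketch index = new AndQuerySketch();
        index.add(1, "hello world");
        index.add(2, "hello there");
        System.out.println(index.searchAll("hello", "world")); // prints [1]
    }
}
```

An OR query would take the union of the per-term sets instead of the intersection.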

The next four requirements can be handled by customizing a Lucene Analyzer:

Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.

A LowerCaseFilter can be used for this.

Diacritical mark insensitive: If I have the word 'así' a search for 'asi' should match. In Spanish, many people, incorrectly, either do not put diacritical marks or fail in correctly putting them.

This requires Unicode normalization followed by removal of the diacritical marks. You can build a custom Analyzer for this.
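The normalize-then-strip step can be sketched with the JDK alone, using `java.text.Normalizer`. This is only an illustration of the technique, not a Lucene Analyzer; the class name `DiacriticFolder` and the method `fold` are made up for this example, and the same logic would have to be applied both at index time and at query time.

```java
import java.text.Normalizer;

/** Sketch of diacritic folding: decompose to NFD, then drop the combining marks. */
final class DiacriticFolder {
    static String fold(String input) {
        // NFD splits an accented letter into base letter + combining mark, e.g. U+00ED -> 'i' + U+0301
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("as\u00ed")); // "así" written with a Unicode escape; prints asi
    }
}
```

Note that this approach also folds 'ñ' to 'n', which may or may not be what you want for Spanish text. Lucene also ships an ASCIIFoldingFilter that covers this case for characters with an ASCII equivalent.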

Stop word elimination: To not have a huge index meaningless words like 'and', 'the' or 'for' should not be indexed at all.

A StopFilter removes stop words in Lucene.
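As a rough picture of what lowercasing plus stop word removal does to a piece of text before it is indexed, here is a small plain-Java sketch. It does not use Lucene's classes; `StopWordSketch`, `analyze` and the three-word stop list are assumptions made for illustration, and a real stop list would be longer and language-specific.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch of an analysis chain: tokenize, lowercase, drop stop words. */
final class StopWordSketch {
    // Tiny illustrative stop list taken from the words named in the question
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "the", "for"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token); // only tokens that survive the filter get indexed
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The notes for Alice and Bob")); // prints [notes, alice, bob]
    }
}
```

The same analysis has to be applied to the query string, otherwise a search for "The" would look for a term that was never indexed.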

Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.

Lucene has many Snowball stemmers. One of them may be appropriate.

Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.

This is covered by Lucene's specialized PhraseQuery.
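The difference between a phrase match and a plain "all words present" match can be sketched as follows. This is a simplified check that rescans the tokenized text; Lucene's PhraseQuery works from term positions stored in the index instead. The names `PhraseMatchSketch`, `tokenize` and `containsPhrase` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Sketch of phrase matching: the query tokens must appear consecutively and in order. */
final class PhraseMatchSketch {
    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** True if the phrase occurs in the document as a contiguous run of tokens. */
    static boolean containsPhrase(String document, String phrase) {
        List<String> query = tokenize(phrase);
        if (query.isEmpty()) {
            return false; // an empty phrase matches nothing in this sketch
        }
        return Collections.indexOfSubList(tokenize(document), query) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("Hello world!", "hello world")); // prints true
        System.out.println(containsPhrase("Hello world!", "world hello")); // prints false
    }
}
```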

As you can see, Lucene covers all of the required functionality. For a more complete picture, I suggest reading the book Lucene in Action, the Apache Lucene Wiki, or the Lucid Imagination site.

Regarding java - full-text search like Google, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/1977815/
