android - Unicode support for non-English characters with SQLite Full Text Search in Android

Scroll to the end to skip the explanation.

Background

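In my Android app I want to use non-English Unicode text strings to search for matches in text documents/fields stored in a SQLite database. I learned (or so I thought) that what I need to do is implement Full Text Search with fts3/fts4, which is what I have been working on learning for the past few days. FTS is supported by Android, as shown in the documentation Storing and Searching for Data and in the blog post Android Quick Tip: Using SQLite FTS Tables.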

Problem

Everything was looking good, but then I read the March 2012 blog post The sorry state of SQLite full text search on Android, which said

The first step when building a full text search index is to break down the textual content into words, aka tokens. Those tokens are then entered into a special index which lets SQLite perform very fast searches based on a token (or a set of tokens).

SQLite has two built-in tokenizers, and they both only consider tokens consisting of US ASCII characters. All other, non-US ASCII characters are considered whitespace.

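Then I also found this StackOverflow answer by @CL. (who, based on tags and reputation, appears to be an SQLite expert) replying to a question about matching Vietnamese letters with different diacritics: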

You must create the FTS table with a tokenizer that can handle Unicode characters, i.e., ICU or UNICODE61.

Please note that these tokenizers might not be available on all Android versions, and that the Android API does not expose any functions for adding user-defined tokenizers.

This 2011 SO answer seems to confirm that Android does not support tokenizers beyond the two basic ones, simple and porter.

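This is 2015. Have there been any updates to this situation? I need full text search to work for everyone using my app, not just people with new phones (even if the newest Android version does support it now).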

Possible partial solution?

I find it hard to believe that FTS does not work at all with Unicode. The documentation for the simple tokenizer says

A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms. (emphasis added)

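This gives me hope that some basic Unicode functionality could still be supported in Android, even if things like capitalization and diacritics (and various other equivalent letter forms that have different Unicode code points) are not supported.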
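
To make the quoted rule concrete, here is a minimal sketch that emulates how the documentation describes the simple tokenizer. It is not SQLite's actual code; the class name `SimpleTokenizerSketch` and its methods are hypothetical, and the lowercasing of ASCII letters is taken from the same SQLite documentation page rather than from the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented "simple" tokenizer rule.
// This only mimics the description in the SQLite docs; it is not SQLite code.
final class SimpleTokenizerSketch {
    private SimpleTokenizerSketch() {}

    // Eligible: ASCII letters and digits, plus every code point >= 128.
    static boolean isEligible(int cp) {
        return cp >= 128
                || (cp >= '0' && cp <= '9')
                || (cp >= 'a' && cp <= 'z')
                || (cp >= 'A' && cp <= 'Z');
    }

    // Splits text into terms; everything that is not eligible only separates terms.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isEligible(cp)) {
                // Only ASCII uppercase is folded; non-ASCII letters are left as-is.
                current.appendCodePoint(cp < 128 ? Character.toLowerCase(cp) : cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        });
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }
}
```

Under this rule, non-English words survive as whole terms, but non-ASCII punctuation is also treated as part of a term, so a word followed by such punctuation would be indexed as a different term than the bare word.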

My main question

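If I am only using literal Unicode string tokens separated by spaces, can I use SQLite FTS in Android with non-English Unicode text (code points >= 128)? (That is, I am searching for exact strings that occur in the text.)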

Update

The unicode61 tokenizer is available beginning with SQLite version 3.7.13. This tokenizer supports "full unicode case folding" and "recognizes unicode space and punctuation characters". Android Lollipop (API 21+) uses SQLite 3.8.
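
One way to act on this is to check the SQLite version at runtime and fall back to the simple tokenizer on older devices. The sketch below assumes the version string comes from running `SELECT sqlite_version()` and is a plain dotted number such as "3.8.6"; the class name, the table name `docs_fts` and the column name `body` are all hypothetical. A version check is only a hint, since a device's SQLite build could still have been compiled without the tokenizer.

```java
// Hypothetical helper: decides which tokenizer to request based on the
// SQLite version string (assumed to come from "SELECT sqlite_version()").
final class SqliteVersionCheck {
    private SqliteVersionCheck() {}

    // Numeric comparison of dotted versions, so that "3.7.9" < "3.7.13".
    static boolean isAtLeast(String version, int major, int minor, int patch) {
        String[] parts = version.trim().split("\\.");
        int[] wanted = {major, minor, patch};
        for (int i = 0; i < wanted.length; i++) {
            int have = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (have != wanted[i]) {
                return have > wanted[i];
            }
        }
        return true;
    }

    // unicode61 is documented as available from 3.7.13 onward.
    static boolean mayHaveUnicode61(String version) {
        return isAtLeast(version, 3, 7, 13);
    }

    // Table and column names here are placeholders, not from the question.
    static String createFtsTableSql(String version) {
        String tokenizer = mayHaveUnicode61(version) ? "unicode61" : "simple";
        return "CREATE VIRTUAL TABLE docs_fts USING fts4(body, tokenize=" + tokenizer + ")";
    }
}
```

If the CREATE statement with unicode61 still fails on some device, catching that error and retrying with simple would be a reasonable second line of defense.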

Best answer

Supplemental answer

I ended up doing what @CL. recommended and was able to successfully implement full text search with Unicode. These are the basic steps I followed:

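1. Replace all Unicode characters (>= 128) that are not part of a word with a space character.
2. (Optional) Replace specific characters with more general ones. For example, ē, è, and é could all be replaced with e (if this kind of generalized search is desired). This is not necessary, but if you don't do it, then searching for é will only return documents with é, and searching for e will only return documents with e (and not é).
3. Populate the virtual FTS table using the modified text created in steps 1 and 2.
4. Populate your normal table with the unmodified text. Of course, the schema and the number of documents must be the same as the ones you created the FTS table with.
5. Link the virtual FTS table with your normal text table/column using an external content table, so that you are not storing a copy of the modified text, only the document ids created from that text.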
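
Steps 1 and 2 could look roughly like the following sketch. It is not the exact code I used: the class and method names are hypothetical, "part of a word" is approximated as letters, digits and combining marks, and the accent folding relies on `java.text.Normalizer` instead of a hand-written replacement table.

```java
import java.text.Normalizer;

// Hypothetical sketch of steps 1 and 2: prepare text before it is inserted
// into the FTS index. The original, unmodified text goes into the normal table.
final class FtsTextNormalizer {
    private FtsTextNormalizer() {}

    // Step 1: replace code points >= 128 that are not letters, digits or
    // combining marks with a space. ASCII is left alone because the simple
    // tokenizer already treats ASCII punctuation as a separator.
    static String replaceNonWordCharacters(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 128 || Character.isLetterOrDigit(cp) || isCombiningMark(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
        });
        return out.toString();
    }

    private static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Step 2 (optional): decompose accented letters and drop the marks,
    // so that e-with-accent variants are all indexed as plain "e".
    static String foldDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Both steps combined; apply the same function to the user's search query.
    static String prepareForIndex(String text) {
        return replaceNonWordCharacters(foldDiacritics(text));
    }
}
```

The same preparation has to be applied to the search string at query time, otherwise the query will not match the indexed terms. Letters without a canonical decomposition (for example đ) pass through unchanged and would need an explicit replacement if they should be folded too.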

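Please read Full text search example in Android for instructions on how to create the FTS table and link it to a normal table. It took a long time to figure out, but in the end it makes for very fast full text searches, even with a large number of documents.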
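
For orientation, the SQL involved in steps 3 to 5 might look like the statements below, held as Java string constants so they could be passed to something like Android's `SQLiteDatabase.execSQL` or `rawQuery`. This is only a sketch: the table names `docs` and `docs_fts` and the column names `_id` and `body` are made up for illustration, and the linked example remains the reference for the full setup.

```java
// Hypothetical SQL for an FTS4 external content setup.
// All table and column names are placeholders.
final class FtsSchemaSketch {
    private FtsSchemaSketch() {}

    // Normal table: stores the original, unmodified text.
    static final String CREATE_DOCS =
            "CREATE TABLE docs (_id INTEGER PRIMARY KEY, body TEXT)";

    // FTS4 table that uses "docs" as its external content table, so the
    // indexed text itself is not stored a second time.
    static final String CREATE_DOCS_FTS =
            "CREATE VIRTUAL TABLE docs_fts USING fts4(content=\"docs\", body)";

    // Index the modified text under the same id as the row in "docs".
    // Bind: 1 = the row's _id, 2 = the text after steps 1 and 2.
    static final String INSERT_INTO_FTS =
            "INSERT INTO docs_fts (docid, body) VALUES (?, ?)";

    // Look up matching ids in the index, then read the original text.
    // Bind: 1 = the search string, prepared the same way as the indexed text.
    static final String SEARCH =
            "SELECT _id, body FROM docs WHERE _id IN "
                    + "(SELECT docid FROM docs_fts WHERE docs_fts MATCH ?)";
}
```

Because the index holds the modified text while `docs` holds the original, it is up to the app to keep the two in sync whenever a document is inserted, changed or deleted.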

If you need more details, please leave a comment below.

Regarding android - Unicode support for non-English characters with SQLite Full Text Search in Android, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29669342/
