performance - Finding long repeated substrings in a massive string

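I naively imagined that I could build a suffix trie, keep a visit count for each node, and then the deepest nodes with a count greater than one would be the result set I'm looking for.

I have a really, really long string (hundreds of megabytes). I have about 1 GB of RAM.

This is why building a suffix trie with counting data is too space-inefficient to work for me. To quote Wikipedia's Suffix tree article: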

storing a string's suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

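That was Wikipedia's comment on the tree, not the trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some Wikipedia links, to save people from posting them as the "answer": Algorithms on strings and especially Longest repeated substring problem ;-) )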

Best answer

The effective way to do this is to create an index of the substrings and sort them. This is an O(n lg n) operation.

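BWT compression performs this step, so it is a well-understood problem, and there are radix and suffix (claimed O(n)) sort implementations and the like to make it as efficient as possible. It still takes a long time, perhaps several seconds for large texts.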

If you want to use utility code, C++ std::stable_sort() performs much better than std::sort() for natural language (and is much faster than C's qsort(), but for different reasons).
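
As a rough illustration of the index-and-sort step, here is a minimal sketch, not a tuned implementation. The function name build_suffix_index is made up for this example, and it assumes C++17 (for std::string_view) and a text shorter than 4 GiB so that 32-bit offsets suffice. With 32-bit offsets the index costs four bytes per character of text, and std::stable_sort typically allocates a temporary buffer on top of that, which matters with only about 1 GB of RAM.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the original answer): returns the start
// offsets of all suffixes of `text`, ordered so that the suffixes they
// point to are in lexicographic order.
// Assumption: text.size() fits in a 32-bit unsigned integer.
std::vector<std::uint32_t> build_suffix_index(const std::string& text) {
    std::vector<std::uint32_t> index(text.size());
    std::iota(index.begin(), index.end(), 0u);  // 0, 1, 2, ..., n-1

    // string_view::substr does not copy, so each comparison walks the
    // original buffer directly.
    const std::string_view view(text);
    std::stable_sort(index.begin(), index.end(),
                     [view](std::uint32_t a, std::uint32_t b) {
                         return view.substr(a) < view.substr(b);
                     });
    return index;
}
```

Only offsets are sorted, never copies of the substrings. Each comparison can still cost time proportional to the prefix the two suffixes share, so highly repetitive input will sort more slowly than ordinary natural-language text.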

Then visiting each item to see the length of its common substring with its neighbours is O(n).
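
The neighbour scan could look something like the following sketch. It assumes the suffix start offsets have already been sorted lexicographically and are passed in as sorted_suffixes. The names common_prefix_length and longest_repeated_substring are invented for this example. The loop visits each adjacent pair once, although each comparison takes time proportional to the length of the prefix the two suffixes share.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: length of the common prefix of the suffixes of
// `text` starting at offsets a and b.
std::size_t common_prefix_length(const std::string& text,
                                 std::size_t a, std::size_t b) {
    std::size_t len = 0;
    while (a + len < text.size() && b + len < text.size() &&
           text[a + len] == text[b + len]) {
        ++len;
    }
    return len;
}

// Given suffix start offsets that are already in lexicographic order,
// returns the longest substring that occurs at least twice in `text`
// (an empty string if nothing repeats). Occurrences may overlap.
std::string longest_repeated_substring(
        const std::string& text,
        const std::vector<std::uint32_t>& sorted_suffixes) {
    std::size_t best_len = 0;
    std::size_t best_pos = 0;
    for (std::size_t i = 1; i < sorted_suffixes.size(); ++i) {
        const std::size_t len = common_prefix_length(
            text, sorted_suffixes[i - 1], sorted_suffixes[i]);
        if (len > best_len) {
            best_len = len;
            best_pos = sorted_suffixes[i];
        }
    }
    return text.substr(best_pos, best_len);
}
```

For the text "banana", whose sorted suffix offsets are 5, 3, 1, 0, 4, 2, the scan returns "ana": the neighbouring suffixes "ana" and "anana" share a three-character prefix. To collect every long repeat rather than only the longest, the same loop could record each adjacent pair whose shared prefix exceeds a chosen threshold.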

Regarding performance - Finding long repeated substrings in a massive string, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/398811/
