python - trie 的快速序列化

我的应用程序的一部分使用 trie至chunk话在一起。例如，["Summer", "in", "Los", "Angeles"] 变为 ["Summer", "in", "Los Angeles"]。

现在，这个特里树从 a large database 开始填充，在应用程序启动时以 SQL 形式本地存储。时间比较长，15s左右。我想减少应用程序的启动时间，因此我研究了序列化 Trie。不幸的是，pickling太慢了 - 比从数据库加载所有内容都慢。

是否有更快的方法来序列化我的字典树？

这是 Trie 类的样子:

class Trie:
    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        """Insert a (key,value) pair into the trie.  
        The key should be a list of strings.
        The value can be of arbitrary type."""
        current_node = self
        for key_part in key:
            if key_part not in current_node.children:
                current_node.children[key_part] = Trie()
            current_node = current_node.children[key_part]
        current_node.values.add(value)

    def retrieve(self, key):
        """Returns either the value stored at the key, or raises KeyError."""
        current_node = self
        for key_part in key:
            current_node = current_node.children[key_part]
        return current_node.values

有什么方法可以改变它，使其更加可序列化吗？

最佳答案

我知道我没有给出 python 答案，但这仍然可能有用:

创建、压缩和存储 trie 确实是一项艰巨的任务。我花了相当多的时间思考自动建议的数据结构，据我所知，最优雅的解决方案是由 Giuseppe Ottaviano 和 partly described in my blog article 提供的。

尽管实现 Ottaviano 的整个解决方案没有意义 as described in his paper在 python 中，遵循基本思想将完整的 trie 存储为一大块内存并且仅引用下一个跳转的位置可能仍然有意义。

通过这种方式你可以轻松地将这个数组或内存块序列化到硬盘上。我对 python 不太确定，但我认为这个操作应该可以工作，并且比序列化数据结构要快得多。

我知道 Ottavianos 工作的 C 实现存在，您甚至可以使用 python C 绑定(bind)。

关于python - trie 的快速序列化，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/23284371/

python - trie 的快速序列化

上一篇：python - 无法在 python 应用程序的 app.yaml 中设置缓存过期

下一篇：python - 根据单击的按钮验证 WTForm 表单