python - Dictionary-based compression in Python for compressing short strings individually

I have millions of strings of fewer than 20 characters each, and I want to compress each string individually.

Using zlib or lz4 on each string individually does not work: the output is larger than the input:

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
import zlib
for s in inputs:
    c = zlib.compress(s)
    print(c, len(c), len(s))  # the output is larger than the input

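Is there a way in Python (perhaps with zlib or lz4?) to use dictionary-based compression with a custom dictionary size (e.g. 64 KB or 1 MB) that would allow very short strings to be compressed individually?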

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
# DictionaryCompressor is a hypothetical API, shown only to illustrate the desired usage
D = DictionaryCompressor(dictionary_size=1_000_000)
for s in inputs:
    D.update(s)
# now the dictionary is ready    
for s in inputs:
    print(D.compress(s))

Note: "Smaz" looks promising, but its dictionary is hard-coded and not adaptive: https://github.com/antirez/smaz/blob/master/smaz.c

Best answer

Python's zlib interface does in fact provide a zdict parameter for compressobj and decompressobj, as of version 3.3 (released in 2012).

You can provide a dictionary of up to 32K to aid compression of short strings. I would also recommend using raw deflate streams (wbits=-15) to minimize the size.
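As a minimal sketch of the above, the following compresses and decompresses one short string at a time with a shared dictionary and a raw deflate stream. The dictionary contents and the helper names `compress_short` and `decompress_short` are assumptions made for illustration; a real dictionary would be built from your own data, and the same dictionary must be available when decompressing.

```python
import zlib

# Hypothetical dictionary for illustration; build yours from real samples.
ZDICT = b"12345 1234 HELLO barfoo foo bar hello world"

def compress_short(data: bytes, zdict: bytes = ZDICT) -> bytes:
    # wbits=-15 selects a raw deflate stream: no zlib header, no Adler-32 trailer.
    c = zlib.compressobj(level=9, wbits=-15, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_short(blob: bytes, zdict: bytes = ZDICT) -> bytes:
    # The decompressor must be given exactly the same dictionary.
    d = zlib.decompressobj(wbits=-15, zdict=zdict)
    return d.decompress(blob) + d.flush()

for s in [b"hello world", b"foo bar", b"12345 barfoo"]:
    blob = compress_short(s)
    print(s, len(s), len(blob), decompress_short(blob) == s)
```

Since a raw stream carries no checksum, corruption is not detected by zlib itself; if that matters, integrity checking would need to be handled elsewhere.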

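There are many ways to build a 32K dictionary. A good starting point is simply to concatenate a few thousand of your short strings. See whether that allows your short strings to be compressed, and test with strings that are not in the dictionary.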
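The concatenation approach might look like the sketch below. The helper names `build_zdict` and `compressed_size` are hypothetical, the training list is a tiny stand-in for "a few thousand" real strings, and keeping only the tail of an oversized concatenation is an assumption based on deflate's 32 KiB window.

```python
import zlib

def build_zdict(samples, max_size=32 * 1024):
    # Concatenate the sample strings. Deflate can only look back 32 KiB,
    # so if the result is longer, keep just the tail.
    joined = b"".join(samples)
    return joined[-max_size:]

def compressed_size(data: bytes, zdict: bytes) -> int:
    # Raw deflate (wbits=-15) with the preset dictionary.
    c = zlib.compressobj(level=9, wbits=-15, zdict=zdict)
    return len(c.compress(data) + c.flush())

# Stand-in training set; use a few thousand of your real strings instead.
training = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234"]
held_out = b"12345 barfoo"  # deliberately not part of the dictionary

zdict = build_zdict(training)
print("held-out:", len(held_out), "->", compressed_size(held_out, zdict))
print("in-dict: ", len(training[0]), "->", compressed_size(training[0], zdict))
```

Measuring on held-out strings, as the answer suggests, is what tells you whether the dictionary generalizes rather than merely memorizing its own contents.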

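You can also try zstd, which should perform better than zlib and also supports dictionaries. zstd also has code to help you generate a dictionary. You would need to write your own Python interface to zstd.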

I have not tried this, but it may be possible to use zstd's dictionary generation to make a good dictionary for zlib's deflate.

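Lastly, I would try preprocessing the short strings based on what I know about them. If there is a way to tokenize content that you know will be present in the strings, then you have already compressed them somewhat.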
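One possible form of such tokenization, sketched under the assumption that the strings are pure ASCII, is to map known frequent substrings to single bytes in the unused 0x80-0xFF range. The token table and the function names `tokenize` and `detokenize` are hypothetical and would have to be chosen from your own data.

```python
# Hypothetical token table: frequent substrings mapped to single bytes >= 0x80.
# Assumes inputs are pure ASCII, so those byte values never occur in the data.
TOKENS = [b"hello", b"world", b"foo", b"bar"]

def tokenize(data: bytes) -> bytes:
    if any(b >= 0x80 for b in data):
        raise ValueError("input must be ASCII")
    for i, word in enumerate(TOKENS):
        data = data.replace(word, bytes([0x80 + i]))
    return data

def detokenize(data: bytes) -> bytes:
    # Each token byte expands back to its ASCII word; order does not matter
    # because the expansions never contain token bytes.
    for i, word in enumerate(TOKENS):
        data = data.replace(bytes([0x80 + i]), word)
    return data

for s in [b"hello world", b"12345 barfoo"]:
    t = tokenize(s)
    print(s, len(s), "->", len(t), detokenize(t) == s)
```

The tokenized bytes can then be passed through a dictionary-based compressor as well, though whether the two steps combine favorably is something to measure on real data.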

This question and answer were found on Stack Overflow: https://stackoverflow.com/questions/74253737/
