java - Compressing Unicode characters

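I am using GZIPOutputStream in my Java program to compress large strings, which I then store in a database.

When I compress English text, I get a compression ratio of 1/4 to 1/10 (depending on the string value). For example, if my original English text is 100 KB, the compressed text will on average be somewhere around 30 KB.

But when I compress Unicode characters, the compressed string actually takes up more bytes than the original string. Say my original Unicode string is 100 KB; the compressed version then comes out at 200 KB.

Example Unicode string: "嗨，这是，短信计算测试持续for.Hi这是短"

Can anyone suggest how I can achieve compression for Unicode text, and explain why the compressed version is actually bigger than the original?

My Java compression code: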

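            import java.io.ByteArrayOutputStream;
            import java.nio.charset.StandardCharsets;
            import java.util.zip.GZIPOutputStream;

            // Compress the UTF-8 bytes of the text into an in-memory buffer.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                zos.write(text.getBytes(StandardCharsets.UTF_8));
            } // close() finishes the stream and writes the gzip trailer

            byte[] udpBuffer = baos.toByteArray();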
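Before drawing conclusions from the sizes, it may help to measure both sides in bytes, since `String.length()` counts UTF-16 code units rather than encoded bytes. The following is a minimal sketch, not the asker's actual code: the class name `GzipSizeCheck` and its helper methods are hypothetical, and `readAllBytes` assumes Java 9 or later. It also shows that a very short input can grow, because gzip adds a fixed header and trailer of about 18 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, used only for this illustration.
public class GzipSizeCheck {
    // Gzip the UTF-8 bytes of the text entirely in memory.
    static byte[] compress(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(baos)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return baos.toByteArray();
    }

    // Reverse of compress(): gunzip and decode the bytes as UTF-8.
    static String decompress(byte[] data) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default input: four CJK-range characters written as Unicode escapes.
        String text = args.length > 0 ? args[0] : "\u55E8\uFF0C\u8FD9\u662F";
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(text);
        System.out.println("chars=" + text.length()
                + " utf8Bytes=" + raw.length
                + " gzipBytes=" + packed.length);
        System.out.println("round trip ok: " + text.equals(decompress(packed)));
    }
}
```

Running it with a realistic, large input passed as the first argument gives a fairer picture than a one-line sample, because the fixed gzip overhead only dominates on tiny inputs.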

Best answer

Java's GZIPOutputStream uses the Deflate compression algorithm to compress data. Deflate is a combination of LZ77 and Huffman coding. According to Unicode's Compression FAQ:

Q: What's wrong with using standard compression algorithms such as Huffman coding or patent-free variants of LZW?

A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being the same) and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that is desired.

To get the same effect with one of the popular general purpose algorithms, like Huffman or any of the variants of Lempel-Ziv compression, it would have to be retargeted to 16-bit, losing effectiveness due to the larger alphabet size. It's relatively easy to work out the math for the Huffman case to show how many extra bits the compressed text would need just because the alphabet was larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues see the book Text Compression by Bell, Cleary and Witten (Prentice Hall 1990).
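The encoding-level redundancy the FAQ mentions depends on how the text is turned into bytes before a byte-oriented compressor sees it. As a rough illustration only (the class name `EncodingSizes` and its methods are hypothetical), this sketch compares the byte counts of the same strings in UTF-8 and UTF-16: ASCII-range characters take one byte in UTF-8 and two in UTF-16, while CJK characters in the Basic Multilingual Plane take three bytes in UTF-8 and two in UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only for this illustration.
public class EncodingSizes {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16 (big-endian, no byte-order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "Hi";
        // Three CJK characters from the Basic Multilingual Plane, as escapes.
        String cjk = "\u55E8\u8FD9\u662F";
        System.out.println("latin: utf8=" + utf8Size(latin) + " utf16=" + utf16Size(latin));
        System.out.println("cjk:   utf8=" + utf8Size(cjk) + " utf16=" + utf16Size(cjk));
    }
}
```

Which encoding compresses better for a given text is something to measure rather than assume; the sketch only shows where the raw byte counts differ.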

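I found this set of Java classes for SCSU compression on the Unicode website, which may be useful to you. I could not find a .jar library that you can easily import into your project, but you can package the classes into one yourself if you like.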

This Q&A on compressing Unicode characters in Java comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/23013654/
