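java - How to convert Windows-1251 text to something readable?

I have a string that is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the corresponding HTML file, the encoding is Windows-1251.

How can I convert this string to something readable?

I tried this: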

import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }
    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }
    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }
    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}

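The variable bytes contains the data shown in my debugger; it is the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copied and pasted that array here.

This does not work: readableString contains garbage.

How can I fix it, i.e. make sure that the Windows-1251 string is decoded properly?

Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.

Update 2: Another attempt:

Update 3 (30.07.2015 14:38): The text I need to decode corresponds to the text in the drop-down list shown below.

Update 4 (30.07.2015 14:41): The encoding detector (code below) says that the encoding is not Windows-1251, but UTF-8.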

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}
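A note on why the detector may report UTF-8: in the byte array from getWindows1251String(), the triple -17, -65, -67 is 0xEF 0xBF 0xBD, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character. That suggests the Russian letters had already been replaced by a wrong decoding step before the string was created, in which case no later conversion can bring them back. The sketch below counts such characters in a byte array; the class and method names are hypothetical and only serve the illustration.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharCheck {
    // Decodes the bytes as UTF-8 and counts U+FFFD replacement characters.
    static int countReplacementChars(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        int count = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (decoded.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A shortened sample in the same pattern as the array in the question:
        // spaces (32) around two copies of the triple -17, -65, -67.
        byte[] sample = {32, -17, -65, -67, -17, -65, -67, 32};
        System.out.println(countReplacementChars(sample)); // prints 2
    }
}
```

If this count is non-zero for data that should contain Cyrillic text, the likely place to look is the code that first turns bytes into characters.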

Best answer

I solved this by modifying the piece of code that reads the text from the website.

private String readContent(final String urlAsString) {
    final StringBuilder content = new StringBuilder();
    BufferedReader reader = null;
    InputStream inputStream = null;
    try {
        final URL url = new URL(urlAsString);
        inputStream = url.openStream();
        reader =
            new BufferedReader(new InputStreamReader(inputStream));

        String inputLine;
        while ((inputLine = reader.readLine()) != null) {
            content.append(inputLine);
        }
    } catch (final IOException exception) {
        exception.printStackTrace();
    } finally {
        IOUtils.closeQuietly(reader);
        IOUtils.closeQuietly(inputStream);
    }
    return content.toString();
}

I changed the line

new BufferedReader(new InputStreamReader(inputStream));

to

new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));

and then it worked.
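The reason this helps is that an InputStreamReader created without a charset argument decodes with the platform default charset, so Windows-1251 bytes can be misread at the very first step. A minimal sketch of the difference, reading from an in-memory byte array instead of a URL (the class name, the readAll helper, and the sample word are assumptions for illustration, not part of the original code):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadDemo {
    // Reads the whole stream as text, decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // The Russian word for "hello", encoded as Windows-1251.
        byte[] raw = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                      (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        String expected = "\u041F\u0440\u0438\u0432\u0435\u0442";

        String right = readAll(new ByteArrayInputStream(raw), cp1251);
        String wrong = readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);

        System.out.println(right.equals(expected)); // prints true
        System.out.println(wrong.contains("\uFFFD")); // prints true
    }
}
```

Decoding the same bytes as UTF-8 yields replacement characters, and the original bytes cannot be recovered from that string afterwards, which is why the charset has to be specified where the stream is read.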

Regarding "java - How to convert Windows-1251 text to something readable?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31720244/
