Javax DocumentBuilder produces "double-UTF-8'ed" charset encoding

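I have a Java DOM Document that MyFilter has rewritten. From the log output I know that the Document's content is still correct. I use the following lines to convert theDocument into a List<String> so that I can pass it back through the interface: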

Transformer transformer = TransformerFactory.newInstance().newTransformer();
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray()).split("\r?\n"));

The filter is called from this file-copying method, which uses org.apache.commons.io.FileUtils:

List<String> lines = FileUtils.readLines(source, "UTF-8");
if (filters != null) {
    for (final MyFilter filter : filters) {
        lines = filter.filter(lines);
    }
}
FileUtils.writeLines(destination, "UTF-8", lines);

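This works fine on my machine (where I can debug it), but when the code runs on other machines, any non-ASCII characters reproducibly come out double-UTF-8 encoded (for example, Größe becomes GrÃ¶ÃŸe). The code is executed inside a web application running in Tomcat. I am sure the machines are configured differently, but what I want is an uncorrupted result on any configuration.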
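
To illustrate the symptom, here is a minimal sketch that reproduces the garbling in isolation. It assumes the other machines use windows-1252 as their platform default charset, which the question does not state; `MojibakeDemo` and `decodeWith` are hypothetical names. The underlying fact is that `new String(byte[])` without a charset argument decodes with the platform default, so UTF-8 bytes can be misread before `FileUtils.writeLines` encodes the result as UTF-8 a second time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes the bytes with the given (possibly wrong) charset. */
    static String decodeWith(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        // "Größe", written with Unicode escapes so the source file encoding does not matter.
        String original = "Gr\u00f6\u00dfe";

        // Assumption: the platform default on the failing machines is windows-1252.
        Charset assumedDefault = Charset.forName("windows-1252");

        // Each two-byte UTF-8 sequence is misread as two separate characters;
        // the resulting string value is GrÃ¶ÃŸe.
        String garbled = decodeWith(original, assumedDefault);
        System.out.println(garbled.length()); // 7 characters instead of 5

        // Decoding with the charset that was used for encoding round-trips cleanly.
        String intact = decodeWith(original, StandardCharsets.UTF_8);
        System.out.println(intact.equals(original)); // true
    }
}
```

Passing `StandardCharsets.UTF_8` explicitly wherever bytes are turned into a String makes the result independent of the machine's default charset.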

Any ideas what I might be missing?

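Best answer

Once you have created the Document object, you have to read its content.

After that, you have to write it to a file using the LSSerializer interface, which the DOM standard provides for this purpose.

By default, LSSerializer produces an XML document without whitespace or line breaks. The output therefore does not look pretty, but it is actually better suited to being parsed by another program, because it contains no unnecessary whitespace.
If you want the whitespace, you can apply one more magic incantation after creating the serializer: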

ser.getDomConfig().setParameter("format-pretty-print", true);

The code snippet looks like this:

private String getContentFromDocument(Document doc) {
    String content;

    DOMImplementation impl = doc.getImplementation();
    DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");

    LSSerializer ser = implLS.createLSSerializer();
    ser.getDomConfig().setParameter("format-pretty-print", true);
    content = ser.writeToString(doc);

    return content;
}
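
If you would rather keep the Transformer from the question, the following sketch shows one way to make that conversion independent of the platform default charset. It is not part of the original answer; `DocumentLines` and `toLines` are hypothetical names. The idea is to set the output encoding explicitly and to decode the buffer with the same charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DocumentLines {

    /** Serializes the document as UTF-8 bytes and decodes them as UTF-8 again. */
    static List<String> toLines(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Ask for UTF-8 explicitly instead of relying on the transformer's default.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(buffer));

        // Decode with the same charset that was used for writing.
        String xml = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        return Arrays.asList(xml.split("\r?\n"));
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document in memory; the element name "size" is only an example.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("size");
        root.setTextContent("Gr\u00f6\u00dfe"); // "Größe"
        doc.appendChild(root);

        toLines(doc).forEach(System.out::println);
    }
}
```

The only change compared with the question's code is that the charset is named on both sides of the byte buffer, so the JVM's default charset no longer plays a role in the conversion.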

Once you have the string content, you can write it to a file, for example:

public void writeToXmlFile(String xmlContent) {
    File theDir = new File("./output");
    if (!theDir.exists())
        theDir.mkdir();

    String fileName = "./output/" + this.getClass().getSimpleName() + "_"
            + Calendar.getInstance().getTimeInMillis() + ".xml";

    try (OutputStream stream = new FileOutputStream(new File(fileName))) {
        try (OutputStreamWriter out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
            out.write(xmlContent);
            out.write("\n");
        }
    } catch (IOException ex) {
        System.err.println("Cannot write to file! " + ex.getMessage());
    }
}
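
One caveat, offered as a hedged note rather than as part of the original answer: the DOM Load and Save specification says that writeToString uses UTF-16, and with the JDK's built-in implementation the returned string typically starts with an XML declaration naming encoding="UTF-16", which no longer matches a file written as UTF-8. The sketch below lets the serializer write the bytes itself through an LSOutput with an explicit encoding; `LsUtf8Writer` and `write` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

final class LsUtf8Writer {

    /** Serializes the document to the given stream as UTF-8 bytes. */
    static void write(Document doc, OutputStream target) {
        DOMImplementationLS implLS =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer ser = implLS.createLSSerializer();

        // The LSOutput carries both the destination and the encoding,
        // so the XML declaration and the actual bytes agree.
        LSOutput out = implLS.createLSOutput();
        out.setEncoding("UTF-8");
        out.setByteStream(target);

        ser.write(doc, out);
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document in memory; the element name "size" is only an example.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("size");
        root.setTextContent("Gr\u00f6\u00dfe"); // "Größe"
        doc.appendChild(root);

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(doc, buffer);
        System.out.println(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

To write to a file instead, pass a FileOutputStream opened in a try-with-resources block as the target.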

By the way:

Have you tried obtaining the Document object in a simpler way, for example:

DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = documentFactory.newDocumentBuilder();
Document doc = builder.parse(new File(fileName));

You could try this as well. Parsing the XML file directly should be enough.
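
As a rough illustration of why parsing the raw bytes sidesteps the charset problem: when the parser is given a byte stream (or a File), it determines the encoding from the XML declaration itself rather than from the platform default. The sketch below uses an in-memory byte array in place of a file; `ParseBytesDemo` and the sample XML are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParseBytesDemo {

    /** Parses raw XML bytes; the parser picks the encoding from the XML declaration. */
    static Document parse(byte[] xmlBytes) throws Exception {
        DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = documentFactory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xmlBytes));
    }

    public static void main(String[] args) throws Exception {
        // Sample input standing in for the contents of a UTF-8 encoded file.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><size>Gr\u00f6\u00dfe</size>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        Document doc = parse(bytes);
        String text = doc.getDocumentElement().getTextContent();
        System.out.println(text.equals("Gr\u00f6\u00dfe")); // true
    }
}
```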

Source: a similar question found on Stack Overflow, "Javax DocumentBuilder produces "double-UTF-8'ed" charset encoding": https://stackoverflow.com/questions/35752242/
