java - Remove repeated newlines/tabs/whitespace in XML character elements

<node> test
    test
    test
</node>

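I want my XML parser to read the characters inside <node> and:

Replace newlines and tabs with spaces, and collapse multiple spaces into one. As a result, the text should look like "test test test".
If the node contains XML-encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left as they are.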

I am trying the code below, but it keeps the repeated whitespace.

  DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
  dbf.setIgnoringComments( true );
  dbf.setNamespaceAware( namespaceAware );
  DocumentBuilder db = dbf.newDocumentBuilder();
  Document doc = db.parse( inputStream );

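Is there a way to do what I want?

Thanks!

Best answer

The first part - replacing multiple whitespace characters - is relatively easy, although I don't think the parser will do it for you: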

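import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Parse the stream into a DOM by evaluating the root path.
InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);

// Select every text node and collapse each whitespace run to one space.
NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
    XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
  Text text = (Text) nodes.item(i);
  // \s+ rather than \s{2,}, so a lone tab or newline also becomes a space
  text.setTextContent(text.getTextContent().replaceAll("\\s+", " "));
}

// check results
TransformerFactory.newInstance()
    .newTransformer()
    .transform(new DOMSource(doc), new StreamResult(System.out));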
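
For a quick check of this approach, here is a minimal, self-contained sketch that parses the sample from the question out of a string and prints the normalized text. The class name `CollapseWhitespaceDemo` and the helper `normalizedText` are made up for illustration. It uses `\\s+` so that a single tab or newline is also turned into a space, and the `trim()` call is an extra assumption: it drops the space left at both ends so the result matches "test test test" exactly.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CollapseWhitespaceDemo {

  // Hypothetical helper: parse the XML and return the root element's text
  // with every whitespace run collapsed to a single space.
  static String normalizedText(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      String raw = doc.getDocumentElement().getTextContent();
      // trim() is an addition: it removes the space left at both ends.
      return raw.replaceAll("\\s+", " ").trim();
    } catch (Exception e) {
      throw new IllegalStateException("could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<node> test\n    test\n    test\n</node>";
    System.out.println(normalizedText(xml)); // prints "test test test"
  }
}
```

Note that Java's `\s` only covers ASCII whitespace by default, so a non-breaking space (U+00A0) would be left untouched by this pattern.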

This is the hard part:

If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.

The parser will always turn "&#x9;" into "\t" - you may need to write your own XML parser.

According to the author of Saxon:
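
The expansion is easy to observe. The sketch below (class and method names are illustrative, not from the original answer) parses one node whose content is written with the `&#x9;` character reference and one with a literal tab, then checks that the DOM holds the same plain tab character in both cases.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class CharRefExpansionDemo {

  // Hypothetical helper: return the text content of the root element.
  static String rootText(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException("could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    String encoded = rootText("<node>a&#x9;b</node>");
    String literal = rootText("<node>a\tb</node>");
    // Both documents produce the same three characters: 'a', TAB, 'b'.
    System.out.println(encoded.equals("a\tb"));   // prints "true"
    System.out.println(encoded.equals(literal));  // prints "true"
  }
}
```

Once the document is in the DOM, nothing records which of the two spellings the source used, so the whitespace-collapsing loop cannot tell them apart.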

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
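
If preserving the references really is a requirement, one possible workaround that stops short of writing a parser is to pre-process the raw XML text: swap the character references for placeholder characters before parsing, collapse the whitespace, then turn the placeholders back into the characters they stood for. This is only a sketch under stated assumptions and is not part of the quoted advice. The class `PreserveCharRefsDemo` and its methods are hypothetical, the Private Use Area placeholders are assumed never to occur in the real input, and the plain regex substitution also rewrites references inside attribute values and CDATA sections, which may not be what you want.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class PreserveCharRefsDemo {

  // Placeholders from the Unicode Private Use Area (an assumption:
  // the input must not already contain these characters).
  private static final char TAB = '\uE009';
  private static final char NL = '\uE00A';
  private static final char SP = '\uE020';

  // Replace hex and decimal references for tab, newline and space
  // with placeholders that \s does not match.
  static String protect(String xml) {
    return xml
        .replaceAll("&#(?:x0*9|0*9);", String.valueOf(TAB))
        .replaceAll("&#(?:x0*[aA]|0*10);", String.valueOf(NL))
        .replaceAll("&#(?:x0*20|0*32);", String.valueOf(SP));
  }

  // Turn the placeholders back into the characters they stand for.
  static String restore(String text) {
    return text.replace(TAB, '\t').replace(NL, '\n').replace(SP, ' ');
  }

  // Parse the protected XML, collapse literal whitespace, then restore.
  static String normalizedText(String xml) {
    try {
      String raw = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(protect(xml))))
          .getDocumentElement()
          .getTextContent();
      return restore(raw.replaceAll("\\s+", " ").trim());
    } catch (Exception e) {
      throw new IllegalStateException("could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<node> a&#x9;b\n    c&#x20;&#x20;d\n</node>";
    String out = normalizedText(xml);
    // The encoded tab and the two encoded spaces survive;
    // the literal newline and indentation collapse to one space.
    System.out.println(out.equals("a\tb c  d")); // prints "true"
  }
}
```

The result is the text with the encoded characters kept as real characters. If the references themselves must appear in the output again, the serializer would have to re-encode them, which this sketch does not attempt.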

Regarding "java - Remove repeated newlines/tabs/whitespace in XML character elements", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23157039/
