<node> test
test
test
</node>
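I want my XML parser to read the characters inside <node> and:
- replace newlines and tabs with spaces and collapse runs of spaces into one, so that the text ends up looking like "test test test";
- leave XML-encoded characters alone: if the node contains an encoded tab (&#x9;), newline (&#xA;), or space (&#x20;), it should be preserved.
I am trying the code below, but it keeps the repeated whitespace.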
dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringComments( true );
dbf.setNamespaceAware( namespaceAware );
db = dbf.newDocumentBuilder();
doc = db.parse( inputStream );
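Is there any way to do what I want?
Thanks!
Best answer
The first part, replacing multiple whitespace characters, is relatively easy, although I don't think the parser will do it for you: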
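import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
// Evaluating "/" against the InputSource parses it and returns the document node
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);
NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
        XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    Text text = (Text) nodes.item(i);
    // \s+ matches a lone newline or tab as well as longer runs of whitespace
    text.setTextContent(text.getTextContent().replaceAll("\\s+", " "));
}
// check results
TransformerFactory.newInstance()
        .newTransformer()
        .transform(new DOMSource(doc), new StreamResult(System.out));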
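For anyone who wants to run the idea against the sample from the question, here is a minimal self-contained sketch. It walks the DOM directly instead of using XPath, and the class and method names (CollapseWhitespaceDemo, collapsedRootText, collapse) are made up for illustration. It uses \s+ so that a single newline or tab is also turned into a space.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CollapseWhitespaceDemo {

    // Hypothetical helper: parses the XML, collapses whitespace in every
    // text node, and returns the text content of the root element.
    static String collapsedRootText(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            collapse(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Recursively replaces each run of whitespace in text nodes with one space.
    static void collapse(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replaceAll("\\s+", " "));
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collapse(children.item(i));
        }
    }

    public static void main(String[] args) {
        // Same shape as the sample in the question.
        String xml = "<node> test\ntest\ntest\n</node>";
        System.out.println("[" + collapsedRootText(xml) + "]"); // prints "[ test test test ]"
    }
}
```

The result still carries one leading and one trailing space, so call trim() on it if you need exactly "test test test".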
This is the hard part:
If the node contains XML encoded characters: tabs (&#x9;), newlines (&#xA;) or whitespaces (&#x20;) - they should be left.
The parser will always convert "&#x9;" to "\t", so you may need to write your own XML parser.
According to the author of Saxon:
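You can see the expansion for yourself with a small check like the one below. It is only a sketch: CharRefDemo and rootText are names I made up, and it uses the JDK's default DocumentBuilder. By the time the DOM is built, a tab written as a character reference and a tab typed literally are the same text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {

    // Hypothetical helper: parses the XML and returns the root element's text.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String encoded = rootText("<node>a&#x9;b</node>"); // tab as a character reference
        String literal = rootText("<node>a\tb</node>");    // tab typed literally
        System.out.println(encoded.equals(literal));        // prints "true"
    }
}
```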
I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
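If writing a parser is too much, one possible workaround is to hide the references from the parser before it sees them. The sketch below is my own assumption, not something the parser API offers: it swaps numeric references to tab, newline, and space for private-use placeholder characters (U+E009, U+E00A, U+E020) in the raw text, parses, collapses the remaining whitespace, and then maps the placeholders back. All names (PreserveCharRefsDemo, protect, restore, normalizedText) are hypothetical. It assumes the placeholders never occur in your input, and it also rewrites references inside CDATA sections and comments, where they are really literal text, so treat it as a starting point only.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PreserveCharRefsDemo {

    // Numeric character references, hexadecimal (&#x9;) or decimal (&#9;).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9a-fA-F]{1,6}|[0-9]{1,7});");

    // Replaces references to tab, newline, and space with placeholder
    // characters taken from the Unicode private-use area (an assumption).
    static String protect(String rawXml) {
        Matcher m = CHAR_REF.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ref = m.group(1);
            int code = ref.startsWith("x")
                    ? Integer.parseInt(ref.substring(1), 16)
                    : Integer.parseInt(ref);
            boolean keep = code == 0x9 || code == 0xA || code == 0x20;
            String replacement = keep ? String.valueOf((char) (0xE000 + code)) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Maps the placeholders back to the characters they stand for.
    static String restore(String text) {
        return text.replace('\uE009', '\t').replace('\uE00A', '\n').replace('\uE020', ' ');
    }

    // Parses the protected XML, collapses ordinary whitespace in the root
    // element's text, and restores the protected characters afterwards.
    static String normalizedText(String rawXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(protect(rawXml))));
            String text = doc.getDocumentElement().getTextContent();
            return restore(text.replaceAll("\\s+", " "));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<node>a \n b&#x9;&#x9;c&#xA;d</node>";
        // The literal " \n " collapses to one space; the encoded tabs and newline survive.
        System.out.println(normalizedText(xml).equals("a b\t\tc\nd")); // prints "true"
    }
}
```

The placeholders are not matched by \s, which is why the encoded characters come through the collapsing step untouched.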
Source: the Stack Overflow question "java - Remove repeated newlines/tabs/spaces in XML character element": https://stackoverflow.com/questions/23157039/