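I need to parse the following HTML content with the Jsoup parser. The requirement is to strip out some tags and end up with the output shown below. I cannot get the desired output with the code that follows.
Input: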
<html>
<head>
<style type="text/css">
body {
font: 12px Arial, Helvetica, sans-serif
}
tr {
font: 12px Arial, Helvetica, sans-serif;
padding: 0px 0px 0px 10px
}
</style>
</head>
<body>
<p>hello,<br> <br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br>
<img id="logo_GMALE.png" alt="logo GMALE" src="https://www.GMALE.ch/logo.png">
<br><b>Test abc xyz</b><br><br>T +91 98 471 <br>
<a href="mailto:output.test@GMALE.in">output.test@GMALE.in</a><br><br><b>Département Team</b><br><br><b>GMALE Assurances</b><br>StreetName 2<br>Postbox 2100<br>Country<br><br>GMALE.ch<br><br>This is a private email contents.<br><br>This e-mail transmission
is intended for the named addressee(s) only. Its contents are private, confidential and protected from disclosure and should not be read, copied or disclosed by any other person. If you are not the intended recipient, we kindly ask you to notify the
sender immediately and to delete this e-mail.<br><br>
</body>
</html>
Output:
<p>hello,<br> <br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br>
<br><b>Test abc xyz</b><br><br>T +91 98 471 <br>
The code I have so far:
    Document doc = Jsoup.parse(content);
    List<Node> childNodes = doc.select("body").get(0).childNodes();
    System.out.println("Elements : " + childNodes);
    StringBuilder finalContent = new StringBuilder();
    for (Node node : childNodes) {
        if (node instanceof Element) {
            Element subElement = (Element) node;
            if (!subElement.tagName().equals("img")) {
                finalContent.append(subElement);
            }
        } else {
            TextNode textNode = (TextNode) node;
            if (!textNode.getWholeText().startsWith("<a")) {
                finalContent.append(textNode);
            }
        }
    }
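Best answer
Your problem can be restated as follows: parse the body of the given HTML and extract all content until you reach <a href="mailto:output.test@GMALE.in">. Seen from that angle, you can try the following approach: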
    final Document doc = Jsoup.parse(content);
    // Direct children of <body>, excluding <img> elements.
    final Elements elements = doc.select("body > *:not(img)");
    final Iterator<Element> iterator = elements.iterator();
    final StringBuilder finalContent = new StringBuilder();
    Element current;
    // Stop at the end of the list or at the first <a> element.
    while (iterator.hasNext() && !(current = iterator.next()).tagName().equals("a")) {
        finalContent.append(current.outerHtml());
        // Bare text such as the phone number sits in a sibling text node.
        final Node sibling = current.nextSibling();
        if (sibling instanceof TextNode) {
            final String siblingText = ((TextNode) sibling).text().trim();
            if (!siblingText.isEmpty()) {
                finalContent.append(siblingText);
            }
        }
    }
    System.out.println(finalContent);
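First we select all direct children of <body> except <img> with the selector body > *:not(img). Then we iterate over the elements until we reach the end of the list or the first a element. We also check whether the element is followed by a sibling text node with any content. This is the case for the phone number: it is not placed inside any HTML tag, so it is a sibling text node of one of the <br> tags.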
Running this code on the input from the question produces the following output:
<p>hello,<br> <br>We need to dispatch the below documents to you. Thanks for your cooperation.<br><br>Best Regards</p><br><br><b>Test abc xyz</b><br><br>T +91 98 471<br>
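To see why the phone number needs separate handling, it can help to list the direct children of <body> together with their node type. The following is only a minimal sketch, assuming Jsoup is on the classpath; the class BodyChildInspector and the helper describeChildren are hypothetical names made up for this illustration, and the HTML is a shortened version of the input from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class BodyChildInspector {

    // Hypothetical helper: labels each direct child of <body> as bare text or as a named node.
    static List<String> describeChildren(String html) {
        List<String> labels = new ArrayList<>();
        for (Node node : Jsoup.parse(html).body().childNodes()) {
            if (node instanceof TextNode) {
                // Text that is not wrapped in any tag, e.g. the phone number.
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    labels.add("text: " + text);
                }
            } else {
                // For elements, nodeName() is the tag name.
                labels.add("node: " + node.nodeName());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String html = "<body><b>Test abc xyz</b><br><br>T +91 98 471 <br>"
                + "<a href=\"mailto:output.test@GMALE.in\">output.test@GMALE.in</a></body>";
        describeChildren(html).forEach(System.out::println);
    }
}
```

The phone number shows up as a text entry between two br entries, which is why a selector that matches only elements never returns it.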
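Of course, you can define a different rule for stopping the iteration; the code in this answer is only meant to give you a hint. I hope it helps.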
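As one possible variation on the stopping rule, the stop and skip conditions could be passed in as CSS queries and checked with Element.is(...), while walking childNodes() so that elements and bare text are visited in document order. This is a sketch under assumptions rather than a definitive implementation: the class StopRuleDemo, the helper extractUntil and its parameter names are hypothetical, Jsoup is assumed to be on the classpath, and pretty-printing is switched off so the markup is emitted roughly as parsed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class StopRuleDemo {

    // Hypothetical helper: copies the children of <body> until an element matches
    // stopSelector, leaving out elements that match skipSelector.
    static String extractUntil(String html, String skipSelector, String stopSelector) {
        Document doc = Jsoup.parse(html);
        // Avoid re-indentation so text nodes keep their original whitespace.
        doc.outputSettings().prettyPrint(false);
        StringBuilder out = new StringBuilder();
        for (Node node : doc.body().childNodes()) {
            if (node instanceof Element) {
                Element element = (Element) node;
                if (element.is(stopSelector)) {
                    break; // stop rule reached, ignore everything after it
                }
                if (element.is(skipSelector)) {
                    continue; // dropped element, e.g. the logo image
                }
            }
            // Elements and text nodes are both serialized in document order.
            out.append(node.outerHtml());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<body><p>Best Regards</p><br><img src=\"logo.png\">"
                + "<br><b>Test abc xyz</b><br><br>T +91 98 471 <br>"
                + "<a href=\"mailto:output.test@GMALE.in\">output.test@GMALE.in</a><br>Country</body>";
        // Stop at the first mailto link instead of at any <a> element.
        System.out.println(extractUntil(html, "img", "a[href^=mailto:]"));
    }
}
```

Stopping on a[href^=mailto:] rather than on every a element means an ordinary link earlier in the body would not end the extraction; whether that is the right rule depends on how the e-mails are structured.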
Regarding java - Jsoup - parsing selected elements, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46100785/