I use a whitelist as follows:
try {
    Document doc = Jsoup.parse(urls[0], 5000);
    if (doc != null) {
        Whitelist wl = Whitelist.basicWithImages();
        // wl.preserveRelativeLinks(false);
        Cleaner cleaner = new Cleaner(wl);
        cleanedDoc = cleaner.clean(doc);
        if (cleanedDoc != null) {
            whiteListedHtml = cleanedDoc.html();
        }
    }
} catch (IOException e) {
    Log.d(TAG, "exception=" + e.getMessage());
}
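This now comes very close to what I want, except that some div tags whose class contains "nav" or "ad" are filling the page with junk. I want to keep div tags in general, but not when "nav" or "ad" appears in the class.
This got me thinking about subclassing Whitelist. Reading the manual at http://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html, I see addTag() and removeTag() (for some reason removeTag() is not available to me, but that is a separate issue). What I really want is to remove a tag if and only if its class contains certain strings, such as "ad" or "nav". The only method that looks promising is: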
protected boolean isSafeTag(String tag)
Test if the supplied tag is allowed by this whitelist
Parameters:
tag - test tag
Returns:
true if allowed
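So how do I extract the class value from that string in order to test it? And is there any way to do this check without subclassing Whitelist? Right now I am trying this: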
protected boolean isSafeTag(String tag) {
    boolean retVal = true;
    Document doc = Jsoup.parse(tag);
    if (doc.getAllElements().size() > 0) {
        Element e = doc.getAllElements().get(0);
        String attribute = e.attr("class");
        if ((attribute != null) && (attribute.contains("ad") || attribute.contains("nav"))) {
            retVal = false;
        }
    }
    if (!retVal) {
        return false;
    } else {
        return super.isSafeTag(tag);
    }
}
Best answer
Is there any way to do this check without subclassing Whitelist?
One way is to remove the unwanted divs first and then clean the resulting document.
Sample code
String html = "<html><head></head><body><div class=\"ad\">Remove</div><p>Hello word</p><div>Don't remove</div></body></html>";
System.out.println("** BEFORE:\n" + html);

// Drop the unwanted divs before cleaning
Document dirtyDoc = Jsoup.parse(html);
for (Element div : dirtyDoc.select("div.ad, div.nav")) {
    div.remove();
}

Whitelist whitelist = Whitelist //
        .basicWithImages() // your original chosen list
        .addTags("div"); // Without this line, any div will be removed
Cleaner cleaner = new Cleaner(whitelist);
Document cleanedDoc = cleaner.clean(dirtyDoc);
System.out.println("\n** AFTER:\n" + cleanedDoc.html());
Output
** BEFORE:
<html><head></head><body><div class="ad">Remove</div><p>Hello word</p><div>Don't remove</div></body></html>
** AFTER:
<html>
<head></head>
<body>
<p>Hello word</p>
<div>
Don't remove
</div>
</body>
</html>
Jsoup 1.8.3
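One caveat about how the class is matched, sketched here in plain Java with no jsoup dependency. The selector `div.ad` in the sample code matches the class token `ad`, whereas the `attribute.contains("ad")` test in the question matches any substring, so it would also flag classes such as `header` or `download`. The helper below is a hypothetical illustration (`ClassFilter` and `hasBlockedClass` are made-up names, not jsoup API); it assumes the raw value of a `class` attribute is available as a string and compares whole, case-sensitive tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of jsoup.
final class ClassFilter {

    // Returns true if any whitespace-separated token of the class attribute
    // is exactly one of the blocked names (case-sensitive comparison).
    static boolean hasBlockedClass(String classAttr, Set<String> blocked) {
        if (classAttr == null) {
            return false;
        }
        for (String token : classAttr.trim().split("\\s+")) {
            if (blocked.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> blocked = new HashSet<>(Arrays.asList("ad", "nav"));

        // Substring test: "header" contains the letters "ad", so this is a false positive.
        System.out.println("header main".contains("ad"));

        // Token test: neither "header" nor "main" is a blocked class name.
        System.out.println(hasBlockedClass("header main", blocked));

        // Token test: "ad" appears as its own class name.
        System.out.println(hasBlockedClass("sidebar ad", blocked));
    }
}
```

If substring matching really is what is wanted, the simple `contains` check from the question already does that; the token-based version is only preferable when short names like "ad" would otherwise catch unrelated classes.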
The original question, on how jSoup can whitelist tags matching certain class patterns, is on Stack Overflow: https://stackoverflow.com/questions/36144857/