python - Remove all javascript tags and style tags from html with python and the lxml module

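I am parsing an HTML document with the http://lxml.de/ library. So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post leaves all the text behind: it strips the tags without removing the actual script. I have also found the class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is far from clear how to actually use the class to clean a document. Any help, perhaps a short example, would be appreciated!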

Best answer

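Below is an example that does what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in cases like this you want to remove more than just the <script> tags; you also want to get rid of things like onclick=function() attributes on other tags.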

#!/usr/bin/env python

import lxml.html  # import the submodule explicitly rather than relying on a bare "import lxml"
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # activate the javascript filter
cleaner.style = True       # activate the styles & stylesheet filter

# Fetch and parse the page once, then reuse the tree for both printouts
doc = lxml.html.parse('http://www.google.com')

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(doc))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(doc)))
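For experimenting without a network connection, the same idea can be tried on an in-memory string. The sketch below is only an illustration: the sample markup and the names SAMPLE_HTML, stripped_only and cleaned are made up for this example. It also contrasts Cleaner with lxml.etree.strip_elements, which drops the <script> element but leaves the onclick attribute in place. Note that in recent lxml releases the cleaner lives in the separate lxml_html_clean package, which may need to be installed for this import to work.

```python
import lxml.html
from lxml import etree
from lxml.html.clean import Cleaner

# Hypothetical sample markup, not taken from the original question
SAMPLE_HTML = (
    '<div><style>p { color: red; }</style>'
    '<script>alert("hi");</script>'
    '<p onclick="steal()">Hello <b>world</b></p></div>'
)

# Approach 1: strip_elements removes the <script> element (and its text),
# but does not touch event-handler attributes on other tags.
doc = lxml.html.fromstring(SAMPLE_HTML)
etree.strip_elements(doc, 'script', with_tail=False)
stripped_only = lxml.html.tostring(doc, encoding='unicode')

# Approach 2: Cleaner removes scripts, styles and on* attributes together.
# Passing a string to clean_html returns a string.
cleaner = Cleaner(javascript=True, style=True)
cleaned = cleaner.clean_html(SAMPLE_HTML)

print(stripped_only)
print(cleaned)
```

The options are passed as keyword arguments here instead of being set as attributes afterwards; both styles configure the same settings.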

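You can find the list of options you can set in the lxml.html.clean.Cleaner documentation. Some options are booleans that you set to True or False (defaults vary per option), and others take a list, like: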

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between kill and remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
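The difference between the two list options can be checked on a small fragment. This is a minimal sketch with made-up markup (the names snippet and result are illustrative only): h1 is killed together with its text, while p is removed but its content is pulled up into the parent.

```python
from lxml.html.clean import Cleaner

# Hypothetical fragment used only to show kill_tags versus remove_tags
snippet = '<div><h1>Title</h1><p>Keep <em>this</em> text</p></div>'

cleaner = Cleaner(
    kill_tags=['h1'],    # drop the tag and its whole subtree
    remove_tags=['p'],   # drop only the tag, keep its children and text
)
result = cleaner.clean_html(snippet)

print(result)
```

After cleaning, the heading text should be gone, while the paragraph text and the nested <em> element remain inside the enclosing <div>.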

This question and answer (python - Remove all javascript tags and style tags from html with python and the lxml module) come from Stack Overflow: https://stackoverflow.com/questions/8554035/
