python - Extracting CSS and HTML comments from an HTML document (Python)

Tags: python html css beautifulsoup

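I'm using Python to pass HTML code into BeautifulSoup, and my output is getting messed up by HTML comments. I have this Python script to remove HTML comments, but it fails to remove an HTML comment that has CSS comments nested inside it.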

My code:

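from bs4 import BeautifulSoup, Comment

# Read the whole input file into one string
with open('output.txt') as f:
    input_text = f.read()

# Parser named explicitly; the original call relied on the default
soup = BeautifulSoup(input_text, 'html.parser')

# Find every HTML comment node and remove it from the tree
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
for comment in comments:
    comment.extract()

print(soup)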

For example, it removes every HTML comment from my test input except this one:

<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Verdana; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:inherit;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} -->

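Below is a portion of the input, which includes 2 comments that are successfully removed after running my script, as well as the one that is not: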

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="Generator" content="Microsoft Word 15 (filtered medium)"> <!--[if !mso]><style>v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} </style><![endif]--><style><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Verdana; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:inherit;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --></style><!--[if gte mso 9]><xml> <o:shapedefaults v:ext="edit" spidmax="1026" /> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1" /> </o:shapelayout></xml><![endif]--> 
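A possible reason this particular comment survives: in the snippet above it sits inside a `<style>` element, and HTML parsers generally hand the contents of `<style>` back as raw text rather than as a comment node. The sketch below uses Python's standard-library `html.parser` (not BeautifulSoup itself) to illustrate this on a shortened, hypothetical stand-in for the input. `CommentProbe` and `sample` are names made up for the illustration.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Records which parts of the input the parser reports as comments."""

    def __init__(self):
        super().__init__()
        self.comments = []     # text of every <!-- ... --> reported as a comment
        self.style_text = []   # raw text reported inside <style> elements
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        if self._in_style:
            self.style_text.append(data)


# Hypothetical, shortened stand-in for the Word-generated markup above
sample = "<!-- plain comment --><style><!-- p {color:red;} --></style>"

probe = CommentProbe()
probe.feed(sample)
probe.close()

comments = probe.comments
style_text = "".join(probe.style_text)

print(comments)    # only the comment outside <style> is listed
print(style_text)  # the <!-- ... --> inside <style> arrives as plain text
```

BeautifulSoup's `html.parser` backend builds on this class, so a filter on `Comment` would likely not see the text inside `<style>` either. Other backends may differ in detail.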

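I'm not sure of the best way to remove the CSS comments first. I don't need to remove the contents of the CSS comments, only the /* */ delimiters, since the rest should be removed along with the HTML comment it is nested in.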
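One alternative that leaves the CSS comments alone is to drop the `<!-- ... -->` wrapper inside `<style>` before the text reaches the parser. This is a minimal sketch, assuming the goal is only to get rid of that wrapped stylesheet and that the parser would otherwise leave the contents of `<style>` as raw text. The function `drop_comments_in_style` and the sample string are hypothetical, and a regex over HTML is fragile on unusual markup.

```python
import re

# Matches a whole <style> element: opening tag, body, closing tag
STYLE_BLOCK = re.compile(
    r"(<style\b[^>]*>)(.*?)(</style\s*>)", re.IGNORECASE | re.DOTALL
)
# Matches one <!-- ... --> span, including any newlines inside it
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def drop_comments_in_style(html):
    """Remove <!-- ... --> spans that sit inside <style> elements."""
    def _clean(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + HTML_COMMENT.sub("", body) + close_tag
    return STYLE_BLOCK.sub(_clean, html)


# Hypothetical, shortened sample in the shape of the input above
sample = (
    "<head><style><!-- /* Font Definitions */ p {margin:0in;} --></style></head>"
    "<body><!-- ordinary comment --><p>Hello</p></body>"
)

cleaned = drop_comments_in_style(sample)
print(cleaned)
```

Comments outside `<style>` are left untouched here, so they can still be removed afterwards with the `Comment` filter from the script above.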

Best answer

I solved my problem. I removed them with regular expressions; for anyone curious, here is my new code:

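from bs4 import BeautifulSoup, Comment
import re

# Read the whole input file into one string
with open('output.txt') as f:
    input_text = f.read()

# Strip the CSS comment delimiters /* and */ (the comment text itself is kept);
# the asterisk is escaped so that it is matched literally
text = re.sub(r'/\*', '', input_text)
text = re.sub(r'\*/', '', text)

# Parser named explicitly; the original call relied on the default
soup = BeautifulSoup(text, 'html.parser')

# Remove all HTML comments
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
for comment in comments:
    comment.extract()

print(soup)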
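When stripping `/*` and `*/` with `re.sub`, it matters which character is escaped. The pattern `\/*` means "a slash, repeated zero or more times", so it deletes every `/` in the text, including the ones in closing tags and URLs. The pattern `/\*` matches only the literal two-character delimiter. A small sketch of the difference, using a hypothetical sample string:

```python
import re

# Hypothetical sample: a CSS comment, a closing tag and a URL
sample = "/* Font Definitions */ p {margin:0in;}</style> http://example.com/a"

# r"\/*" matches runs of slashes, so every '/' in the text disappears
unescaped = re.sub(r"\/*", "", sample)

# r"/\*" and r"\*/" match only the literal comment delimiters
escaped = re.sub(r"\*/", "", re.sub(r"/\*", "", sample))

print(unescaped)  # no slashes left: the closing tag and the URL are damaged
print(escaped)    # delimiters gone, closing tag and URL intact
```

With the first pattern `</style>` turns into `<style>`, which changes how the document is parsed afterwards.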

Regarding python - Extracting CSS and HTML comments from an HTML document (Python), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30743633/
