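python - Removing non-body text from HTML

Tags: python html css regex parsing

I want to extract the text of some emails that are stored as HTML files. Sometimes the HTML contains extra information, such as CSS styles. Here is an example file: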

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3-html40"><head><!-- Template generated by Exclaimer Mail Disclaimers on 09:48:42 Donnerstag, 2 Mai 2019 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">P.ImprintUniqueID {
    MARGIN: 0cm 0cm 0pt
}
LI.ImprintUniqueID {
    MARGIN: 0cm 0cm 0pt
}
DIV.ImprintUniqueID {
    MARGIN: 0cm 0cm 0pt
}
TABLE.ImprintUniqueIDTable {
    MARGIN: 0cm 0cm 0pt
}
DIV.Section1 {
    page: Section1
}
</style>
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:12.0pt;
    font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:#0563C1;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:#954F72;
    text-decoration:underline;}
p
    {mso-style-priority:99;
    mso-margin-top-alt:auto;
    margin-right:0cm;
    mso-margin-bottom-alt:auto;
    margin-left:0cm;
    font-size:12.0pt;
    font-family:"Times New Roman",serif;}
span.E-MailFormatvorlage18
    {mso-style-type:personal-compose;
    font-family:"Arial",sans-serif;
    color:black;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-size:10.0pt;}
@page WordSection1
    {size:612.0pt 792.0pt;
    margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="DE" link="#0563C1" vlink="#954F72">
<p class="ImprintUniqueID"></p>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%;border-collapse:collapse">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">i.A.
<b>mdfdddfdf</b>&nbsp;<b>fdfdfd</b><o:p></o:p></span></p>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#7D7D7D">Euf<o:p></o:p></span></p>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">&nbsp;<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:18.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:18.75pt">
<p class="MsoNormal"><a href="randomURL" target="''"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:blue;text-decoration:none"><img border="0" width="173" height="25" id="_x0000_i1025" src="cid:image001.jpg@01D500CC.38133950" alt="RANDOM-LOGO"></span></a><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p></o:p></span></p>
</td>
</tr>
<tr style="height:3.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:3.75pt"></td>
</tr>
<tr style="height:39.0pt">
<td style="padding:0cm 0cm 0cm 0cm;height:39.0pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">OLRAIT ....<br>
a name street Strasse<br>
51766 somewhere in , Germany<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:3.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:3.75pt"></td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td nowrap="" style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">Tel:<o:p></o:p></span></p>
</td>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">&#43;another number<o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">my number</span><span style="color:black"><o:p></o:p></span></p>
</td>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">&#43;a number</span><span style="color:black"><o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<table class="MsoNormalTable" border="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:.75pt .75pt .75pt .75pt"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr style="height:3.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:3.75pt"></td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><a href="mailto:randomEMAIL@randomEMAIL.com" title="Click to send email to randomEMAIL"><span style="color:blue"> randomEMAIL </span></a><o:p></o:p></span></p>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><a href="randomURL" title=""><span style="color:blue">www.fawema.com</span></a><o:p></o:p></span></p>
</td>
</tr>
<tr style="height:7.5pt">
<td style="padding:0cm 0cm 0cm 0cm;height:7.5pt"></td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">Geschäftsführer: someNAMESr<br>
Handelsregister: randomURL 71761<br>
<br>
Bitte beachten Sie unsere Hinweise zum Datenschutz unter: <a href="randomURLIzo6I7nmlDCWwpK2F8C-adMC8Us" title="">
<span style="color:blue">www.randomURL.com</span></a><br>
Please find our information about data protection on: <a href="https://randomURL.com/index.php?atp_str=GD7FHAaZldBtYu4ZbiuQ5j0ju1Bz3V_-WJVhfSIvwKpNc7PkjwxvXWJ9N1ZYj4wxICa635o8b7ZYcrVXOGSir15tnxi2soe_ByWg05vb9Nx5D7wE08-DCfJ0za-gv6SH3MYY3OGuT5-ZO-eXZ1T5GbdEbyr5OE5_ofzIU4fCytSlKwS7OVZ6MrqVaMfXfc1AHnwigCkcGUgERcuUj8guuA8BY3huRL1aHmjQWKi1uHwr4CfaTN2qQVZhD9WLXQiuNEItrlQyjzk_NrekGaVk2lhC5JkeAamHHgtEQkvrXEBHVCM6OiMxYjU0YTA3ZDVhMWIjOjojqVc_-neMexkb6m2m7TQMYw" title="">
<span style="color:blue">www.randomURL.com</span></a><o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<p></p>
<p class="ImprintUniqueID">&nbsp;</p>
<font size="1" face="Arial">
<hr>
</font>
<p class="ImprintUniqueID"><br>
<font size="1" face="Arial">Diese E-Mail kann vertrauliche und/oder rechtlich geschützte Informationen enthalten. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten
 Sie diese E-Mail und alle enthaltenen Anhänge. Das Öffnen der Anhänge, das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser E-Mail und des Anhanges sind nicht gestattet.<br>
<br>
</font><font size="1" face="Arial"></p>
<hr>
</font><font face="Arial"><br>
<font size="1">This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure
 or distribution of the material in this e-mail is strictly forbidden.</font></font>
<p></p>
</body>
</html>

I read the text with Beautiful Soup using the following code:

import codecs
from bs4 import BeautifulSoup
f = codecs.open(file, 'rb')
document = BeautifulSoup(f.read().decode('utf-8', 'ignore')).get_text().strip()

After that, I cleaned it with bleach:

document = bleach.clean(document, strip=True)

However, this does not remove the following CSS style text:

<style type="text/css">P.ImprintUniqueID {
    MARGIN: 0cm 0cm 0pt
}
LI.ImprintUniqueID {
    MARGIN: 0cm 0cm 0pt
}
DIV.ImprintUniqueID {
    MARGIN: 0cm 0cm 0pt
}
TABLE.ImprintUniqueIDTable {
    MARGIN: 0cm 0cm 0pt
}
DIV.Section1 {
    page: Section1
}
</style>

I tried to clean it up with a regular expression, but that does not work:

regex = '(?s)<style>(.*?)<\/style>'
pattern = re.compile(regex)
document_clean = re.sub(pattern, '', document)

Any ideas?

Best answer

Note that the <style> tag has a type attribute, i.e.:

<style type="text/css">P.ImprintUniqueID {
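
To make the mismatch concrete, here is a minimal sketch that runs the pattern from the question against two inputs. The sample strings are shortened stand-ins for the question's HTML and the variable names are assumptions made for illustration only.

```python
import re

# Shortened sample based on the question's HTML (only the style block is kept).
html_with_attr = (
    '<style type="text/css">P.ImprintUniqueID {\n'
    '    MARGIN: 0cm 0cm 0pt\n'
    '}\n'
    '</style>'
)
# A bare tag without attributes, for comparison.
html_bare = '<style>p {margin: 0}</style>'

# The pattern from the question: it requires the literal text "<style>".
strict = re.compile(r'(?s)<style>(.*?)<\/style>')

# No match: in the real file "<style" is followed by a space and an attribute.
print(strict.search(html_with_attr))

# A match object: only a tag written exactly as "<style>" is found.
print(strict.search(html_bare))
```

The first call prints None, which is why re.sub leaves the block in place.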

So you need to make the regular expression a little more lenient. For example:

regex = r'(?s)<style[>\s](.*?)</style>'

This ensures that you match the style tag itself and not some other tag whose name merely starts with style, such as <style2> (I made that tag name up).
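
As a rough sketch of how that pattern behaves, the snippet below applies it to a shortened sample. It assumes the substitution runs on the raw HTML, before any tags are stripped with get_text(). The helper name strip_style_blocks and the sample string are hypothetical and not part of the original answer.

```python
import re

# The answer's pattern, written as a raw string so that \s reaches the
# regex engine unchanged.
STYLE_BLOCK = re.compile(r'(?s)<style[>\s](.*?)</style>')


def strip_style_blocks(raw_html):
    """Remove <style ...>...</style> blocks from raw HTML (hypothetical helper)."""
    return STYLE_BLOCK.sub('', raw_html)


# Shortened sample: one style tag with an attribute, one bare style tag,
# and the made-up <style2> tag that must be left alone.
sample = (
    '<head><style type="text/css">P.ImprintUniqueID {\n'
    '    MARGIN: 0cm 0cm 0pt\n'
    '}\n'
    '</style><style>p {margin: 0}</style></head>'
    '<body><style2>kept</style2><p>i.A.</p></body>'
)

cleaned = strip_style_blocks(sample)
print(cleaned)
# <head></head><body><style2>kept</style2><p>i.A.</p></body>
```

Both style blocks are removed while <style2> survives, because the character after "<style" must be ">" or whitespace. The pattern is case-sensitive as written, so an upper-case <STYLE> tag would need the re.IGNORECASE flag.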

This post on "python - Removing non-body text from HTML" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/58816678/