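I have a problem with my HTML parser. I convert emails full of HTML code into nice clean text, and that works except for the "<style> content </style>" part: the stylesheet is not removed at all, and I can't figure out what I'm doing wrong: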
# Remove any HTML code from our raw content
htmlparse = re.sub(r'<.*?>', '', clean) \
    .replace("&nbsp;", '') \
    .replace("&eacute;", 'é') \
    .replace("&egrave;", 'è')
clean_email = htmlparse
What it should actually remove is:
<style> .MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; } .DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none; padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; } .Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; } .GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; } .DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; } .GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent; height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; } .li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; } .TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px; color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px; padding-right: 5px; background-color: #BBD8FF; } .TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif; vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; } .Separator { background-repeat: repeat-x; background-position: center; background: #666666; } .tableDetail { padding: 0 0 0 0; spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; } .style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; } </style>
What it actually does is remove the <style> and </style> tags but leave the entire stylesheet junk in the parsed output:
.MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; } .DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none; padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; } .Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; } .GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; } .DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; } .GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent; height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; } .li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; } .TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px; color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px; padding-right: 5px; background-color: #BBD8FF; } .TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif; vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; } .Separator { background-repeat: repeat-x; background-position: center; background: #666666; } .tableDetail { padding: 0 0 0 0; spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; } .style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; } Hello, this is a test mail.
Can anyone help me?
Thanks in advance, regards
Best answer
First remove the style block itself, content included, then do whatever else you want in a second pass.
import re
some = """
<style>.MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif;
vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; }
.DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none;
padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; }
.Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0;
background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; }
.GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; }
.DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; }
.GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent;
height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif;
padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; }
.li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; }
.TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px;
color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px;
padding-right: 5px; background-color: #BBD8FF; }
.TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif;
vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; }
.Separator { background-repeat: repeat-x; background-position: center; background: #666666; }
.tableDetail { padding: 0 0 0 0;
spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; }
.style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom:
0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; }
</style>
<h1>Hello, this is a test mail.</h1>
"""
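# re.DOTALL lets "." match newlines; the non-greedy ".*?" stops at the first
# </style>, so text between two separate style blocks is not swallowed.
some1 = re.sub(r'<style\b[^>]*>.*?</style>', '', some, flags=re.DOTALL | re.IGNORECASE)
print(some1)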
Result:
I have no name!@sla-334:~/stack_o$ python stack_o_html.py
<h1>Hello, this is a test mail.</h1>
Now do whatever you want with this HTML.
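To show how the two passes might fit together, here is a minimal sketch, assuming Python 3.4+ (for `html.unescape`). The helper name `html_to_text`, the two patterns, and the shortened sample are illustrative assumptions, not the asker's actual code. Regular expressions are fragile on messy HTML, so treat this as a starting point rather than a general parser.

```python
import html
import re

# Pass 1 pattern: a whole <style> element, attributes and content included.
# Non-greedy so that separate style blocks are matched one at a time.
STYLE_RE = re.compile(r'<style\b[^>]*>.*?</style\s*>', re.DOTALL | re.IGNORECASE)
# Pass 2 pattern: any remaining tag.
TAG_RE = re.compile(r'<[^>]+>')


def html_to_text(raw):
    """Hypothetical helper: strip style blocks, then tags, then decode entities."""
    without_style = STYLE_RE.sub('', raw)
    text = TAG_RE.sub('', without_style)
    # html.unescape decodes &nbsp;, &eacute;, &egrave; and other entities,
    # which replaces the chain of .replace() calls.
    text = html.unescape(text)
    # str.split() with no arguments also splits on the non-breaking space.
    return ' '.join(text.split())


sample = """<style type="text/css">
.MailHeader { font: normal 10pt Tahoma; }
</style>
<h1>Hello, this is a test&nbsp;mail.</h1>
<p>Caf&eacute;</p>"""

print(html_to_text(sample))  # Hello, this is a test mail. Café
```

Decoding entities after the tags are stripped means an escaped `&lt;` in the mail body is not mistaken for the start of a tag.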
For more on python - htmlparse fails to strip <style>, see the similar question on Stack Overflow: https://stackoverflow.com/questions/29509216/