python - Remove duplicate emails

Tags: python regex email web-scraping scrapy

I am trying to use a regular expression in scrapy to find all email addresses on a page.

I am using this code:

    item["email"] = re.findall('[\w\.-]+@[\w\.-]+', response.body)

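This works almost perfectly: it grabs all the emails and gives them to me. What I want, however, is for the result to contain no repeats, even when the same email address appears on the page more than once.

I am getting responses like this (which is correct):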

{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}

But I only want to show the unique addresses:

{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}

If you could also explain how to collect only the emails, and not other things such as

'footer-stanford-logo@2x.png'

that would be helpful too.

Thanks everyone!

Best answer

Here is a way to get rid of both the dupes and the 'footer-stanford-logo@2x.png'-like entries in the output:

    import re

    # Email-like pattern; the lookahead rejects image names such as logo@2x.png
    p = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
    test_str = "{'email': ['billy666@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'footer-stanford-logo@2x.png']}"
    # set() drops the repeated addresses; the .png entry is never matched
    print(set(p.findall(test_str)))

See the Python demo.

The regex looks like this (the carets mark the parts added to the original pattern):

    [\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^

See the regex demo.

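The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) rejects every match in which the text after the @ contains a png, jpg, jpeg, or gif extension that ends a word (\b is a word boundary; here it acts as a trailing word boundary).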
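To see the lookahead at work, here is a minimal sketch that runs the pattern against a few strings. The `samples` list is made up for illustration (only the first two values come from the question's output), and `fullmatch` is used just to test each string as a whole.

```python
import re

EMAIL_RE = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')

# Hypothetical sample strings: one real address and three image file names
samples = [
    "billy666@stanford.edu",
    "footer-stanford-logo@2x.png",
    "banner@3x.jpeg",
    "icon@2x.gif",
]

# Keep only the strings the pattern accepts in full
matched = [s for s in samples if EMAIL_RE.fullmatch(s)]
print(matched)  # ['billy666@stanford.edu']
```

The image names fail because, right after the @, the lookahead finds an image extension followed by a word boundary, so the match is abandoned at that position.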

The dupes can be removed easily with set; that is the least troublesome part here.
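One thing to keep in mind is that a set does not keep the order in which the addresses were found. If the order matters, a possible alternative (not part of the original answer) is `dict.fromkeys`, which keeps the first occurrence of each value in insertion order on Python 3.7+. A minimal sketch, using the addresses from the question:

```python
# Addresses in the order re.findall would return them
emails = [
    "billy666@stanford.edu",
    "cantorfamilies@stanford.edu",
    "cantorfamilies@stanford.edu",
    "cantorfamilies@stanford.edu",
]

# dict keys are unique and keep insertion order, so this is an ordered de-dup
unique = list(dict.fromkeys(emails))
print(unique)  # ['billy666@stanford.edu', 'cantorfamilies@stanford.edu']
```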

Final solution:

    item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
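On Python 3, `response.body` is `bytes`, and `re.findall` raises a `TypeError` when a `str` pattern is applied to bytes. The sketch below is one way to work around that. The helper name `extract_unique_emails` is hypothetical, and decoding as UTF-8 is an assumption about the page's encoding.

```python
import re

EMAIL_RE = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')

def extract_unique_emails(body):
    """Return the unique email-like strings in body, sorted for stable output.

    Hypothetical helper: accepts str or bytes; bytes are assumed to be UTF-8.
    """
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    return sorted(set(EMAIL_RE.findall(body)))

page = b"billy666@stanford.edu cantorfamilies@stanford.edu cantorfamilies@stanford.edu footer-stanford-logo@2x.png"
print(extract_unique_emails(page))
# ['billy666@stanford.edu', 'cantorfamilies@stanford.edu']
```

In a spider callback this could be used as `item["email"] = extract_unique_emails(response.text)`, since Scrapy text responses expose the decoded body as `response.text`.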

For more on python - Remove duplicate emails, see the similar question on Stack Overflow: https://stackoverflow.com/questions/36658427/
