I am trying to use a regular expression in Scrapy to find all email addresses on a page.
I am using this code:
item["email"] = re.findall('[\w\.-]+@[\w\.-]+', response.body)
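This works almost perfectly: it grabs all the emails and gives them to me. However, what I want is for it not to give me repeats, even when the same email address appears more than once on the page.
I am getting responses like this (which is correct):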
{'email': ['billy666@stanford.edu',
 'cantorfamilies@stanford.edu',
 'cantorfamilies@stanford.edu',
 'cantorfamilies@stanford.edu',
 'footer-stanford-logo@2x.png']}
But I only want to show the unique addresses, which would be:
{'email': ['billy666@stanford.edu',
 'cantorfamilies@stanford.edu',
 'footer-stanford-logo@2x.png']}
If you want to throw in how to collect only the emails and not things like 'footer-stanford-logo@2x.png', that would be helpful too.
Thanks everyone!
Best answer
Here is how you can get rid of the dupes and of 'footer-stanford-logo@2x.png'-like entries in your output:
import re
# The negative lookahead after "@" rejects image file names such as logo@2x.png
p = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666@stanford.edu',\n 'cantorfamilies@stanford.edu',\n 'cantorfamilies@stanford.edu',\n 'cantorfamilies@stanford.edu',\n 'footer-stanford-logo@2x.png']}"
# set() drops the duplicate matches
print(set(p.findall(test_str)))
See the Python demo.
The regex looks like
[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^
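See the demo.
The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) disallows every match whose part after the @ ends in a png, jpg, jpeg, or gif extension at the end of a word (\b is a word boundary; in this case, a trailing word boundary).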
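To see the lookahead in action, here is a minimal sketch that reuses the pattern from the answer. The helper name `find_emails`, its `ignore_case` parameter, and the sample strings are assumptions made for illustration. It also shows one case the pattern as written does not cover: the extension check is case-sensitive, so an upper-case `.PNG` gets through unless `re.IGNORECASE` is passed.

```python
import re

# Pattern copied from the answer above.
PATTERN = r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b'


def find_emails(text, ignore_case=False):
    """Return all address-like matches, skipping image file names.

    find_emails and ignore_case are hypothetical names for this sketch.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.findall(PATTERN, text, flags)


sample = "billy666@stanford.edu footer-stanford-logo@2x.png pic@site.jpeg"
print(find_emails(sample))  # only the real address survives

upper = "logo@2x.PNG"
print(find_emails(upper))                    # still matched: check is case-sensitive
print(find_emails(upper, ignore_case=True))  # rejected once case is ignored
```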
The dupes can be removed easily with set; that is the least troublesome part here.
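One thing to keep in mind is that a set does not keep the order in which the addresses were found. If the order matters, a possible alternative (not part of the original answer) is to deduplicate through dict keys, which preserve insertion order on Python 3.7 and later. The helper name `unique_in_order` is made up for this sketch.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


matches = [
    'billy666@stanford.edu',
    'cantorfamilies@stanford.edu',
    'cantorfamilies@stanford.edu',
    'cantorfamilies@stanford.edu',
    'footer-stanford-logo@2x.png',
]
print(unique_in_order(matches))
```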
Final solution:
item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
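A note for Python 3 users: there `response.body` is a bytes object, and applying a str pattern to bytes raises a TypeError. Scrapy's text responses expose the decoded page as `response.text`, which avoids that. The sketch below accepts either type so it can be tried outside a spider; the helper name `extract_emails` and the UTF-8 default are assumptions, not something the answer prescribes.

```python
import re

EMAIL_RE = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')


def extract_emails(body, encoding="utf-8"):
    """Return the sorted unique addresses found in a str or bytes page body.

    extract_emails is a hypothetical helper; the utf-8 default is an assumption.
    """
    if isinstance(body, bytes):
        # A str pattern cannot search bytes, so decode first.
        body = body.decode(encoding, errors="replace")
    return sorted(set(EMAIL_RE.findall(body)))


page = b"<p>billy666@stanford.edu</p><p>billy666@stanford.edu</p><img src='logo@2x.png'>"
print(extract_emails(page))

# In a spider callback this might be used as:
#     item["email"] = extract_emails(response.text)
```

Returning a sorted list instead of a set keeps the item value stable between runs, which can make exported output easier to compare.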
Regarding python - removing duplicate emails, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36658427/