python - How to remove duplicate results from a regular expression (re) in Python

Tags: python expression

I have a string:

str = 'Please Contact Prof. Zheng Zhao: <a href="mailto:zheng.z@xxx.com">Zheng.Z@xxx.com</a> for details, or our HR: john.will@xxx.com'

I want to extract all the email addresses in that string, so I wrote:

p = r'[\w\.]+@[\w\.]+'
re.findall(p, str)

The result is:

['zheng.z@xxx.com', 'Zheng.Z@xxx.com', 'john.will@xxx.com']

Obviously, the first and second are duplicates, since they differ only in case. How can we prevent this?

Best answer

You can use a set to remove duplicates. A set is like an unordered list that cannot contain duplicates. In this case you don't care about case, so converting the results to lowercase lets you check for duplicates correctly.

import re

s = 'Please Contact Prof. Zheng Zhao: <a href="mailto:zheng.z@xxx.com">Zheng.Z@xxx.com</a> for details, or our HR: john.will@xxx.com'

p = r'[\w\.]+@[\w\.]+'
list(set(result.lower() for result in re.findall(p, s)))
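
Because a set is unordered, the list built above may come back in any order, and every address is lowercased. If the order of first appearance matters, one possible variation is sketched below. The helper name unique_emails is hypothetical (it is not part of the original answer), and the sample string is an illustrative stand-in for the one in the question.

```python
import re

def unique_emails(text, pattern=r'[\w.]+@[\w.]+'):
    """Return matches in first-seen order, treating case variants as duplicates.

    Hypothetical helper for illustration; not from the original answer.
    """
    seen = set()      # lowercased addresses already emitted
    unique = []       # results in order of first appearance
    for match in re.findall(pattern, text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            unique.append(match)  # keep the spelling of the first occurrence
    return unique

# Illustrative sample input (assumed, modeled on the question's string)
s = ('Please Contact Prof. Zheng Zhao: '
     '<a href="mailto:zheng.z@xxx.com">Zheng.Z@xxx.com</a> '
     'for details, or our HR: john.will@xxx.com')

print(unique_emails(s))
```

With this sample input the call prints ['zheng.z@xxx.com', 'john.will@xxx.com']. The trade-off is a few more lines in exchange for a stable order and the original capitalization of the first match.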

Regarding python - How to remove duplicate results from a regular expression (re) in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46312503/
