I have a string:
str = 'Please Contact Prof. Zheng Zhao: <a href="mailto:zheng.z@xxx.com">Zheng.Z@xxx.com</a> for details, or our HR: john.will@xxx.com'
I want to extract all the email addresses from this string, so I use:
p = r'[\w\.]+@[\w\.]+'
re.findall(p, str)
The result is:
['zheng.z@xxx.com', 'Zheng.Z@xxx.com', 'john.will@xxx.com']
Clearly, the first and second results are duplicates that differ only in letter case. How can we prevent this?
Best answer
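You can use a set to remove the duplicates. A set is like an unordered list that cannot contain duplicates. In this case you don't care about letter case, so converting each result to lowercase before adding it to the set lets the duplicates be detected correctly.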
import re
s = 'Please Contact Prof. Zheng Zhao: <a href="mailto:zheng.z@xxx.com">Zheng.Z@xxx.com</a> for details, or our HR: john.will@xxx.com'
p = r'[\w\.]+@[\w\.]+'
list(set(result.lower() for result in re.findall(p, s)))
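Because a set is unordered, the list produced above may come back in any order. If the order of first appearance matters, one possible variation is to deduplicate with dictionary keys instead, since dicts keep insertion order in Python 3.7+. The sketch below is only an illustration: the function name unique_emails and the sample string are made up for this example and are not part of the original answer.

```python
import re

def unique_emails(text):
    """Return e-mail-like matches from text, lowercased, in order of first appearance."""
    pattern = r'[\w.]+@[\w.]+'  # same character classes as the pattern used above
    lowered = (match.lower() for match in re.findall(pattern, text))
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops duplicates without shuffling the results
    return list(dict.fromkeys(lowered))

# Hypothetical sample input, for illustration only
sample = ('Please Contact Prof. Zheng Zhao: '
          '<a href="mailto:zheng.z@xxx.com">Zheng.Z@xxx.com</a> '
          'for details, or our HR: john.will@xxx.com')
print(unique_emails(sample))  # ['zheng.z@xxx.com', 'john.will@xxx.com']
```

As with the set-based version, lowercasing changes the returned strings, so this only fits when the original capitalization does not need to be preserved.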
Regarding "python - How to remove duplicated results of regular expression (re) in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46312503/