python - Parsing email headers with regular expressions in Python

Tags: python regex email-parsing

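I'm a Python beginner trying to extract data from email headers. I have thousands of emails in a text file, and from each message I want to extract the sender's address, the recipient address(es), and the date, and write them to a new file as a single semicolon-separated line.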

It's ugly, but here is what I came up with:

import re

emails = open("demo_text.txt","r") #opens the file to analyze
results = open("results.txt","w") #creates new file for search results

resultsList = []

for line in emails:
    if "From - " in line: #recognizes the beginning of an email message and adds a linebreak
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")        
    if "From: " in line:
        address = re.findall(r'[\w.-]+@[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line: #avoids confusion with 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+@[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")

for result in resultsList:
    results.writelines(result)


emails.close()
results.close()

Here is my "demo_text.txt":

From - Sun Jan 06 19:08:49 2013
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Delivered-To: somebody_1@hotmail.com
Received: by 10.48.48.3 with SMTP id v3cs417003nfv;
        Mon, 15 Jan 2007 10:14:19 -0800 (PST)
Received: by 10.65.211.13 with SMTP id n13mr5741660qbq.1168884841872;
        Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Return-Path: <nobody@hotmail.com>
Received: from bay0-omc3-s21.bay0.hotmail.com (bay0-omc3-s21.bay0.hotmail.com [65.54.246.221])
        by mx.google.com with ESMTP id e13si6347910qbe.2007.01.15.10.13.58;
        Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Received-SPF: pass (google.com: domain of nobody@hotmail.com designates 65.54.246.221 as permitted sender)
Received: from hotmail.com ([65.54.250.22]) by bay0-omc3-s21.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668);
         Mon, 15 Jan 2007 10:13:48 -0800
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
         Mon, 15 Jan 2007 10:13:47 -0800
Message-ID: <BAY115-F12E4E575FF2272CF577605A1B50@phx.gbl>
Received: from 65.54.250.200 by by115fd.bay115.hotmail.msn.com with HTTP;
        Mon, 15 Jan 2007 18:13:43 GMT
X-Originating-IP: [200.122.47.165]
X-Originating-Email: [nobody@hotmail.com]
X-Sender: nobody@hotmail.com
From: =?iso-8859-1?B?UGF1bGEgTWFy7WEgTGlkaWEgRmxvcmVuemE=?=
 <nobody@hotmail.com>
To: somebody_1@hotmail.com, somebody_2@gmail.com, 3_nobodies@yahoo.com.ar
Bcc: 
Subject: fotos
Date: Mon, 15 Jan 2007 18:13:43 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_d98_1c4f_3aa9"
X-OriginalArrivalTime: 15 Jan 2007 18:13:47.0572 (UTC) FILETIME=[E68D4740:01C738D0]
Return-Path: nobody@hotmail.com

The output is:

somebody_1@hotmail.com;somebody_2@gmail.com;3_nobodies@yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;

This output would be fine, except that the "From:" field in demo_text.txt (line 24) contains a line break, so I miss "nobody@hotmail.com".

I'm not sure how to tell my code to skip over the line break and still find the email address in the From: field.

More generally, I'm sure there are many more sensible ways to do this. If anyone could point me in the right direction, I'd appreciate it.
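
For readers who want to stay with the line-by-line regex approach, one possible way to cope with the line break is to "unfold" the headers first: a header line that continues onto the next line starts that continuation with a space or tab, so the break can be collapsed before matching. The sketch below is only an illustration under that assumption; `unfold_headers` is a hypothetical helper name and the sample text uses made-up example.com addresses rather than the data above.

```python
import re

def unfold_headers(text):
    """Join continuation lines (starting with a space or tab) onto the previous line."""
    # A newline followed by whitespace marks a folded header; collapse it to one space.
    return re.sub(r"\r?\n[ \t]+", " ", text)

# Hypothetical sample modeled on the folded From: header in the question.
sample = (
    "From: =?iso-8859-1?B?UGF1bGE=?=\n"
    " <sender@example.com>\n"
    "To: recipient@example.com\n"
)

for line in unfold_headers(sample).splitlines():
    if line.startswith("From: "):
        # The address is now on the same line as the "From: " tag.
        print(re.findall(r"[\w.-]+@[\w.-]+", line))
```

Note that this sketch is meant for the header block only; applied to a whole file it would also join indented lines inside message bodies.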

Best answer

Your demo text is effectively in mbox format, which can be handled cleanly with the appropriate object from the mailbox module:

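from mailbox import mbox
import re

# Simple pattern for address-like tokens; not a full RFC 5322 validator.
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")

mymbox = mbox("demo_text.txt")
for email in mymbox.values():
    # A header folded over several lines comes back as a single value,
    # so the address on the continuation line is found as well.
    from_address = PAT_EMAIL.findall(str(email["from"] or ""))
    to_address = PAT_EMAIL.findall(str(email["to"] or ""))
    date = [email["date"] or ""]
    print(";".join(from_address + to_address + date))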
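
As a possible variation on the code above, the regular expression can be replaced with `email.utils.getaddresses` from the standard library, which parses address lists directly. The following is a minimal sketch, not the answerer's method: `summarize` and `SAMPLE` are hypothetical names, and the sample message uses made-up example.com addresses written to a temporary file so the block runs on its own.

```python
import mailbox
import os
import tempfile
from email.utils import getaddresses

# Hypothetical one-message sample in mbox form, with a folded From: header.
SAMPLE = (
    "From - Sun Jan 06 19:08:49 2013\n"
    "From: Example Sender\n"
    " <sender@example.com>\n"
    "To: first@example.com, second@example.org\n"
    "Date: Mon, 15 Jan 2007 18:13:43 +0000\n"
    "\n"
    "body\n"
)

def summarize(mbox_path):
    """Return one 'from;to...;date' string per message in the mbox file."""
    rows = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            # getaddresses yields (display name, address) pairs; keep the address.
            senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
            recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
            rows.append(";".join(senders + recipients + [msg.get("Date", "")]))
    finally:
        box.close()
    return rows

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_text.txt")
    with open(path, "w") as fh:
        fh.write(SAMPLE)
    # One semicolon-separated line per message, as asked for in the question.
    print("\n".join(summarize(path)))
```

To produce the results file from the question, the returned strings could be written out one per line instead of printed.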

Regarding python - Parsing email headers with regular expressions in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15054503/
