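I have a mailbox file containing over 50 MB of messages, separated by lines like this:
From - Thu Jul 19 07:11:55 2007
I want to build a regex in Java that extracts each mail message one at a time, so I tried using a Scanner with the following pattern as the delimiter: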
public boolean ParseData(DataSource data_source) {
    boolean is_successful_transfer = false;
    String mail_header_regex = "^From\\s";
    LinkedList<String> ip_addresses = new LinkedList<String>();
    ASNRepository asn_repository = new ASNRepository();
    try {
        Pattern mail_header_pattern = Pattern.compile(mail_header_regex);
        File input_file = data_source.GetInputFile();
        //parse out each message from the mailbox
        Scanner scanner = new Scanner(input_file);
        while(scanner.hasNext(mail_header_pattern)) {
            String current_line = scanner.next(mail_header_pattern);
            Matcher mail_matcher = mail_header_pattern.matcher(current_line);
            //read each mail message and extract the proper "received from" ip address
            //to put it in our list of ip's we can add to the database to prepare
            //for querying.
            while(mail_matcher.find()) {
                String message_text = mail_matcher.group();
                String ip_address = get_ip_address(message_text);
                //empty ip address means the line contains no received from
                if(!ip_address.trim().isEmpty())
                    ip_addresses.add(ip_address);
            }
        }//next line
        //add ip addresses from mailbox to database
        is_successful_transfer = asn_repository.AddIPAddresses(ip_addresses);
    }
    //error reading file--unsuccessful transfer
    catch(FileNotFoundException ex) {
        is_successful_transfer = false;
    }
    return is_successful_transfer;
}
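This looks like it should work, but whenever I run it the program hangs, probably because it can't find the pattern. The same regex works in Perl on the same file, but in Java it always gets stuck on String current_line = scanner.next(mail_header_pattern);
Is this regex correct, or am I parsing the file incorrectly?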
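One thing worth checking, offered as a hedged sketch rather than a confirmed diagnosis of the hang: Scanner.hasNext(Pattern) tests the next complete token, and with the default delimiter tokens are split on whitespace. A token therefore never contains the whitespace character that "^From\\s" requires. The class and method names below (ScannerTokenDemo, nextTokenMatches) are made up for illustration.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerTokenDemo {
    // Reports whether the scanner's next whitespace-delimited token
    // matches the given regex in full (hypothetical helper).
    static boolean nextTokenMatches(String input, String regex) {
        try (Scanner scanner = new Scanner(input)) {
            return scanner.hasNext(Pattern.compile(regex));
        }
    }

    public static void main(String[] args) {
        String separator = "From - Thu Jul 19 07:11:55 2007";
        // The first token is "From" with no trailing whitespace,
        // so a pattern that demands \s cannot match it.
        System.out.println(nextTokenMatches(separator, "^From\\s"));
        // Without the \s, the token "From" matches.
        System.out.println(nextTokenMatches(separator, "From"));
    }
}
```

If this is what happens with the real file, the while loop in ParseData would never be entered at all, which suggests the pattern-based token calls are not the right tool for splitting on a separator line.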
Best answer
I'd lean toward something simpler and just read the file line by line, like this:
while(scanner.hasNextLine()) {
    String line = scanner.nextLine();
    if (line.matches("^From\\s.*")) {
        // it's a new email
    } else {
        // it's still part of the email body
    }
}
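To make the loop above concrete, here is a minimal self-contained sketch that collects each message into a list. It assumes every message starts with a separator line beginning with "From " and uses startsWith instead of the regex, which is a simplification (it does not accept a tab after "From"). The names MboxSplitter and splitMessages are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class MboxSplitter {
    // Reads lines from the scanner and groups them into messages.
    // A line starting with "From " is treated as the start of a new message;
    // any lines before the first separator are ignored.
    static List<String> splitMessages(Scanner scanner) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.startsWith("From ")) {
                // Close the previous message, if any, and start a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        // Flush the last message once the input is exhausted.
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String sample = "From - Thu Jul 19 07:11:55 2007\n"
                + "Received: from example\n"
                + "From - Thu Jul 19 08:00:00 2007\n"
                + "Subject: second\n";
        List<String> messages = splitMessages(new Scanner(sample));
        System.out.println(messages.size() + " messages found");
    }
}
```

For the real mailbox, the same method could be given new Scanner(input_file), and each returned message could then be passed to get_ip_address. For a file of 50 MB or more, processing each message as it is completed instead of keeping the whole list in memory may be preferable.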
Regarding java - a regular expression for mail header messages, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6804188/