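I'm building a local events calendar that takes RSS feeds and website scrapes and extracts event dates from them.
I previously asked how to extract dates from text in PHP, and got a good answer from MarcDefiant: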
function parse_date_tokens($tokens) {
    # only try to extract a date if we have 2 or more tokens
    if(!is_array($tokens) || count($tokens) < 2) return false;
    return strtotime(implode(" ", $tokens));
}

function extract_dates($text) {
    static $patterns = Array(
        '/^[0-9]+(st|nd|rd|th|)?$/i', # day
        '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month
        '/^20[0-9]{2}$/', # year
        '/^of$/' #words
    );
    # defines which of the above patterns aren't actually part of a date
    static $drop_patterns = Array(
        false,
        false,
        false,
        true
    );
    $tokens = Array();
    $result = Array();
    $text = str_word_count($text, 1, '0123456789'); # get all words in text
    # iterate words and search for matching patterns
    foreach($text as $word) {
        $found = false;
        foreach($patterns as $key => $pattern) {
            if(preg_match($pattern, $word)) {
                if(!$drop_patterns[$key]) {
                    $tokens[] = $word;
                }
                $found = true;
                break;
            }
        }
        if(!$found) {
            $result[] = parse_date_tokens($tokens);
            $tokens = Array();
        }
    }
    $result[] = parse_date_tokens($tokens);
    return array_filter($result);
}

# test
$texts = Array(
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
);
$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
    echo " " . date('d.m.Y H:i:s', $date) . "\n";
}
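However, that solution has some drawbacks. For one, it cannot match date ranges.
I'm now looking for a more sophisticated solution that can extract dates, times and date ranges from the sample text.
What is the best way to do this? I seem to be leaning toward a series of regular expressions run one after another to catch each case. I can't see a better way to capture date ranges in particular, but I know there must be one. Is there a library dedicated to date parsing in PHP?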
Sample dates and date ranges, as requested:
$dates = [
    " Saturday 28th December",
    "2013/2014",
    "Friday 10th of January",
    "Thursday 19th December",
    " on Sunday the 15th December at 1 p.m",
    "On Saturday December 14th ",
    "On Saturday December 21st at 7.30pm",
    "Saturday, March 21st, 9.30 a.m.",
    "Jan-April 2014",
    "January 21st - Jan 24th 2014",
    "Dec 30th - Jan 3rd, 2014",
    "February 14th-16th, 2014",
    "Mon 14 - Wed 16 April, 12 - 2pm",
    "Sun 13 April, 8pm",
    "Mon 21 - Wed 23 April",
    "Friday 25 April, 10 – 3pm",
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
];
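The function I currently use (not the one above) is about 90% accurate. It can capture date ranges, but it struggles when a time is also specified. It uses a list of regular expressions and is very complicated.
Update: 6 January 2014
I'm writing code to do this, using my original approach of a series of regular expressions run one after another. I think I'm close to a working solution that can extract almost any date/time range/format from a piece of text. When I'm done, I'll post it here as an answer.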
Best answer
I think you can summarize the regular expressions from your question like this:
(?<date_format_1>(?<day>(?i)\b\s*[0-9]+(?:st|nd|rd|th|)?)(?<month>(?i)\b\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|etc))(?<year>\b\s*20[0-9]{2}) ) |
(?<date_format_2>(?&month)(?&day)(?!\s+-)) |
(?<date_format_3>(?&day)\s+of\s+(?&month)) |
(?<range_type_1>(?&month)(?&day)\s+-\s+(?&day))
Flags: x
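As a rough sketch of how the pattern above might be run from PHP. Two things here are my assumptions and not part of the pattern as posted: the month list is filled in (the pattern above abbreviates it with etc), and each subpattern gets a trailing \b. The trailing \b keeps the engine from shortening a day such as 9th to 9, which can happen on PCRE versions that backtrack into subroutine calls. find_dates is a hypothetical helper name.

```php
<?php
// Sketch only. Assumptions: completed month list, trailing \b on each part.
$pattern = '/
    (?<date_format_1>
        (?<day>(?i)\b\s*[0-9]+(?:st|nd|rd|th)?\b)
        (?<month>(?i)\b\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b)
        (?<year>\b\s*20[0-9]{2}\b)
    ) |
    (?<date_format_2>(?&month)(?&day)(?!\s+-)) |
    (?<date_format_3>(?&day)\s+of\s+(?&month)) |
    (?<range_type_1>(?&month)(?&day)\s+-\s+(?&day))
/x';

// Returns a list of [alternative name, matched text] pairs.
function find_dates(string $text, string $pattern): array {
    $found = [];
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        foreach (['date_format_1', 'date_format_2', 'date_format_3', 'range_type_1'] as $name) {
            // Alternatives that did not take part are empty or missing.
            if (($m[$name] ?? '') !== '') {
                // The subpatterns start with \s*, so drop leading whitespace.
                $found[] = [$name, trim($m[$name])];
            }
        }
    }
    return $found;
}

$texts = [
    'The focus of the seminar, on Saturday 2nd February 2013 will be',
    'Valentines Special @ The Radisson, Feb 14th',
    'On Friday the 15th of February, a special Hollywood themed',
    'Hosting a craft workshop March 9th - 11th in the old',
];
foreach ($texts as $text) {
    foreach (find_dates($text, $pattern) as [$type, $value]) {
        echo $type, ': ', $value, "\n";
    }
}
```

Checking which named group is non-empty tells you which format matched, so each format can then be handed to its own conversion step.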
Discussion
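By using recursive subpatterns (the (?&name) calls), you can reduce the complexity of the final regular expression.
I used a negative lookahead in date_format_2 because otherwise it would partially match range_type_1. You may need to add more range types depending on your data. Don't forget to check the other alternatives for partial matches as well.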
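To illustrate the partial-match problem, here is a small sketch with two reduced patterns. Both list the single-date alternative before the range alternative, and only the first guards it with the negative lookahead. The group names single and range, the shortened month list, and the trailing \b on each part are assumptions for this example only.

```php
<?php
// Single date guarded by a negative lookahead, then the range alternative.
$with_lookahead = '/
    (?<single>
        (?<month>\b\s*(?:Feb(?:ruary)?|Mar(?:ch)?)\b)
        (?<day>\b\s*[0-9]+(?:st|nd|rd|th)?\b)
        (?!\s+-)
    ) |
    (?<range>(?&month)(?&day)\s+-\s+(?&day))
/xi';

// Same pattern without the lookahead.
$without_lookahead = '/
    (?<single>
        (?<month>\b\s*(?:Feb(?:ruary)?|Mar(?:ch)?)\b)
        (?<day>\b\s*[0-9]+(?:st|nd|rd|th)?\b)
    ) |
    (?<range>(?&month)(?&day)\s+-\s+(?&day))
/xi';

$text = 'Hosting a craft workshop March 9th - 11th in the old [...]';

preg_match($with_lookahead, $text, $a);
preg_match($without_lookahead, $text, $b);

echo trim($a[0]), "\n"; // March 9th - 11th
echo trim($b[0]), "\n"; // March 9th
```

Without the lookahead, the single-date alternative wins first and the range alternative is never tried at that position. Listing the range alternative first would be another way to get the same effect.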
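Another solution is to build small regular expressions in separate string variables and then concatenate them in PHP to build the larger regular expression.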
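A minimal sketch of that concatenation approach, assuming a handful of formats taken from the sample strings in the question. The variable names, the group names (range, day_month, month_day), the completed month list and the helper first_date are all hypothetical. It does not cover times or ranges that span two months.

```php
<?php
// Small building blocks, kept in separate strings.
$day   = '[0-9]{1,2}(?:st|nd|rd|th)?\b';
$month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
       . '|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b';
$year  = '20[0-9]{2}\b';
$opt_year = '(?:,?\s+' . $year . ')?';

// Larger expressions assembled from the blocks.
$range     = '\b' . $month . '\s+' . $day . '\s*-\s*' . $day . $opt_year; // February 14th-16th, 2014
$day_month = '\b' . $day . '\s+(?:of\s+)?' . $month . '(?:\s+' . $year . ')?'; // 15th of February
$month_day = '\b' . $month . '\s+' . $day . $opt_year; // Feb 14th

// Most specific alternative first, so a range is not cut short.
$pattern = '/(?<range>' . $range . ')'
         . '|(?<day_month>' . $day_month . ')'
         . '|(?<month_day>' . $month_day . ')/i';

// Returns [alternative name, matched text] for the first date found, or null.
function first_date(string $pattern, string $text): ?array {
    if (!preg_match($pattern, $text, $m)) {
        return null;
    }
    foreach (['range', 'day_month', 'month_day'] as $name) {
        if (($m[$name] ?? '') !== '') {
            return [$name, $m[$name]];
        }
    }
    return null;
}

$samples = [
    'February 14th-16th, 2014',
    'Hosting a craft workshop March 9th - 11th in the old [...]',
    'The focus of the seminar, on Saturday 2nd February 2013 will be [...]',
    'Valentines Special @ The Radisson, Feb 14th',
];
foreach ($samples as $sample) {
    $hit = first_date($pattern, $sample);
    echo $hit === null ? 'no match' : $hit[0] . ': ' . $hit[1], "\n";
}
```

Keeping the pieces in variables means a new format is one more short string, and the month list only has to be corrected in one place.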
Regarding "php - Extract dates, times and date ranges from text in PHP", the original question is on Stack Overflow: https://stackoverflow.com/questions/20837939/