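python - Regex to match date ranges that include month names

Tags: python regex string date

I need to match a string to determine whether it is a valid date range. My string can contain months written as text and years written as numbers, in no particular order (there is no fixed format such as MM-YYYY-DD).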

A valid string could be:

February 2016 - March 2019

September 2015 to August 2019

April 2015 to present

September 2018 - present

Invalid strings:

George Mason University august 2019

Stratusburg university February 2018

Some text and month followed by year

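I have already looked at questions such as a) Constructing Regular Expressions to match numeric ranges

b) Regex to match month name followed by year

and many others, but most of the input strings in those questions seem to follow some fixed month and year pattern, which mine do not.

I tried this regex in Python: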

import re

pat = r"(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}(st|nd|rd|th)?)?(([,.\-\/])\D?)?((19[7-9]\d|20\d{2})|\d{2})*"

st =  "University of Pennsylvania February 2018"

re.search(pat, st)

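But this matches both the valid and the invalid strings from my examples, and I want to keep the invalid strings out of my final output.

For the input "University of Pennsylvania February 2018", the expected output should be False.

For "February 2018 to present", the output must be True.

Best answer

This regex validates date ranges that follow the format MONTH YEAR (MONTH YEAR | PRESENT)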

import re
# the first line deliberately contains two valid ranges to make the test harder
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# writing the regex on a single line would make it very ugly, so build it from parts
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?',
             'Aug(?:ust)?', 'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# match a month name and capture it: (Jan(?:uary)?|Feb(?:ruary)?|...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# matches a month followed by a 2- or 4-digit year; used twice in the final regex
RE_DATE = r'{RE_MONTH}\s+(\d{{4}}|\d{{2}})'.format(RE_MONTH=RE_MONTH)
# final regex
RE_VALID_RANGE = re.compile(r'{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)


# if you want to extract both the valid and the invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # do something with the ranges here if you need to;
            # groups holds one tuple per valid range found in the line (1 or 2 in this sample)
            # every tuple has 5 elements because there are 5 capturing groups
            # either M2 and Y2 are filled and present is empty, or the reverse (because of (?:{RE_DATE}|(present)))
            M1, Y1, M2, Y2, present = groups[0]  # loop over groups if you want to verify every range
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)

print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)


# this yields only the valid ranges; a line containing two ranges yields two matches
# if you need to classify whole lines, this is not what you want
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # if you want to check the ranges
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the text is returned here
print('VALID USING <finditer>: ',  valid_ranges)

Output:

VALID:  ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>:  ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']

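I hate writing a long regular expression in a single str variable; I prefer to break it into parts so that I can still understand what it does when I read my code six months later. Note how finditer splits the first line into two valid range strings.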
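Since the question asks for a plain True/False answer, the same pattern can be wrapped in a small predicate. This is only a sketch: the name is_valid_range is made up for illustration, the pattern is rebuilt here so the snippet runs on its own, and it adds a \b before the month name, which the code above does not have.

```python
import re

# All twelve months, long or abbreviated form.
MONTHS = [
    r'Jan(?:uary)?', r'Feb(?:ruary)?', r'Mar(?:ch)?', r'Apr(?:il)?',
    r'May', r'Jun(?:e)?', r'Jul(?:y)?', r'Aug(?:ust)?',
    r'Sep(?:tember)?', r'Oct(?:ober)?', r'Nov(?:ember)?', r'Dec(?:ember)?',
]
RE_MONTH = '(?:{})'.format('|'.join(MONTHS))
# Month name, whitespace, then 2 to 4 digits (assumption: \b keeps
# the month from matching inside a longer word).
RE_DATE = r'\b{}\s+\d{{2,4}}'.format(RE_MONTH)
# MONTH YEAR, some separator text, then MONTH YEAR or "present".
RE_RANGE = re.compile(r'{d}.+?(?:{d}|present)'.format(d=RE_DATE), re.IGNORECASE)


def is_valid_range(s):
    """Return True if s contains MONTH YEAR followed by MONTH YEAR or 'present'."""
    return RE_RANGE.search(s) is not None


print(is_valid_range("February 2018 to present"))                  # True
print(is_valid_range("University of Pennsylvania February 2018"))  # False
```

Because this uses search, text before or after the range is tolerated. If the whole string has to be a range and nothing else, RE_RANGE.fullmatch(s) is the stricter alternative.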

If you only want to extract the ranges, you can use this:

valid_ranges = re.findall(RE_VALID_RANGE, text)

But this returns the groups [(M1, Y1, M2, Y2, present), ...] instead of the text:

[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
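If you want to verify the captured values further, one option is to turn the groups into dates and check that the start does not come after the end. The sketch below is one possible way to do it, not part of the original answer: the helpers to_date and chronological_ranges and the today parameter are hypothetical, and it assumes four-digit years only, because two-digit years are ambiguous.

```python
import re
from datetime import date

# Map the first three letters of a month name to its number.
MONTH_NUMBERS = {name: number for number, name in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], start=1)}

RE_MONTH = (r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
            r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
RE_DATE = RE_MONTH + r'\s+(\d{4})'  # assumption: four-digit years only
RE_RANGE = re.compile(RE_DATE + r'.+?(?:' + RE_DATE + r'|(present))', re.IGNORECASE)


def to_date(month, year):
    """Convert a captured month name and year to the first day of that month."""
    return date(int(year), MONTH_NUMBERS[month[:3].lower()], 1)


def chronological_ranges(text, today=None):
    """Return (matched text, start, end) for every range whose start <= end."""
    today = today or date.today()
    result = []
    for match in RE_RANGE.finditer(text):
        m1, y1, m2, y2, present = match.groups()  # unmatched groups are None
        start = to_date(m1, y1)
        end = today if present else to_date(m2, y2)
        if start <= end:
            result.append((match.group(0), start, end))
    return result


print(chronological_ranges("February 2016 - March 2019"))
print(chronological_ranges("March 2019 - February 2016"))  # reversed range is dropped
```

Note that match.groups() returns None for groups that did not take part in the match, whereas findall returns empty strings, so the truthiness test on present works in both cases.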

Regarding python - Regex to match date ranges that include month names, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58108251/
