python - re.sub after matching. All instances of a repeated matching group

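I am using a regular expression in Python to match numbers in a str. I want to capture numbers that may have a thousands separator (for me, a comma or a space) or may simply be a run of digits. What my regex captures is shown below.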

>>> import re
>>> test = '3,254,236,948,348.884423 cold things, ' + \
'123,242 falling birds, .84973 of a French pen , ' + \
'65 243 turtle gloves, 8 001 457.2328009 units, and ' + \
'8d523c.'
>>> matches = re.finditer(ANY_NUMBER_SRCH, test, flags=re.MULTILINE)
>>> for match in matches:
...   print (str(match))
...
<_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
<_sre.SRE_Match object; span=(27, 34), match='123,242'>
<_sre.SRE_Match object; span=(37, 43), match='.84973'>
<_sre.SRE_Match object; span=(46, 52), match='65 243'>
<_sre.SRE_Match object; span=(55, 72), match='8 001 457.2328009'>
<_sre.SRE_Match object; span=(73, 74), match='8'>
<_sre.SRE_Match object; span=(75, 78), match='523'>

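This is the matching behavior I want. Now I want to take each matched number and remove the thousands separators (',' or ' ') if any are present. That should leave me with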

'3254236948348.884423 cold things, ' + \
'123242 falling birds, .84973 of a French pen ,' + \
'65243 turtle gloves, 8001457.2328009 units, ' + \
'and 8d523c.'

Basically, I have one regex that captures numbers. It is used in several places, such as finding dollar amounts and picking up ordinals, so I named it ANY_NUMBER_SRCH.

What I would like to do is something like this:

matches = some_method_to_get_all_matches(ANY_NUMBER_SRCH)
for match in matches:
  corrected_match = re.sub(r"[, ]", "", match)
  change_match_to_corrected_match_in_the_test_string

As it stands, I cannot use substitution groups. If you just want to look at the regex, see https://regex101.com/r/AzChEE/3 . Basically, one part of my regex is the following:

r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)"

Here it is again, split across lines so it does not need horizontal scrolling:

(r"(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})"
  "(?P<thousands_separator>[ ,])"
  "(?P<three_digits_w_sep>(?P<three_digits>\d{3})"
  "(?P=thousands_separator))*"
  "(?P<last_group_of_three>\d{3})(?!\d)")

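The regex engine does not keep every repetition of the repeated group (three_digits_w_sep in the pattern above), because a capture group under * retains only its last iteration.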
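To illustrate the point about repeated capture groups, here is a minimal sketch. The pattern is a hypothetical, much-simplified stand-in for the fragment above (it only handles commas), and the group names `first` and `three_digits` are chosen for illustration only.

```python
import re

# Hypothetical simplified pattern: a leading group of 1-3 digits,
# followed by a repeated ",ddd" group.
pattern = re.compile(r"(?P<first>\d{1,3})(?:,(?P<three_digits>\d{3}))*")

match = pattern.match("1,234,567")

whole = match.group(0)                    # the full matched text
last_three = match.group("three_digits")  # only the final repetition survives

print(whole)
print(last_three)
```

The whole match still contains every repetition, which is why working on `match.group(0)` as a string is simpler than trying to recover each repetition from the groups.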

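I am sure there is a way to use the span part of the _sre.SRE_Match objects. However, that would be very complicated, and I am working with strings that contain thousands to hundreds of thousands of characters. Is there a simple way to run re.sub after re.match or re.finditer, or whatever other method is used to find the number pattern?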

@abarnert gave me the correct answer: use a lambda function. My comment on @abarnert's answer, starting with "Verified!", shows all the steps.

My attempt

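By the way, I have already looked at these questions on SO ( replace portion of match , extract part of a match , replace after matching pattern , repeated capturing group stuff ), but they only show how to use substitution groups. I also tried using re.finditer, as shown below, with the following result.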

>>> matches = re.finditer(lib_re.ANY_NUMBER_SRCH, test, flags=re.MULTILINE)     
>>> for match in matches:
...   print ("match: " + str(match))
...   corrected_match = re.sub(r"[, ]", "", match)
...   print ("corrected_match: " + str(corrected_match))
...
match: <_sre.SRE_Match object; span=(0, 24), match='3,254,236,948,348.884423'>
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
>>>   print ("corrected_match: " + str(corrected_match))
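The TypeError in the transcript above comes from passing the match object itself to `re.sub`, which expects a string; `match.group()` returns the matched text. A minimal sketch of that fix, using a hypothetical simplified pattern (`SIMPLE_NUMBER`) in place of ANY_NUMBER_SRCH:

```python
import re

# Hypothetical stand-in for ANY_NUMBER_SRCH: a digit followed by digits or commas.
SIMPLE_NUMBER = re.compile(r"\d[\d,]*")

test = "123,242 falling birds"

corrected_matches = []
for match in SIMPLE_NUMBER.finditer(test):
    matched_text = match.group()                   # a str, not a match object
    corrected = re.sub(r"[, ]", "", matched_text)  # now re.sub gets a string
    corrected_matches.append(corrected)
    print(match.span(), matched_text, "->", corrected)
```

This only cleans each match in isolation; writing the cleaned text back into the original string would still need span bookkeeping.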

The big regex

In case the regex101.com link breaks, here is the huge regex:

ANY_NUMBER_SRCH = r"(?P<number_capture>(?P<pre1>(?<![^0-9,.+-])|)(?P<number>(?P<sign_symbol_opt1>(?<![0-9])[+-])?(?P<whole_number_w_thous_sep>(?P<first_group>\d{1,3})(?P<thousands_separator>[ ,])(?P<three_digits_w_sep>(?P<three_digits>\d{3})(?P=thousands_separator))*(?P<last_group_of_three>\d{3})(?!\d)|(?P<whole_number_w_o_thous_sep>\d+))(?P<decimal_separator_1>[.])?(?P<fractional_w_whole_before>(?<=[.])(?P<digits_after_decimal_sep_1>\d+))?(?P<post1>(?<![^0-9,.+-])|)|(?P<pre2>(?<![^0-9,.+-])|)(?P<fractional_without_whole_before>(?P<sign_symbol_opt2>(?<![0-9])[+-])?(?P<decimal_separator_2>[.])(?P<digits_after_decimal_sep_2>\d+)))(?P<post2>(?<![^0-9,.+-])|))"

Best answer

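I don't see any reason you can't just use re.sub here instead of re.finditer. Your repl is applied once per match, and the result of replacing each occurrence of pattern in string with repl is returned, which is exactly what you want.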

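I can't actually run your example, because copying and pasting test gives me a SyntaxError, and copying and pasting ANY_NUMBER_SRCH gives me an error compiling the regex. I don't want to go down a rabbit hole fixing all the errors, most of which are probably not even in your real code. So let me give a simpler example: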

>>> import re
>>> test = '3,254,236,948,348.884423 cold things and 8d523c'
>>> pattern = re.compile(r'[\d,]+')
>>> pattern.findall(test) # just to verify that it works
['3,254,236,948,348', '884423', '8', '523']
>>> pattern.sub(lambda match: match.group().replace(',', ''), test)
'3254236948348.884423 cold things and 8d523c'

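Obviously, your repl function will be a bit more complicated than just removing all the commas, and you will probably want to def it out of line rather than trying to cram it into a lambda. But whatever your rules are, if you can write them as a function that takes a match object and returns the string you want in place of that match, you can just pass that function to sub.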
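As a sketch of what such an out-of-line repl function might look like for the question's test string: the pattern `NUMBER_WITH_SEPARATORS` below is a hypothetical, heavily simplified stand-in for ANY_NUMBER_SRCH (it only targets whole-number parts that contain a consistent thousands separator), and the function name is likewise an assumption made for illustration.

```python
import re

# Hypothetical simplified stand-in for ANY_NUMBER_SRCH. It matches only the
# whole-number part of numbers that use a consistent ',' or ' ' separator.
NUMBER_WITH_SEPARATORS = re.compile(
    r"(?<![\d.])"           # do not start inside a number or a fraction
    r"\d{1,3}"              # leading group of one to three digits
    r"(?P<sep>[ ,])\d{3}"   # first separator, then three digits
    r"(?:(?P=sep)\d{3})*"   # further groups must reuse the same separator
    r"(?!\d)"               # do not stop in the middle of a digit run
)


def strip_thousands_separators(match):
    """Return the matched number with ',' and ' ' separators removed."""
    return re.sub(r"[ ,]", "", match.group())


test = ("3,254,236,948,348.884423 cold things, "
        "123,242 falling birds, .84973 of a French pen , "
        "65 243 turtle gloves, 8 001 457.2328009 units, and "
        "8d523c.")

# sub() calls the function once per match and splices its return value in.
result = NUMBER_WITH_SEPARATORS.sub(strip_thousands_separators, test)
print(result)
```

Because `sub` does the splicing itself, no span arithmetic is needed, and the text between matches is left untouched.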

The original question, "python - re.sub after matching. All instances of a repeated matching group", is on Stack Overflow: https://stackoverflow.com/questions/51640496/
