I have the following pattern to match:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For context, it is part of a larger file that contains many similar patterns separated by commas:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
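My goal is to discard every pattern that ends in 'page' and keep the rest. To do that, I am trying to use a regular expression to identify those patterns. This is what I have come up with so far: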
"\(.*?,\'page\'\)"
However, it does not work as expected. In the Python code below, I use this regex and replace each match with an empty string:
import re
txt = "(10,'Redirects_from_moves','*..2NN:,@2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,@2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,@2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub(r"\(.*?,'page'\)", "", txt)
I expected new_txt to contain all the patterns ending in 'subcat', with all those ending in 'page' removed. Instead, I get:
new_txt = ,,,,
What is happening here? How can I change my regex to get the desired result?
Best answer
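It is tempting to do a regex replacement here, but that basically always leaves edge cases open, as @Wiktor correctly points out in the comments below. A simpler approach is to use re.findall and extract each tuple that does not end in 'page'. Here is an example: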
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,@2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
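The pattern passed to re.findall in the example is dense, so it may help to see it written out piece by piece. The following is a minimal sketch that uses re.VERBOSE to annotate the same pattern; the name TUPLE_NOT_PAGE and the shortened sample data are hypothetical and chosen only for illustration.

```python
import re

# The answer's pattern, spelled out with re.VERBOSE so each part can carry a comment.
# TUPLE_NOT_PAGE is a hypothetical name used only in this sketch.
TUPLE_NOT_PAGE = re.compile(r"""
    \(                   # opening parenthesis of the tuple
    \d+                  # leading numeric id
    ,'[^']*?'            # first single-quoted field
    (?:,'[^']*?'){4}     # four more single-quoted fields
    ,'(?!page')[^']*?'   # last quoted field, rejected if it is exactly 'page'
    \)                   # closing parenthesis
    ,?                   # optional comma that separates tuples
""", re.VERBOSE)

# Shortened stand-in data with the same shape as the question's input (values are made up).
sample = (
    "(1,'A','k','2020-01-01 00:00:00','','uca','page'),"
    "(2,'B','k','2020-01-01 00:00:00','','uca','subcat'),"
    "(3,'C','k','2020-01-01 00:00:00',' ','uca','pages')"
)

kept = TUPLE_NOT_PAGE.findall(sample)
print(''.join(kept))
```

Because every field is matched with [^']*?, no single field can run past its closing quote, which is what keeps one match from spilling into the next tuple. The same property means this sketch assumes that no field contains an escaped single quote; data with such fields would need a different field pattern.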
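The regex pattern used above matches a leading number, followed by five single-quoted terms, and then a sixth single-quoted term that is not 'page'. We then join the matched tuples in the resulting list into a single string.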
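As for why the original re.sub call left only commas: a lazy .*? keeps the match as short as possible only from the position where the match started, and it does not move that starting parenthesis forward. When a tuple ends in 'subcat', the match that begins at its opening parenthesis therefore keeps extending until it reaches the next ,'page') and swallows the 'subcat' tuple along the way. The sketch below illustrates this on shortened stand-in data (the field values are hypothetical, not taken from the question).

```python
import re

# Shortened stand-in for the question's data (hypothetical values, same shape).
txt = (
    "(10,'a','x','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
    "(12,'b','x','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
    "(13,'c','x','2020-01-23 13:27:44','','uca-default-u-kn','page')"
)

# The question's pattern: lazy, but '.' is still free to cross tuple boundaries.
lazy = re.compile(r"\(.*?,'page'\)")

matches = lazy.findall(txt)
for m in matches:
    # List the tuple ids contained in each match.
    print(re.findall(r"\((\d+),", m))

# Only the comma between the first two tuples is left after substitution.
print(repr(lazy.sub("", txt)))
```

The second match covers both tuple 12 and tuple 13, so the 'subcat' tuple is deleted together with the 'page' tuple that follows it. Restricting each field to [^']*, as the findall pattern does, prevents a match from crossing into the next tuple.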
Source: the Stack Overflow question on writing a Python regex to match a small part of a repeating pattern: https://stackoverflow.com/questions/66901417/