python - Multiple regex patterns to extract data from an article using Python

Tags: python regex

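New to Python, but old to life. I am trying to extract data from a news article txt file using multiple regex patterns stored in another txt file. I have gotten to the point where I can find matches, but I cannot save the extracted data. Below is what I have so far in my crude, unsanitary, non-pythonic script. All comments are appreciated, as I am teaching myself.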

import re

reg_ex = open('APT1.txt', "r", encoding = 'utf-8-sig')
lines = reg_ex.read()
strip = lines.strip()
reggie = strip.split(';') 


reggie_lst = []
match_lst = []

for raw_regex in reggie:
    reggie_lst.append(re.compile(raw_regex))


get_string = open("APT.txt", "r", encoding = 'utf-8-sig')
nystring = get_string.read()


if any(compiled_reg.search(nystring) for compiled_reg in reggie_lst):
    print("Got some Matches")

Best answer

Instead of only asking whether a regex matches, you can use re.findall() to extract your data into a list.

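import re

# Read the ';'-separated regex patterns; a trailing ';' would leave an
# empty pattern behind, which matches everywhere, so skip blank entries.
with open('APT1.txt', "r", encoding='utf-8-sig') as reg_ex:
    reggie = reg_ex.read().strip().split(';')
reggie_lst = [raw_regex for raw_regex in reggie if raw_regex.strip()]

# Read the article text to search.
with open("APT.txt", "r", encoding='utf-8-sig') as get_string:
    nystring = get_string.read()

# Collect every match instead of only testing whether one exists.
match_lst = []
for reg in reggie_lst:
    for text_match in re.findall(reg, nystring):
        match_lst.append(text_match)
        print("Got match for regex {}: {}".format(reg, text_match))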

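Of course, instead of printing each match on the last line, you could also save it to a new file. In this example I also removed the compiling of the regexes, purely for printing/debugging purposes.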
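As a rough sketch of the "save it to a new file" idea, the snippet below collects the matches and writes them out one per line. The helper names `collect_matches` and `save_matches`, the tab-separated output format, the output file name, and the sample patterns and text (stand-ins for the contents of APT1.txt and APT.txt) are all assumptions made for illustration, not part of the original answer.

```python
import re
import tempfile
from pathlib import Path


def collect_matches(raw_patterns, text):
    """Return (pattern, match) pairs for every non-empty pattern."""
    results = []
    for raw in raw_patterns:
        pattern = raw.strip()  # assumption: surrounding whitespace is not part of a pattern
        if not pattern:        # a trailing ';' leaves an empty entry behind
            continue
        for found in re.findall(pattern, text):
            results.append((pattern, found))
    return results


def save_matches(results, out_path):
    """Write one 'pattern<TAB>match' line per result."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for pattern, found in results:
            handle.write("{}\t{}\n".format(pattern, found))


# Hypothetical stand-ins for the contents of APT1.txt and APT.txt.
sample_patterns = r"\d{4}-\d{2}-\d{2};[A-Z]{3}\d;".split(";")
sample_text = "Report dated 2013-02-19 describes APT1 activity."

results = collect_matches(sample_patterns, sample_text)
out_file = Path(tempfile.gettempdir()) / "matches_demo.txt"
save_matches(results, out_file)
```

Keeping the pattern next to each match makes it possible to tell afterwards which regex produced which piece of data; if only the extracted text matters, writing `found` alone would do.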

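Be careful when using parentheses (groups) in your regexes. re.findall() behaves slightly differently from re.search() or re.match(): when the pattern contains capturing groups, it returns the text of the groups rather than the whole match. You will have to use (?: … in that case; see also this post.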
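The difference is easiest to see side by side. The following is a small illustration using made-up sample text and patterns (none of them come from the original question); it relies only on documented `re` behaviour.

```python
import re

text = "error 404, error 500"

# One capturing group: findall returns only the group's text.
with_group = re.findall(r"error (\d+)", text)

# Grouping used just for alternation: the capturing form returns the
# group, the non-capturing (?:...) form returns the whole match.
capturing = re.findall(r"(error|warn) \d+", text)
non_capturing = re.findall(r"(?:error|warn) \d+", text)

# re.search always exposes the whole match as group(0),
# regardless of how many capturing groups the pattern has.
first = re.search(r"(error|warn) \d+", text)

print(with_group)      # ['404', '500']
print(capturing)       # ['error', 'error']
print(non_capturing)   # ['error 404', 'error 500']
print(first.group(0))  # error 404
```

If both the whole match and the individual groups are needed, `re.finditer()` yields match objects that offer `group(0)` and `groups()` for each hit.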

Regarding "python - Multiple regex patterns to extract data from an article using Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52618015/
