python - Parsing text with Python RegEx re.findall

Tags: python regex findall

I have a long string that I need to parse into groups, but I need more control over how it is split.

import re

RAW_Data = "Name Multiple Words Testing With 1234 Numbers and this stuff* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name Multiple Words Testing With 78910 Numbers and this stuff3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name Multiple Words Testing With 1234 Numbers and this stuff4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name Multiple Words Testing With 1234 Numbers and this stuff5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name Multiple Words Testing With 1234 Numbers and this stuff6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name Multiple Words Testing With 1234 Numbers and this stuff7* ((Bla Bla Bla (Bla Bla) A52 & A53)) Name Multiple Words Testing With 1234 Numbers and this stuff8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name Multiple Words Testing With 1234 Numbers and this stuff9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name Multiple Words Testing With 1234 Numbers and this stuff10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name Multiple Words Testing With 1234 Numbers and this stuff11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name Multiple Words Testing With 1234 Numbers and this stuff12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name Multiple Words Testing With 1234 Numbers and this stuff13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name Multiple Words Testing With 1234 Numbers and this stuff14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name Multiple Words Testing With 1234 Numbers and this stuff15* ((Bla Bla Bla (Bla Bla) A68 & A69)) Name Multiple Words Testing With 1234 Numbers and this stuff16*"

# Lazily capture everything up to (but not including) each "* " sequence
fromnode = re.findall(r'(.*?)(?=\*\s)', RAW_Data)

print(fromnode)

del fromnode
del RAW_Data

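The result is: 'Name Multiple Words Testing With 1234 Numbers and this stuff', '', '((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2' ... and so on.

I can't seem to capture only strings like "Name Multiple Words Testing With 3456 Numbers and this stuff" while ignoring everything like "((Bla Bla Bla (Bla Bla) A40 & A41))". Any help would be appreciated.

Accepted answer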

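You can split on the

r'\*\s*\({2}.*?\){2}\s*'

pattern (see demo), which matches:

  • \* - a literal asterisk
  • \s* - zero or more whitespace characters
  • \({2} - exactly 2 opening parentheses
  • .*? - zero or more characters other than a newline, as few as possible, up to the first match of the next subpattern (note: add the re.S flag if you need to match across multiple lines)
  • \){2} - exactly 2 closing parentheses
  • \s* - zero or more whitespace characters.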
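To illustrate the re.S note in the list above, here is a minimal sketch. The sample string is hypothetical (it is not the question's data) and only serves to put a line break inside the parenthesised block.

```python
import re

pattern = r'\*\s*\({2}.*?\){2}\s*'

# Hypothetical sample: the first "((...))" block contains a newline.
text = "First name* ((Bla\nBla)) Second name* ((Bla)) Third name"

# Without re.S, "." cannot cross the newline, so the first block is not matched.
without_flag = re.split(pattern, text)
# With re.S, "." also matches newlines and both blocks are removed.
with_flag = re.split(pattern, text, flags=re.S)

print(without_flag)  # ['First name* ((Bla\nBla)) Second name', 'Third name']
print(with_flag)     # ['First name', 'Second name', 'Third name']
```

If the input is known to be a single line, as in the question, the flag makes no difference.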

Alternatively, here is the same regex unrolled (and thus a bit more efficient):

\*\s*\({2}[^)]*(?:\)(?!\))[^)]*)*\){2}\s*
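As a quick sanity check, the sketch below splits a shortened sample with both the lazy and the unrolled pattern and compares the results. The sample string and the variable names are assumptions made for illustration; it shows agreement on this input only, not equivalence in general (for instance, [^)] also matches newlines, while . does not unless re.S is set).

```python
import re

lazy = re.compile(r'\*\s*\({2}.*?\){2}\s*')
unrolled = re.compile(r'\*\s*\({2}[^)]*(?:\)(?!\))[^)]*)*\){2}\s*')

# Shortened, hypothetical sample in the same shape as the question's data.
sample = ("Name One stuff* ((Bla (Bla) A40 & A41)) "
          "Name Two stuff2* ((Bla (Bla) A42 & A43)) "
          "Name Three stuff3*")

lazy_parts = lazy.split(sample)
unrolled_parts = unrolled.split(sample)

print(lazy_parts)                    # ['Name One stuff', 'Name Two stuff2', 'Name Three stuff3*']
print(lazy_parts == unrolled_parts)  # True
```

The last item keeps its trailing asterisk because no "((...))" block follows it, so the split pattern does not match there.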

See the IDEONE demo:

import re
p = re.compile(r'\*\s*\({2}.*?\){2}\s*')
test_str = "Name Multiple Words Testing With 1234 Numbers and this stuff* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name Multiple Words Testing With 78910 Numbers and this stuff3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name Multiple Words Testing With 1234 Numbers and this stuff4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name Multiple Words Testing With 1234 Numbers and this stuff5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name Multiple Words Testing With 1234 Numbers and this stuff6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name Multiple Words Testing With 1234 Numbers and this stuff7* ((Bla Bla Bla (Bla Bla) A52 & A53)) Name Multiple Words Testing With 1234 Numbers and this stuff8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name Multiple Words Testing With 1234 Numbers and this stuff9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name Multiple Words Testing With 1234 Numbers and this stuff10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name Multiple Words Testing With 1234 Numbers and this stuff11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name Multiple Words Testing With 1234 Numbers and this stuff12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name Multiple Words Testing With 1234 Numbers and this stuff13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name Multiple Words Testing With 1234 Numbers and this stuff14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name Multiple Words Testing With 1234 Numbers and this stuff15* ((Bla Bla Bla (Bla Bla) A68 & A69)) Name Multiple Words Testing With 1234 Numbers and this stuff16*"
print(re.split(p, test_str))

Update

A regex for use with re.findall:

(?:\*\s*\(\([^)]*(?:\)(?!\))[^)]*)*\)\))?\s*([^*]*(?:\*(?!\s*\(\()[^*]*)*)\s*

See the regex demo

Intimidated by how it looks? It is just an unrolled version of the simpler (?:\*\s*\(\(.*?\)\))?\s*(.*?(?=\*\s*(?:\(\(|$))).

See the IDEONE demo
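Since the linked demo is not reproduced here, the following is a minimal sketch of how the unrolled pattern might be used with re.findall. The shortened sample string and the variable names are assumptions, not the answer's own code. Because every part of the pattern is optional, it can also match the empty string at the end of the input, so the sketch filters out empty results.

```python
import re

# Unrolled findall pattern from the update; group 1 captures the name part.
pattern = re.compile(
    r'(?:\*\s*\(\([^)]*(?:\)(?!\))[^)]*)*\)\))?\s*'
    r'([^*]*(?:\*(?!\s*\(\()[^*]*)*)\s*'
)

# Shortened, hypothetical sample in the same shape as the question's data.
sample = ("Name One stuff* ((Bla (Bla) A40 & A41)) "
          "Name Two stuff2* ((Bla (Bla) A42 & A43)) "
          "Name Three stuff3*")

raw_matches = pattern.findall(sample)
# Drop the empty string produced by the zero-width match at the end of the input.
names = [m for m in raw_matches if m]

print(raw_matches)  # ['Name One stuff', 'Name Two stuff2', 'Name Three stuff3*', '']
print(names)        # ['Name One stuff', 'Name Two stuff2', 'Name Three stuff3*']
```

The final entry keeps its trailing asterisk because that asterisk is not followed by "((". If that is unwanted, calling rstrip('*') on each item is one possible cleanup.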

Source: a similar question on Stack Overflow about parsing text with Python RegEx re.findall: https://stackoverflow.com/questions/36924820/
