python - Extracting text from a string using regular expressions

I have a very large string that contains many paragraphs. Each paragraph begins with a title that follows a specific pattern.

Example:

== Title1 ==  // paragraph starts
........................
............  // some text
........................
End of paragraph
===Title2 ===  // paragraph starts
........................
............  // some text
........................

The title format is:

1.) A new paragraph title starts with an equals sign ( = ), which can be followed by any number of further = characters.

2.) After the = characters there can be a whitespace (optional), followed by the title text.

3.) After the title text there can again be a whitespace (optional), followed by any number of equals signs ( = ).

4.) Then the paragraph starts. I have to extract the text up to the next occurrence of a similar pattern.

Can anyone help me do this with a regular expression? Thanks in advance.

Best answer

You can use

re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)

See the regex demo.

Details

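(?m)^ - start of a line
=+ - one or more = characters
[^\S\r\n]* - zero or more whitespace characters other than CR and LF
(.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
[^\S\r\n]* - zero or more whitespace characters other than CR and LF
=+ - one or more = characters
\s* - zero or more whitespace characters (line breaks included)
(.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
- .* - any zero or more characters other than line break characters, as many as possible
- (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of
  - \r?\n(?!=+.*?=) - an optional CR, then an LF, not followed by one or more = characters, then any characters other than line break characters, as few as possible, and then another =
  - .* - any zero or more characters other than line break characters, as many as possible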
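The breakdown above can also be written into the pattern itself. The following is a minimal sketch, not part of the original answer: it assumes the same pattern compiled with re.VERBOSE and re.MULTILINE, and the name HEADING_AND_BODY is only illustrative.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# HEADING_AND_BODY is an illustrative name, not something from the original answer.
HEADING_AND_BODY = re.compile(
    r"""
    ^=+                # heading line starts with one or more equals signs
    [^\S\r\n]*         # optional horizontal whitespace
    (.*?)              # group 1: the title text, as short as possible
    [^\S\r\n]*         # optional horizontal whitespace
    =+                 # closing run of equals signs
    \s*                # whitespace, including the line break after the heading
    (                  # group 2: the paragraph body
      .*               #   rest of the first body line
      (?:
        \r?\n          #   a line break
        (?!=+.*?=)     #   that is not followed by something shaped like a heading
        .*             #   the next body line
      )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = "== Title1 ==\nfirst line\nsecond line\n===Title2 ===\nthird line"
print(HEADING_AND_BODY.findall(sample))
```

Note that the lookahead treats any line that starts with = and contains another = later on as a heading, so a body line of that shape would end the paragraph early.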

Python demo:

import re

rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))

Output:

[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
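If the paragraphs need to be looked up by title, the (title, body) pairs can be collected into a dictionary. This is a rough sketch under a few assumptions: the helper name sections_by_title is hypothetical, titles are assumed to be unique (a repeated title would overwrite the earlier entry), and stripping surrounding whitespace from each body is assumed to be acceptable.

```python
import re

# The pattern from the answer, compiled once so it can be reused.
RX = re.compile(r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)")

def sections_by_title(text):
    """Map each heading title to its paragraph text (hypothetical helper)."""
    # group(1) is the title, group(2) the body; strip() drops a trailing line break.
    return {m.group(1): m.group(2).strip() for m in RX.finditer(text)}

s = "== Title1 ==\nline one\nline two\n===Title2 ===\nline three\n"
sections = sections_by_title(s)
print(sections["Title1"])
```

re.finditer is used here instead of re.findall only so that the groups can be addressed by number on each match object; the matches themselves are the same.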

This Q&A on extracting text from a string using regular expressions is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/51893787/
