Python: using a regular expression to match paragraphs separated by multiple line breaks

Tags: python regex paragraph findall

I am trying to match paragraphs using Python and the re module.

Sample text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.

two or more line breaks here

Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

two or more line breaks here

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

This expression seems to almost do the job:

paragraphs = re.findall(r'(?s)((?:[^\n][\n]?)+)', textContent)

But I want to make sure a paragraph only ends where there are two or more line breaks. At the moment it matches too often.

Edit:

ART. WEFWEFEW
  1 SDVSDRG: **<at the moment it breaks here, but it shouldn't>**
     a. wevvdfvdfd
     b. sdfsdfsdfsdfsdfsdghtrhrth

Edit 2:

ART. WEFWEFEW
   1 SDVSDRG: 
      **here are two line breaks, but don't split this paragraph**
      **at the moment it breaks here, but it shouldn't**
     a. wevvdfvdfd
     b. sdfsdfsdfsdfsdfsdghtrhrth

Best answer

Have a look at the regex (?m)(?:.+(?:\n.)?)+ on RegEx101, where you can also get an explanation of it.

Sample Python code using this regex:
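For readers who cannot open the RegEx101 page, here is a rough breakdown of the pattern written with re.VERBOSE. This is my own reading of the regex, not the RegEx101 explanation. The name PARAGRAPH and the sample string are made up for this sketch, and the (?m) flag is left out on the assumption that it changes nothing here, since it only affects ^ and $ and the pattern uses neither.

```python
import re

# Same pattern as in the answer, minus the (?m) flag.
# PARAGRAPH is a name chosen for this sketch only.
PARAGRAPH = re.compile(r"""
    (?:             # one chunk of the paragraph, repeated as often as possible
        .+          # the rest of the current line; "." stops at a newline
        (?:\n.)?    # optionally one newline plus the first character of the next line
    )+
""", re.VERBOSE)

sample = "first line\nsecond line\n\n\nnext paragraph"
print(PARAGRAPH.findall(sample))
# ['first line\nsecond line', 'next paragraph']
```

A blank line ends the match because the character after the newline is another newline, which "." does not match.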

import re
import pprint

textContent = '''Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et
accusam et justo duo dolores et ea rebum.

Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet.


Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna
aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren, no
sea takimata sanctus est Lorem ipsum dolor sit amet.



ART. WEFWEFEW
  1 SDVSDRG:
     a. wevvdfvdfd
     b. sdfsdfsdfsdfsdfsdghtrhrth'''

pprint.pprint(re.findall(r'(?m)(?:.+(?:\n.)?)+', textContent))

Output:

['Lorem ipsum dolor sit amet, consetetur sadipscing elitr,\n'
 'sed diam nonumy eirmod tempor invidunt ut labore et dolore\n'
 'magna aliquyam erat, sed diam voluptua. At vero eos et\n'
 'accusam et justo duo dolores et ea rebum.',
 'Stet clita kasd gubergren, no sea takimata sanctus est Lorem\n'
 'ipsum dolor sit amet.',
 'Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam\n'
 'nonumy eirmod tempor invidunt ut labore et dolore magna\n'
 'aliquyam erat, sed diam voluptua. At vero eos et accusam et\n'
 'justo duo dolores et ea rebum. Stet clita kasd gubergren, no\n'
 'sea takimata sanctus est Lorem ipsum dolor sit amet.',
 'ART. WEFWEFEW\n'
 '  1 SDVSDRG:\n'
 '     a. wevvdfvdfd\n'
 '     b. sdfsdfsdfsdfsdfsdghtrhrth']
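One caveat that the sample text above does not hit: as far as I can tell, a line consisting of a single character ends the match early, because that character is consumed by the "\n." part and ".+" then finds nothing left on the line. The sketch below compares the answer's pattern with a variant, .+(?:\n.+)*, which is my own suggestion and not part of the answer. The variable names and the test string are made up.

```python
import re

text = "x\ny\nz\n\nnext paragraph"

# Pattern from the answer: the one-character line "y" is consumed by "\n.",
# so ".+" cannot match at the following newline and the paragraph ends early.
answer_result = re.findall(r'(?:.+(?:\n.)?)+', text)
print(answer_result)   # ['x\ny', 'z', 'next paragraph']

# Variant (an assumption of this sketch): one non-empty line,
# followed by any number of "newline + non-empty line" pairs.
variant_result = re.findall(r'.+(?:\n.+)*', text)
print(variant_result)  # ['x\ny\nz', 'next paragraph']
```

If your text never contains one-character lines, both patterns should give the same result.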

Demo on Rextester.
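If matching is not a hard requirement, splitting on runs of two or more newlines is another way to get the same paragraphs. This is a different technique from the one in the answer, shown only as a sketch. split_paragraphs is a hypothetical helper name, and the code assumes plain "\n" line endings.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs at runs of two or more newlines."""
    # strip() drops leading/trailing blank lines so no empty pieces appear at the ends
    parts = re.split(r'\n{2,}', text.strip())
    return [p for p in parts if p]

text = "first line\nsecond line\n\n\nnext paragraph\n"
print(split_paragraphs(text))
# ['first line\nsecond line', 'next paragraph']
```

Single line breaks inside a paragraph are left untouched, which is what the question asks for.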

For more on using a regular expression in Python to match paragraphs separated by multiple line breaks, see the similar question on Stack Overflow: https://stackoverflow.com/questions/57834383/
