I have the following block of text:
text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""
I want to split it like this:
import re

RE = r'(SECTION.*?SECTION)'
m = re.findall(RE, text, re.DOTALL)
sections = []
if m:
    for match in m:
        sections.append(match)
I expected this to produce a list with 4 elements, but I end up with only 2:
['SECTION 1. .....', 'SECTION 3. .....'] # only showing the first letters of each element
After that, I want to do the same thing for the chapters and the articles.
Any ideas?
Best answer
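Assuming the word SECTION appears in the string only where a new section begins, you can simply use the built-in .split method, which is easier than using a regular expression.
Here is an example: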
text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""
delimiter = 'SECTION'
sections = [delimiter + s for s in text.split(delimiter)[1:]]
The result will be:
>>> sections
['SECTION 1. ...', 'SECTION 2. ...', 'SECTION 3. ...', 'SECTION 4. ...']
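If you would rather stay with a regular expression, it helps to see why the original pattern returned only 2 elements: re.findall returns non-overlapping matches, and 'SECTION.*?SECTION' consumes the second SECTION as the end of each match, so the heading of every other section is already used up when the scan resumes. A lookahead checks for the next marker without consuming it. The following is only a sketch under some assumptions: the helper name split_keeping and the three marker patterns are made up for illustration, and they assume headings always look like "SECTION n.", "CHAPTER n." and "Art. n.-" or "Art n.-" as in the sample text.

```python
import re

text = """SECTION 1. CHAPTER 1. Chapter title. Art. 1.- Lorem ipsum, blah, blah. Art 2.- More meaningless text. Art 3.- A little more text. CHAPTER 2. Another chapter. Art 4.- Lorem ipsum blah, blah, blah. Art. 5.- It's getting boring. SECTION 2. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 3. CHAPTER 1. Another chapter in another section. Art. 6.- The last text. SECTION 4. CHAPTER 1. Another chapter in another section. Art. 6.- The last text."""

# Hypothetical marker patterns, guessed from the sample text only.
SECTION = r'SECTION \d+\.'
CHAPTER = r'CHAPTER \d+\.'
ARTICLE = r'Art\.? \d+\.-'   # the dot after "Art" is optional in the sample


def split_keeping(marker, text):
    """Return the pieces of text that each start with a match of marker.

    The lookahead (?=...) stops each piece right before the next marker
    (or at the end of the string) without consuming that marker, so the
    next match can start on it. Anything before the first marker is dropped.
    """
    pattern = rf'(?:{marker}).*?(?=(?:{marker})|\Z)'
    return [piece.strip() for piece in re.findall(pattern, text, re.DOTALL)]


sections = split_keeping(SECTION, text)

# Same idea applied one level down: sections -> chapters -> articles.
nested = [
    [split_keeping(ARTICLE, chapter) for chapter in split_keeping(CHAPTER, section)]
    for section in sections
]

print(len(sections))   # number of sections found
print(nested[0][0])    # articles of the first chapter in the first section
```

Here nested[i][j] is the list of articles in chapter j of section i. Note that each level drops whatever precedes its first marker, so the chapter titles are not kept in the article lists; if you need them, keep the chapter strings from the intermediate split as well.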
Regarding "python - How to build a Python list from text using a regular expression?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33976539/