python - How do I read a file and extract data between multiline patterns?

Tags: python regex python-3.x pattern-matching

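I have a file from which I need to extract a piece of data that is delimited by fixed patterns, each of which may span multiple lines: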

some data ... [my opening pattern
is here
and can be multiline] the data 
I want to extract [my ending
pattern which can be
multiline as well] ... more data

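The patterns are fixed in the sense that their content is always the same, except that newlines may appear between the words.

If I could be sure the patterns would always arrive in a predictable layout, the solution would be simple, but that is not the case.

Is there a way to match such "patterns" against a stream?

There is a question that is almost a duplicate, and its answers point to buffering the input. My case differs in that I know the exact strings in the patterns, except that the words may also be separated by newlines (so no \w*-style matching is needed).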

Best answer

Is this what you are looking for?

>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data\nI want to extract']
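
Note that the regex above matches any bracketed text, while the question says the delimiters are known, fixed strings whose words may be split across lines. A minimal sketch of that variant follows, assuming the two delimiters are exactly the bracketed phrases from the sample data. The helper names `flexible_pattern` and `extract_between` are made up for this illustration and are not part of the original answer.

```python
import re

def flexible_pattern(literal):
    """Turn a fixed phrase into a regex that accepts any run of
    whitespace (spaces or newlines) between its words."""
    return r"\s+".join(re.escape(word) for word in literal.split())

def extract_between(text, opening, closing):
    """Return the stripped text found between each opening/closing pair."""
    regex = re.compile(
        flexible_pattern(opening) + r"(.*?)" + flexible_pattern(closing),
        re.DOTALL,  # let '.' cross line boundaries
    )
    return [found.strip() for found in regex.findall(text)]

data = """
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
"""

# Assumed delimiters, taken from the sample in the question.
opening = "[my opening pattern is here and can be multiline]"
closing = "[my ending pattern which can be multiline as well]"

print(extract_between(data, opening, closing))
```

`re.escape` keeps the literal brackets from being read as a character class, and the non-greedy `(.*?)` stops at the first closing delimiter rather than the last one.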

Update: to read a large file in chunks, I suggest the following approach:

## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366,
## titled "How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """f: file object opened in text mode
        delim: regex pattern; wrap it in a capturing group so that
               re.split() keeps the matched text in its result"""
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = ''  # the string to return

    def read_to_delim(self):
        """Return the text up to and including the last delim match, or None at EOF"""
        while True:
            b = self.f.read(256)
            if not b:  # EOF: the leftover buffer holds no complete match
                self.part = None
                break
            # Keep appending to the buffer
            self.buffer += b
            # Try to split the buffer on the delimiter regex
            parts = self.delim.split(self.buffer)
            # More than one part means the pattern was found
            if parts[:-1]:
                # Keep everything up to the last delimiter match
                self.part = ''.join(parts[:-1])
                # The unmatched tail becomes the new buffer
                self.buffer = parts[-1]
                break

        return self.part

if __name__ == '__main__':
    with open('input.txt', 'r') as f:
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))

    print('job done.')
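
The same buffering idea can also be written as a generator. This is only a sketch of a variation, not the code from the thread linked above. The name `iter_between` and the `chunk_size` parameter are made up for the example, and an in-memory `io.StringIO` stands in for a real file so the snippet runs on its own.

```python
import io
import re

def iter_between(f, regex, chunk_size=256):
    """Yield group 1 of every match of `regex` in the text stream `f`,
    reading `chunk_size` characters at a time."""
    buffer = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # EOF: the leftover buffer holds no complete match
            return
        buffer += chunk
        consumed = 0
        for match in regex.finditer(buffer):
            yield match.group(1)
            consumed = match.end()
        # Keep only the unmatched tail; it may hold the start of a match.
        buffer = buffer[consumed:]

# Same regex as in the answer: opening bracket block, captured data,
# closing bracket block.
regex = re.compile(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]')

sample = """
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
"""

# A tiny chunk size forces the delimiters to straddle chunk boundaries.
results = list(iter_between(io.StringIO(sample), regex, chunk_size=16))
print(results)
```

The buffer is trimmed only after a match, so it keeps growing while no match is found. For input where matches are rare, some upper bound on the buffer size would be needed.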

The original question, "python - How do I read a file and extract data between multiline patterns?", is on Stack Overflow: https://stackoverflow.com/questions/35888841/
