python - Parsing a large text file using regular expressions

I have a huge text file (1 GB) in which each "line" is delimited by ##.
For example:

## sentence 1 ## sentence 2
## sentence 3

I am trying to print the file split on the ## delimiter.

I tried the following code, but the read() call chokes because of the size of the file.

import re

dataFile = open('post.txt', 'r')
p = re.compile('##(.+)')

iterator = p.finditer(dataFile.read())
for match in iterator:
    print (match.group())

dataFile.close()

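Any ideas?

Best answer

This reads the file in chunks (at most chunksize characters per read, since the file is opened in text mode), which avoids the memory problems caused by reading the whole file at once: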

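import re

def open_delimited(filename, delimiter, *args, **kwargs):
    """
    Yield the pieces of a file separated by the regex `delimiter`,
    reading the file chunksize characters at a time.
    http://stackoverflow.com/a/17508761/190597
    """
    with open(filename, *args, **kwargs) as infile:
        chunksize = 10000
        remainder = ''
        # iter() with a sentinel keeps calling read() until it returns '' (end of file)
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.split(delimiter, remainder + chunk)
            # every piece except the last is known to be complete
            for piece in pieces[:-1]:
                yield piece
            # the last piece may continue into the next chunk, so hold it back
            remainder = pieces[-1]
        if remainder:
            yield remainder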

filename = 'post.txt'
for chunk in open_delimited(filename, '##', 'r'):
    print(chunk)
    print('-'*80)
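
For a file that looks like the sample in the question, the first piece is empty (the text starts with ##) and the pieces keep their surrounding whitespace and newlines. Here is a minimal sketch of a variant that drops empty pieces and strips whitespace. The name iter_delimited, its parameters, and the use of io.StringIO as a stand-in for the real file are assumptions made for illustration, not part of the original answer. It splits on a literal delimiter with str.split rather than on a regex.

```python
import io

def iter_delimited(fileobj, delimiter="##", chunksize=10000):
    """Yield non-empty, stripped records from fileobj, split on a literal delimiter."""
    remainder = ""
    while True:
        chunk = fileobj.read(chunksize)
        if not chunk:  # empty string means end of file
            break
        pieces = (remainder + chunk).split(delimiter)
        # The last piece may be incomplete (or end in half a delimiter),
        # so hold it back and prepend it to the next chunk.
        remainder = pieces.pop()
        for piece in pieces:
            piece = piece.strip()
            if piece:
                yield piece
    remainder = remainder.strip()
    if remainder:
        yield remainder

# A tiny chunksize forces delimiters to straddle chunk boundaries.
sample = io.StringIO("## sentence 1 ## sentence 2\n## sentence 3\n")
records = list(iter_delimited(sample, chunksize=5))
print(records)
```

With a real file, the same generator could be driven by something like `with open('post.txt', 'r') as f: for record in iter_delimited(f): ...`. Only the current chunk plus one unfinished record is held in memory at a time.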

The original question, "python - Parsing a large text file using regular expressions", is on Stack Overflow: https://stackoverflow.com/questions/18178089/
