python - Memory limits when using regular expressions on a large text file

I have a text file of the following form:

('1', '2')
('3', '4')
     .
     .
     .

I am trying to make it look like this:

1 2
3 4
etc...

I have been trying to do this with Python's re module by chaining re.sub calls together, like this:

for line in file:
    s = re.sub(r"\(", "", line)
    s1 = re.sub(r",", "", s)
    s2 = re.sub(r"'", "", s1)
    s3 = re.sub(r"\)", "", s2)
    output.write(s3)
output.close()

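It seems to work fine until I get near the end of the output file; then it becomes inconsistent and stops working. I think this is because of the size of the file I am working with: 300 MB, or about 12 million lines.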

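Can anyone help me confirm whether I am simply running out of memory, or whether it is something else? Is there a suitable alternative or workaround?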

Best answer

You can simplify your code by using a simpler regular expression that finds all the numbers in your input:

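import re

# file_name and output_name are the input and output paths
with open(file_name) as infile, open(output_name, 'w') as outfile:
    for line in infile:
        # keep only the digit runs on each line, separated by spaces
        outfile.write(' '.join(re.findall(r'\d+', line)))
        outfile.write('\n')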
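
The same idea can be wrapped in a small function to make the streaming behaviour easier to see. This is only a sketch under a few assumptions: the names convert_lines and DIGITS are made up for illustration, the in-memory io.StringIO objects stand in for the real input and output files, and skipping lines that contain no digits is an added choice that the answer above does not make. Because the loop reads and writes one line at a time, memory use should stay roughly constant regardless of file size.

```python
import io
import re

# Precompiled pattern matching each run of digits (hypothetical name).
DIGITS = re.compile(r"\d+")

def convert_lines(src, dst):
    """Copy src to dst one line at a time, keeping only the numbers.

    src and dst are any file-like objects. Returns the number of lines written.
    """
    count = 0
    for line in src:
        numbers = DIGITS.findall(line)
        if numbers:  # assumption: lines without digits are skipped
            dst.write(" ".join(numbers) + "\n")
            count += 1
    return count

# Small in-memory demonstration; real use would pass two open file objects.
sample = io.StringIO("('1', '2')\n('3', '4')\n")
result = io.StringIO()
convert_lines(sample, result)
print(result.getvalue(), end="")
```

With real files, calling the function inside a with block that opens both files closes them on exit, which also flushes any buffered output. Whether unflushed output is what caused the truncation described in the question is not confirmed by the source.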

For more on python - memory limits when using regular expressions on a large text file, see the similar question on Stack Overflow: https://stackoverflow.com/questions/32721235/
