python - Processing speed - editing a large 2GB text file in Python

I have a problem. I am working with .txt files whose line count is always a multiple of 4. I am using Python 3.

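I wrote code that is meant to take the 2nd and 4th line of every 4-line group in the text file and keep only the first 20 characters of those two lines (leaving lines 1 and 3 unedited), then create a new file containing the edited lines 2 and 4 together with the unedited lines 1 and 3. The pattern is the same for every group, because the line count of every text file I work with is always a multiple of 4.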

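This works on small files (about 100 lines in total), but the files I need to edit have more than 50 million lines, and processing them takes over 4 hours.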

My code is below. Can anyone suggest how to speed up my program? Thanks!

import io
import os
import sys

newData = ""
i=0
run=0
j=0
k=1
m=2
n=3
seqFile = open('temp100.txt', 'r')
seqData = seqFile.readlines()
while i < 14371315:
    sLine1 = seqData[j] 
    editLine2 = seqData[k]
    sLine3 = seqData[m]
    editLine4 = seqData[n]
    tempLine1 = editLine2[0:20]
    tempLine2 = editLine4[0:20]
    newLine1 = editLine2.replace(editLine2, tempLine1)
    newLine2 = editLine4.replace(editLine4, tempLine2)
    newData = newData + sLine1 + newLine1 + '\n' + sLine3 + newLine2
    if len(seqData[k]) > 20:
         newData += '\n'
    i=i+1
    run=run+1
    j=j+4
    k=k+4
    m=m+4
    n=n+4
    print(run)

seqFile.close()

new = open("new_100temp.txt", "w")
sys.stdout = new
print(newData)

Best answer

It will probably be much faster if you read and process only 4 lines at a time (untested):

with open('100temp.txt') as in_file, open('new_100temp.txt', 'w') as out_file:
    for line1, line2, line3, line4 in grouper(in_file, 4):
         # modify 4 lines
         out_file.writelines([line1, line2, line3, line4])

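Here grouper(it, n) is a function that yields n items of the iterator it at a time. It is given as one of the examples in the itertools module documentation (see also this answer on SO). Iterating over the file this way is similar to calling readlines() on the file and then manually iterating over the resulting list, but it only reads a few lines into memory at a time.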
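To make the idea above concrete, here is a minimal sketch of how grouper and the "modify 4 lines" step might be filled in. The grouper definition follows the zip_longest-based example from the itertools documentation; truncate_line and trim_records are hypothetical helper names introduced only for this illustration, and the 20-character width comes from the question. Like the answer's snippet, this has not been timed on a 2GB file.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks, padding the last one with fillvalue."""
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)


def truncate_line(line, width=20):
    """Keep the first `width` characters of a line and preserve its trailing newline."""
    body = line.rstrip('\n')
    newline = line[len(body):]  # '\n' if the line had one, otherwise ''
    return body[:width] + newline


def trim_records(in_path, out_path, width=20):
    """Stream the input 4 lines at a time, shortening lines 2 and 4 of each group."""
    with open(in_path) as in_file, open(out_path, 'w') as out_file:
        # fillvalue='' keeps writelines() safe if the last group is incomplete
        for line1, line2, line3, line4 in grouper(in_file, 4, fillvalue=''):
            out_file.writelines([
                line1,
                truncate_line(line2, width),
                line3,
                truncate_line(line4, width),
            ])
```

Calling trim_records('100temp.txt', 'new_100temp.txt') would then process the file from the answer's snippet. Because each group is written out as soon as it is read, the sketch never holds the whole file or the whole output string in memory.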

The original question, "python - Processing speed - editing a large 2GB text file in Python", can be found on Stack Overflow: https://stackoverflow.com/questions/19480902/
