python - Delete a known exact row in a huge CSV

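I have a CSV file with about 220 million rows and 7 columns. I need to delete row 2636759. The file is 7.7GB, which is more than fits in memory. I'm most familiar with R, but could also do this in Python or bash.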

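I can't read or write this file in one operation. What is the best way to build the file incrementally on disk, rather than trying to build it all in memory?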

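I tried to find this on SO, but could only find how to do it for files small enough to read and write in memory, or for rows at the beginning of the file.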

Best answer

Python solution:

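import os

# Stream the file line by line, writing everything except one line to a temp file.
with open('tmp.csv', 'w') as tmp:
    with open('file.csv', 'r') as infile:
        # enumerate() counts from 0, so 10234 here is the 10235th physical line.
        # 10234 is only a placeholder; use enumerate(infile, start=1) if you
        # prefer to compare against a 1-based line number.
        for linenumber, line in enumerate(infile):
            if linenumber != 10234:
                tmp.write(line)

# Copy back to the original file. You can skip this if you don't
# mind (or prefer) having both files lying around.
with open('tmp.csv', 'r') as tmp:
    with open('file.csv', 'w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv')  # remove the temporary file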

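This duplicates the data, which may not be optimal if disk space is an issue. Writing in place, without first loading the whole file into RAM, is more complicated.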
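If the second pass that copies the data back is a concern, one option is to rename the temporary file over the original instead. The following is a minimal sketch, not part of the original answer: the function name `drop_line` and its 1-based `line_number` parameter are assumptions made for illustration. It still needs room for two copies on disk while it runs, but it writes the data only once.

```python
import os
import tempfile

def drop_line(path, line_number):
    """Rewrite `path` without its 1-based line `line_number`.

    Returns True if a line was dropped, False if the file was shorter.
    """
    directory = os.path.dirname(os.path.abspath(path))
    dropped = False
    # Create the temp file next to the original so the rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as tmp, open(path, 'rb') as infile:
            for number, line in enumerate(infile, start=1):
                if number == line_number:
                    dropped = True
                    continue
                tmp.write(line)
        # Rename over the original instead of copying the data back.
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return dropped
```

Note that `tempfile.mkstemp` creates the file with restrictive permissions, so the rewritten file may not keep the permissions of the original.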

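The key is that Python natively supports handling files as iterables. This means a file can be evaluated lazily, so you never need to hold the whole thing in memory at once.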
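The same lazy iteration can be used to look at a single line before deleting anything, which helps confirm that the line number points at the row you expect. This is a small sketch under that assumption; `peek_line` is a hypothetical helper name and `line_number` is taken to be 1-based.

```python
from itertools import islice

def peek_line(path, line_number):
    """Return the 1-based line `line_number` of `path`, or None if the file is shorter."""
    with open(path, 'r', newline='') as infile:
        # islice consumes the file iterator lazily, so earlier lines are
        # read and discarded one at a time rather than stored.
        return next(islice(infile, line_number - 1, line_number), None)
```

For the file in the question, `peek_line('file.csv', 2636759)` would return that line as a string, provided the row number is counted as a 1-based physical line (including any header line).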

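I like this solution if raw speed isn't your main concern, because you can replace the test linenumber != VALUE with any conditional test, for example one that checks whether a line includes a particular date: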

# True for lines that contain the text NOVEMBER
test = lambda line: 'NOVEMBER' in line
with open('tmp.csv', 'w') as tmp:
    ...
    if test(line):
    ...
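Since the snippet above is elided, here is one way the predicate idea could be written out in full. It is a sketch only: `filter_lines` and its `keep` parameter are hypothetical names, and the convention that `keep` returns True for lines to retain is an assumption.

```python
def filter_lines(src, dst, keep):
    """Copy the lines of `src` for which keep(line) is true into `dst`.

    Returns the number of lines written.
    """
    written = 0
    # newline='' leaves line endings untouched on every platform.
    with open(src, 'r', newline='') as infile, open(dst, 'w', newline='') as out:
        for line in infile:
            if keep(line):
                out.write(line)
                written += 1
    return written

# Hypothetical usage: drop every line that mentions NOVEMBER.
# filter_lines('file.csv', 'tmp.csv', lambda line: 'NOVEMBER' not in line)
```

Whether the predicate means "keep" or "drop" is a design choice; stating it in the parameter name avoids the ambiguity of a generic `test`.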

In-place read-writes and memory mapped file objects (which may be considerably faster) will require considerably more bookkeeping.
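To give a sense of that bookkeeping, here is a rough sketch of an in-place variant using plain seek, read, and write calls rather than a memory map. The name `drop_line_in_place`, the 1-based `line_number`, and the `chunk_size` default are all assumptions for illustration. It needs no extra disk space, but an interruption partway through leaves the file corrupted, so it is only reasonable when a backup exists.

```python
def drop_line_in_place(path, line_number, chunk_size=1024 * 1024):
    """Remove the 1-based line `line_number` from `path` without a temp file.

    Returns True if a line was removed, False if the file was shorter.
    """
    with open(path, 'r+b') as f:
        # Find the byte offset where the target line starts.
        start = 0
        for number, line in enumerate(f, start=1):
            if number == line_number:
                break
            start += len(line)
        else:
            return False  # file has fewer lines than line_number
        read_pos = start + len(line)  # first byte after the target line
        write_pos = start             # where the following data should land
        # Slide everything after the line backward, one chunk at a time.
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        # Cut off the leftover bytes at the end of the file.
        f.truncate(write_pos)
    return True
```

Only the part of the file after the deleted line is rewritten, so removing a line near the end is cheap, while removing one near the start still rewrites almost everything.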

Regarding python - Delete a known exact row in a huge CSV, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36779522/
