python - Iterating over a csv by column

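I have a bunch of large (~4 million values) csv files. I need to take each column and create a file that organizes the values in a way a different program can interpret. The columns vary widely in length (between 2 million and 1000 values), and each csv may have 4 to 100 columns.

I can load the whole thing into a pandas.DataFrame and then iterate over the series, but it is very slow: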

import pandas as pd
import re
import os
for f in os.listdir(folder):
    gc = pd.read_csv('{}/{}'.format(folder, f))
    strain = f[:-7] # files have regular name structure, this just gets the name

    with open('{}.txt'.format(strain), 'w+') as out_handle:
        for column in gc:
            series = gc[column]
            for i in range(len(series))[::10]:
                pos = i + 1
                gc_cont = series[i]
                if pd.isnull(gc_cont):
                    continue
                out_handle.write('{} {}'.format(pos, gc_cont))
                # I'm writing other info, but it's not important here

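Perhaps padding the smaller columns with a million-plus NaN values and loading the whole thing into memory carries a large performance cost? In any case, I think reading column by column would be more efficient, but I can't find a way to do that.

Pandas can read in chunks (docs), but that chunks the rows. If I write row by row, I would either have to keep 4-100 files open at once, or iterate over the original file multiple times to write each separate column. Is either of these approaches appropriate, or is there something I'm missing?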
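For illustration, the row-chunked variant described above (one output file kept open per column) might look roughly like the sketch below. The sample data, the function name split_columns_chunked, the per-column file naming, and the tiny chunk size are all assumptions made for the example, not part of the original code.

```python
import contextlib
import io
import os
import tempfile

import pandas as pd

SAMPLE_CSV = "a,b\n1.0,9.0\n2.0,\n3.0,\n"  # hypothetical sample data


def split_columns_chunked(source, out_dir, chunksize=2):
    """Read `source` in row chunks and append each column to its own file."""
    with contextlib.ExitStack() as stack:
        handles = {}  # column name -> open file handle
        for chunk in pd.read_csv(source, chunksize=chunksize):
            for name in chunk.columns:
                if name not in handles:
                    # Hypothetical naming scheme: one file per column name.
                    path = os.path.join(out_dir, "{}.txt".format(name))
                    handles[name] = stack.enter_context(open(path, "w"))
                # The row index continues across chunks, so index + 1 is the
                # 1-based position in the original file. NaN padding is skipped.
                for idx, value in chunk[name].dropna().items():
                    handles[name].write("{} {}\n".format(idx + 1, value))


out_dir = tempfile.mkdtemp()
split_columns_chunked(io.StringIO(SAMPLE_CSV), out_dir)
```

The ExitStack closes every handle when the block exits, so the number of simultaneously open files equals the number of columns (4 to 100 here).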

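Best answer

How about the usecols option of read_csv? You could also consider the squeeze option, which returns a pandas.Series; that might be faster if you only use a single column anyway. Something like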

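import pandas

cols = ['col0', 'col1', 'col2'] # the columns you want to load
for col in cols:
    # squeeze=True was deprecated in pandas 1.4 and removed in 2.0;
    # on newer versions, drop it and call .squeeze("columns") on the result
    data = pandas.read_csv(..., usecols=[col], squeeze=True)
    # format column data etc.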
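A self-contained version of that loop, as a sketch: it reads the header once with nrows=0 to discover the column names, then parses one column per pass and drops the NaN padding. The sample data, the iter_columns helper, and the source_factory argument are hypothetical names introduced for the example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for one of the large csv files.
# The shorter column is padded with empty fields, which pandas reads as NaN.
SAMPLE_CSV = "col0,col1\n0.5,0.1\n0.6,\n0.7,\n"


def iter_columns(source_factory):
    """Yield (name, Series) pairs, parsing a single column per pass.

    source_factory is a zero-argument callable returning a fresh path or
    file-like object, because the file is re-read once for every column.
    """
    # nrows=0 parses only the header row, so discovering the names is cheap.
    names = pd.read_csv(source_factory(), nrows=0).columns
    for name in names:
        frame = pd.read_csv(source_factory(), usecols=[name])
        # Drop the NaN padding of columns shorter than the longest one.
        yield name, frame[name].dropna()


columns = {name: s for name, s in iter_columns(lambda: io.StringIO(SAMPLE_CSV))}
```

This trades repeated passes over the file for holding only one column in memory at a time; whether that is faster than a single full read depends on the file and would need measuring.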

Here are the docs:

usecols : array-like

Return a subset of the columns. Results in much faster parsing time and lower memory usage.

squeeze : boolean, default False

If the parsed data only contains one column then return a Series
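Note that the squeeze keyword of read_csv was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions the same result can be obtained by squeezing the one-column DataFrame afterwards, as in this minimal sketch (the sample data is hypothetical):

```python
import io

import pandas as pd

SAMPLE_CSV = "col0,col1\n1,4\n2,5\n3,6\n"  # hypothetical sample data

# Parse only one column; without squeeze this is a one-column DataFrame.
frame = pd.read_csv(io.StringIO(SAMPLE_CSV), usecols=["col1"])

# DataFrame.squeeze("columns") collapses the single column into a Series.
series = frame.squeeze("columns")
```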

Regarding python - Iterating over a csv by column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32075425/
