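python - chunksize does not start from the first row of the CSV file

Using Python 3.

I have a very large CSV file that I need to split and save with to_csv. I use the chunksize parameter to decide how many rows go into each of the two files. The expectation is that the first piece of code reads the required rows so that I can save them to the first CSV file, and the second piece of code handles the remaining rows so that I can save them to the second CSV file.

For example, suppose the file has 3000 rows and I use the following code: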

file = pd.read_csv(r'file.csv',index_col=None, header='infer', encoding='ISO-8859-1',skiprows=None, chunksize=500)

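I used skiprows=None because I want it to start from the beginning and chunk the first 500 rows.

The second piece of code should then skip the first 500 rows and chunk the rest: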

file = pd.read_csv(r'file.csv',index_col=None, header='infer', encoding='ISO-8859-1',skiprows=500, chunksize=2500)

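However, the result I get from the first piece of code is that it always goes straight to the last 500 rows and chunks those, instead of starting from the beginning as expected. If chunksize always jumps to the last given number, it sounds like skiprows is not working as expected.

Any suggestions about what might be happening here would be appreciated.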

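Best answer

Whenever you pass a non-default value (anything other than None) for the chunksize parameter, pd.read_csv returns a TextFileReader iterator instead of a DataFrame. pd.read_csv() then reads your CSV file in chunks of the specified size: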

reader = pd.read_csv(filename, chunksize=N)
for df in reader:
    # process df (chunk) here
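
As a minimal sketch of the loop above, the following builds a hypothetical 3000-row CSV in memory (csv_text and the column name a are made up for illustration) and records where each chunk begins and ends. The question does not show how the reader was consumed, so this is only one possible explanation: if the chunks are consumed in a for loop and the loop variable is inspected afterwards, it holds the last chunk, even though iteration started at the first row.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.csv: one header line plus 3000 data rows.
csv_text = "a\n" + "\n".join(str(i) for i in range(3000))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=500)

# Record the first and last value of each chunk in the order they arrive.
bounds = []
for chunk in reader:
    bounds.append((int(chunk["a"].iloc[0]), int(chunk["a"].iloc[-1])))

first_chunk_bounds = bounds[0]   # the chunk that iteration yields first
last_chunk_bounds = bounds[-1]   # the chunk still bound to `chunk` after the loop
print(len(bounds), first_chunk_bounds, last_chunk_bounds)
```

To work with the first block only, next(reader) returns it without running through the whole file.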

So when chunksize is used, all chunks (except possibly the last one) have the same length. With the iterator parameter, you can instead decide how many rows to read on each call, using get_chunk(nrows):

In [66]: reader = pd.read_csv(fn, iterator=True)

Let's read the first three rows:

In [67]: reader.get_chunk(3)
Out[67]:
          a         b         c
0  2.229657 -1.040086  1.295774
1  0.358098 -1.080557 -0.396338
2  0.731741 -0.690453  0.126648

Now we read the next 5 rows:

In [68]: reader.get_chunk(5)
Out[68]:
          a         b         c
0 -0.009388 -1.549381  0.913128
1 -0.256654 -0.073549 -0.171606
2  0.849934  0.305337  2.360101
3 -1.472184  0.641512 -1.301492
4 -2.302152  0.417787  0.485958

And the next 7 rows:

In [69]: reader.get_chunk(7)
Out[69]:
          a         b         c
0  0.492314  0.603309  0.890524
1 -0.730400  0.835873  1.313114
2  1.393865 -1.115267  1.194747
3  3.038719 -0.343875 -1.410834
4 -1.510598  0.664154 -0.996762
5 -0.528211  1.269363  0.506728
6  0.043785 -0.786499 -1.073502
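
Applied to the split described in the question, the same idea might look like the sketch below. It assumes a 500-row first part and writes a small temporary file so that it runs on its own. The paths, the column name a, and the output names part1.csv and part2.csv are illustrative, not taken from the question.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for file.csv: a header plus 3000 data rows.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file.csv")
pd.DataFrame({"a": range(3000)}).to_csv(source, index=False)

reader = pd.read_csv(source, iterator=True)
first_part = reader.get_chunk(500)  # the first 500 data rows
rest_part = reader.read()           # everything that has not been read yet
reader.close()

# Illustrative output names; index=False keeps the row index out of the files.
part1_path = os.path.join(workdir, "part1.csv")
part2_path = os.path.join(workdir, "part2.csv")
first_part.to_csv(part1_path, index=False)
rest_part.to_csv(part2_path, index=False)

print(len(first_part), len(rest_part))
```

Because both parts come from one reader, the file is parsed only once and the second part continues exactly where the first one stopped.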

From the docs:

iterator : boolean, default False

Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize : int, default None

Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
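
If no iterator is wanted at all, a two-call variant closer to the code in the question is sketched below. It relies on the nrows parameter, which the answer above does not mention, and on passing a range to skiprows so that the header line is kept. The in-memory data and the column names a and b are made up for illustration. The naive call shows what a plain skiprows=500 does to the header.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.csv: header "a,b" plus 3000 data rows.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(3000))

# First part: read only the first 500 data rows.
first_part = pd.read_csv(io.StringIO(csv_text), nrows=500)

# Second part: skip file lines 1..500 (the first 500 data rows), keep line 0 (the header).
rest_part = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 501))

# For comparison: an integer skiprows also skips the header line,
# so the next data row ends up being used as the column names.
naive = pd.read_csv(io.StringIO(csv_text), skiprows=500)

print(list(rest_part.columns), list(naive.columns))
```

This parses the file twice, so for a very large file the single-reader approach is likely to be cheaper.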

Source: https://stackoverflow.com/questions/48250655/