pandas - Groupby on chunks can cause groups to be split between chunks

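I have some huge files that, because of their size, I need to read in chunks.

I want to perform a groupby on these files and then apply a function to each group.

The problem is that if the chunk size is 50,000 and a group spans rows 49998-50002, that group is split in two: one part lands in the first chunk and the other part in the second chunk. Is there a way to handle groups that straddle chunk boundaries?

Every solution I can think of feels un-pandas-like, so maybe I should just solve this by reading the table in two passes.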

Best answer

I don't know of any out-of-the-box feature for this, but you can roll your own along these lines:

remainder = pd.DataFrame()
for filename in filenames:                                            # 1
    for chunk in pd.read_csv(filename, index_col=[0], chunksize=300): # 2 
        grouped = chunk.groupby('group')
        for grp, nextgrp in iwindow(grouped, 2):                      # 3
            group_num, df = grp                                       # 4
            if nextgrp is None:
                # When nextgrp is None, grp is the last group
                remainder = pd.concat([remainder, df])                # 5
                break                                                 # 6
            if len(remainder):                                        # 7
                df = pd.concat([remainder, df])
                remainder = pd.DataFrame()
            print(filename)
            process(df)                                               # 8
if len(remainder):                                                    # 9         
    process(remainder)

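1. Obviously, we need to iterate over each file.
2. Read the file in chunks. chunksize=300 tells read_csv to read the file 300 rows at a time. I made it small for the example below; you can increase it to read more of the file in each iteration.
3. iwindow is a sliding-window utility function. It returns the items in grouped two at a time. For example,
```
In [117]: list(iwindow([1,2,3], 2))
Out[117]: [(1, 2), (2, 3), (3, None)]
```
4. df is a DataFrame with a constant group value (equal to group_num).
5. Don't process the last DataFrame, since it may be a partial group with more rows arriving in the next chunk. Save it in remainder.
6. Break out of the inner loop and continue with the next chunk (if there is one).
7. If remainder holds an unprocessed DataFrame, prepend it to df.
8. Finally, process df.
9. remainder may still hold the last unprocessed DataFrame, so process it now.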
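As a quick check of what chunksize counts and of how a group ends up split across chunks, here is a small sketch using made-up in-memory data (csv_text is hypothetical and stands in for a large file):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file: 7 data rows,
# sorted by group so that each group is contiguous.
csv_text = "group,A\n" + "\n".join(
    f"{g},{i}" for i, g in enumerate([0, 0, 0, 1, 1, 2, 2])
)

chunk_lengths = []
groups_per_chunk = []
# chunksize=3 yields DataFrames of at most 3 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    chunk_lengths.append(len(chunk))
    groups_per_chunk.append(sorted(chunk["group"].unique().tolist()))

print(chunk_lengths)     # [3, 3, 1]
print(groups_per_chunk)  # [[0], [1, 2], [2]]
```

Group 2 shows up in both the second and the third chunk, which is exactly the situation the remainder logic is meant to handle.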

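The general idea used here is helpful whenever you need to read a file in chunks but process it in units defined by some other delimiter. Essentially the same idea can be used to split a file into pieces delimited by a regex pattern.

For example,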
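To illustrate the regex case, here is a minimal sketch of the same hold-back-the-last-piece idea applied to text. The helper name iter_records and the sample data are hypothetical, and the sketch assumes the pattern is a fixed-length delimiter with no capturing groups:

```python
import io
import re

def iter_records(fileobj, pattern, chunksize=8192):
    """Yield pieces of text separated by `pattern`, reading the file
    `chunksize` characters at a time."""
    regex = re.compile(pattern)
    remainder = ""
    while True:
        chunk = fileobj.read(chunksize)
        if not chunk:
            break
        pieces = regex.split(remainder + chunk)
        # The last piece may be incomplete (or end in a partial delimiter),
        # so hold it back until more text has been read.
        remainder = pieces.pop()
        yield from pieces
    if remainder:
        # Whatever is left after the final read is the last record.
        yield remainder

# Tiny chunksize so that delimiters straddle chunk boundaries.
text = "a1\n---\nb22\n---\nc"
records = list(iter_records(io.StringIO(text), r"\n---\n", chunksize=4))
print(records)  # ['a1', 'b22', 'c']
```

A variable-length delimiter such as `-+` could be cut in two by a chunk boundary and produce spurious empty records, so this sketch would need extra care in that case.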

import itertools as IT
import numpy as np
import pandas as pd

def make_data(groupsize, ngroups, filenames):
    nfiles = len(filenames)
    group_num = np.repeat(np.arange(ngroups), groupsize) 
    arr = np.random.randint(10, size=(len(group_num), 2))
    arr = np.column_stack([group_num, arr])
    for arri, filename in zip(np.array_split(arr, nfiles), filenames):
        df = pd.DataFrame(arri, columns=['group','A','B'])
        df.to_csv(filename) 

def iwindow(iterable, n=2, fillvalue=None):
    """
    Returns a sliding window (of width n) over data from the sequence.
    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
    """
    iterables = IT.tee(iterable, n)
    iterables = (IT.islice(it, pos, None) for pos, it in enumerate(iterables))
    for result in IT.zip_longest(*iterables, fillvalue=fillvalue):
        yield result

def process(df):
    print(df)
    print('-'*80)

filenames = ['/tmp/data-{}.csv'.format(i) for i in range(3)]
make_data(groupsize=40, ngroups=5, filenames=filenames)

remainder = pd.DataFrame()
for filename in filenames:
    for chunk in pd.read_csv(filename, index_col=[0], chunksize=300):
        grouped = chunk.groupby('group')
        for grp, nextgrp in iwindow(grouped, 2):
            group_num, df = grp
            if nextgrp is None:
                # When nextgrp is None, grp is the last group
                remainder = pd.concat([remainder, df])
                break
            if len(remainder):
                df = pd.concat([remainder, df])
                remainder = pd.DataFrame()
            print(filename)
            process(df)
if len(remainder):
    process(remainder)
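
One caveat: the loop above assumes that the group held in remainder continues at the start of the next chunk. If a chunk happens to end exactly on a group boundary, remainder would be concatenated with rows from a different group. Here is a sketch of one way to guard against that, assuming the rows of each group are contiguous and the key column contains no NaN. iter_complete_groups is a hypothetical helper, not part of pandas:

```python
import io
import pandas as pd

def iter_complete_groups(chunks, key):
    """Yield (key_value, DataFrame) once per group, given an iterable of
    DataFrame chunks in which the rows of each group are contiguous."""
    remainder = None
    last_key = None
    for chunk in chunks:
        if remainder is not None:
            # Prepend the rows held back from the previous chunk.
            chunk = pd.concat([remainder, chunk])
        if chunk.empty:
            continue
        # Rows sharing the key of the final row may continue in the next
        # chunk, so hold them back; every other group is complete.
        last_key = chunk[key].iloc[-1]
        is_trailing = chunk[key] == last_key
        for key_value, df in chunk[~is_trailing].groupby(key, sort=False):
            yield key_value, df
        remainder = chunk[is_trailing]
    if remainder is not None:
        # No more chunks, so the held-back group is complete as well.
        yield last_key, remainder

# The first chunk (3 rows) ends exactly where group 0 ends.
csv_text = "group,A\n0,0\n0,1\n0,2\n1,3\n1,4\n1,5\n2,6\n"
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=3)
sizes = [(int(k), len(df)) for k, df in iter_complete_groups(chunks, "group")]
print(sizes)  # [(0, 3), (1, 3), (2, 1)]
```

To span several files, the per-file readers could be chained, for example with itertools.chain.from_iterable, before being passed in.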

Regarding "pandas - Groupby on chunks can cause groups to be split between chunks", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29308446/
