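python-3.x - Dask apply with a custom function

I am trying to use Dask, but I ran into a problem when using apply after a groupby.

I have a Dask DataFrame with a large number of rows. Consider the following example: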

import numpy as np
import pandas as pd
import dask.dataframe as dd

N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)

I want to bin the values of col_1, and I followed the solution from here:

bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)

where

def test_f(df,col,bins,labels):
    return df.assign(bin_num = pd.cut(df[col],bins,labels=labels))

This works as I expect.

Now I want to take the median of each bin (taken from here):

median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()

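There are 10 bins, so I expected the result to have 10 rows, but it actually has 80. The DataFrame has 8 partitions, so I guess apply is somehow acting on each partition separately.

However, if I want the mean instead and use mean: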

median = ddf2.groupby('bin_num')['col_1'].mean().compute()

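it works, and the output has 10 rows.

The question is: what am I doing wrong that prevents apply from behaving the way mean does?

Best answer

Perhaps this warning is the key (Dask doc: SeriesGroupBy.apply):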

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
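
To make the warning concrete, here is a minimal sketch that uses plain pandas only. The 8 row slices are an assumed stand-in for the 8 Dask partitions, and the variable names (partitions, per_pair, chunk, combined, collected) are illustrative, not part of any Dask API. It shows why a reduction applied per partition-group pair yields 8 x 10 = 80 rows, why mean can be combined across partitions, and how a median can be recovered by collecting each group's values first. The three stages loosely mirror the chunk, agg and finalize arguments that dask.dataframe.Aggregation accepts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10000
df = pd.DataFrame({"col_1": rng.random(N)})
bins = np.linspace(0, 1, 11)
df["bin_num"] = pd.cut(df["col_1"], bins, labels=list(range(len(bins) - 1)))

# Assumption: 8 contiguous row slices play the role of 8 Dask partitions.
size = N // 8
partitions = [df.iloc[i * size:(i + 1) * size] for i in range(8)]

# A reduction applied once per partition-group pair:
# every partition contributes its own row for every bin.
per_pair = pd.concat(
    [p.groupby("bin_num", observed=True)["col_1"].median() for p in partitions]
)
print(len(per_pair))

# mean decomposes cleanly: per-partition sums and counts (chunk),
# summed across partitions (agg), then divided (finalize).
chunk = pd.concat(
    [p.groupby("bin_num", observed=True)["col_1"].agg(["sum", "count"])
     for p in partitions]
)
combined = chunk.groupby(level=0, observed=True).sum()
mean_by_bin = combined["sum"] / combined["count"]

# median does not decompose that way, so collect each group's raw values
# per partition (chunk), merge them per bin (agg), then take the median (finalize).
collected = pd.concat(
    [p.groupby("bin_num", observed=True)["col_1"].apply(list) for p in partitions]
)
values = collected.explode().astype(float)
median_by_bin = values.groupby(level=0, observed=True).median()
print(len(mean_by_bin), len(median_by_bin))
```

Note that the median variant moves every value of a group into one place, so it gives up most of the benefit of partitioning; for large data an approximate quantile may be a better fit.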

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/60711871/