python - How do I use groupby describe with unstack in Python Dask?

Tags: python python-3.x dask dask-distributed dask-delayed

I am trying to use the describe() and unstack() functions in dask to get summary statistics for my data.

However, I get an error, as shown below:

import dask.dataframe as dd
df = dd.read_csv('Measurement_table.csv',assume_missing=True)
df.describe().compute()  # this works, but when I try to use `unstack`, I get an error

What I am actually trying to do is make the pandas code below run faster with the help of dask:

# Wrapped in parentheses so the multi-line method chain is valid Python
(df.groupby(['person_id','measurement_concept_id','visit_occurrence_id'])['value_as_number']
    .describe()
    .unstack()
    .swaplevel(0,1,axis=1)
    .reindex(df['readings'].unique(), axis=1, level=0))

I tried adding compute() to each output stage, as shown below:

df1 = df.groupby(['person_id','measurement_concept_id','visit_occurrence_id'])['value_as_number'].describe().unstack().swaplevel(0,1,axis=1).reindex(df['readings'].unique(), axis=1, level=0).compute()

I get the following error, although the same code runs fine in pandas.

Can anyone help me solve this?

Best answer

unstack is not implemented in dask, but describe can be used together with apply:

# Assumption: sd is a dask DataFrame with subject_id, readings and val columns
df = (sd.groupby(['subject_id','readings'])['val']
        .apply(lambda x: x.describe())
        .reset_index()
        .rename(columns={'level_2':'func'})
        .compute()
        )
print(df)
    subject_id readings   func        val
0            1   READ_1  count   2.000000
1            1   READ_1   mean   6.000000
2            1   READ_1    std   1.414214
3            1   READ_1    min   5.000000
4            1   READ_1    25%   5.500000
..         ...      ...    ...        ...
51           4  READ_09    min  45.000000
52           4  READ_09    25%  45.000000
53           4  READ_09    50%  45.000000
54           4  READ_09    75%  45.000000
55           4  READ_09    max  45.000000

[112 rows x 4 columns]
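
Since `.compute()` returns an ordinary pandas DataFrame, the wide layout the question asks for can probably be rebuilt afterwards in pandas. The following is a minimal sketch, assuming the grouped summary is small enough to fit in memory. The toy data and the names `raw`, `long_df` and `wide` are hypothetical and not part of the original answer; `long_df` is built with pandas alone as a stand-in for the computed dask result.

```python
import pandas as pd

# Hypothetical toy data; column names follow the answer above.
raw = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 2],
    "readings": ["READ_1", "READ_1", "READ_1", "READ_2", "READ_2"],
    "val": [5.0, 7.0, 3.0, 10.0, 20.0],
})

keys = ["subject_id", "readings"]

# Stand-in for the computed result: one row per (group, statistic),
# with the statistic name in "func" and its value in "val".
long_df = (raw.groupby(keys)["val"]
              .describe()
              .reset_index()
              .melt(id_vars=keys, var_name="func", value_name="val"))

# Pivot to the wide layout: one row per subject, columns keyed by
# (readings, statistic). This plays the role of unstack + swaplevel.
wide = long_df.pivot(index="subject_id", columns=["readings", "func"], values="val")

# Order the top column level by first appearance of each reading,
# as the reindex call in the question does.
wide = wide.reindex(raw["readings"].unique(), axis=1, level=0)

print(wide)
```

Doing the pivot after `.compute()` keeps the expensive aggregation in dask and leaves only the reshaping of the much smaller summary to pandas. Subjects with no rows for a given reading get NaN in that reading's columns.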

Regarding "python - How do I use groupby describe with unstack in Python Dask?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58425031/
