pandas - Rolling apply that returns a dict

Tags: pandas dataframe apply window-functions

I have a custom function that returns a dict, and I want to store that dict in a cell of every row:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


def custom_rolling_apply(arr):
    return {'sum': np.sum(arr), 'mean': np.mean(arr)}

df['rolling_dict'] = np.nan
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['rolling_dict'] = df['A'].rolling(window=3).apply(custom_rolling_apply, raw=True)

Why does this raise the following error?

TypeError: must be real number, not dict

pandas version: 1.5.3

Best answer

You should use rolling.aggregate instead of apply:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})


df['rolling_dict'] = np.nan
df['rolling_dict'] = df['rolling_dict'].astype('object')
df['A'].rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)

Output:

    sum  mean
0   NaN   NaN
1   NaN   NaN
2   6.0   2.0
3   9.0   3.0
4  12.0   4.0
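
If one dict per row is still the goal, the aggregated columns can be packed into dicts afterwards. The following is a minimal sketch rather than part of the original answer: it assumes string function names ('sum', 'mean') in place of np.sum and np.mean, and it assumes that rows without a full window should be left empty.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# String names select pandas' built-in rolling reductions;
# the result is a DataFrame with one column per statistic.
stats = df['A'].rolling(window=3).agg(['sum', 'mean'])

# Pack each complete row into a dict. Rows whose window is not yet
# full contain NaN and are stored as None (an assumption of this sketch).
df['rolling_dict'] = [
    None if row.isna().any() else row.to_dict()
    for _, row in stats.iterrows()
]

print(df)
```

The heavy lifting stays inside pandas' rolling reductions, and the Python-level loop only runs once per row to build the dicts.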

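From the rolling.apply documentation:

func : function. Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False. Can also accept a Numba JIT function with engine='numba' specified.

A dict is not a single numeric value, which is why apply raises the TypeError shown in the question.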
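
Since apply expects one value per window, a dict per row has to be built outside of apply. One way to do that is to iterate over the rolling windows directly, as the benchmark in this answer also does. This is a rough sketch, assuming pandas 1.1 or later (where Rolling objects can be iterated) and assuming that incomplete windows should be left empty:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
window = 3


def custom_rolling_apply(arr):
    return {'sum': np.sum(arr), 'mean': np.mean(arr)}


# Iterating a Rolling object yields one Series per row, including the
# shorter windows at the start, so incomplete windows are skipped here.
df['rolling_dict'] = [
    custom_rolling_apply(w.to_numpy()) if len(w) == window else None
    for w in df['A'].rolling(window=window)
]

print(df)
```

This keeps the original custom function unchanged, at the cost of a Python-level loop over every window.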

Note that applying a custom Python function window by window carries a performance penalty when the data is large:

import numpy as np
import timeit
import matplotlib.pyplot as plt
import pandas as pd


def custom_rolling_apply(arr):
    q={'sum':np.sum(arr), 'mean': np.mean(arr)}
    return q

def rolling_with_aggregate(arr):
    q=arr.rolling(window=3).aggregate({'sum': np.sum, 'mean': np.mean}, raw=True)
    return q

def profile_rolling_operation(data_size):
    rolling_times_1 = []
    rolling_times_2 = []
    data_sizes = []
    for i in range(1, data_size + 1):
        data_sizes.append(i)
        df = pd.DataFrame({'A': np.random.randint(1, 10, i)})
        elapsed_time_1 = timeit.timeit(lambda: [custom_rolling_apply(arr) for arr in df['A'].rolling(window=3)], number=2)
        rolling_times_1.append(elapsed_time_1)
        elapsed_time_2 = timeit.timeit(lambda: rolling_with_aggregate(df['A']), number=2)
        rolling_times_2.append(elapsed_time_2)
    return data_sizes, rolling_times_1, rolling_times_2

max_data_size = 1000
data_sizes, rolling_times_1, rolling_times_2 = profile_rolling_operation(max_data_size)

plt.plot(data_sizes, rolling_times_1, label='Custom Rolling Apply')
plt.plot(data_sizes, rolling_times_2, label='Rolling with Aggregate')
plt.xlabel('Data Size')
plt.ylabel('Execution Time (seconds)')
plt.title('Comparison')
plt.legend()
plt.show()


Regarding "pandas - Rolling apply that returns a dict", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76613293/
