Python/Pandas - Performance - Trying to skip a for loop in a whole-column operation

I have a DataFrame named target:

target:

          group  estimation_error
170  64.22-1-00          0.061829
72   64.22-1-00          2.242214
121  35.12-3-00         31.960277
99   64.22-1-00          4.819315
19   35.12-3-00          0.850597

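I want to create a new column, median_group_error, that holds the median of estimation_error over all rows in the same group. It would look like this: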

          group  estimation_error median_group_error
170  64.22-1-00          0.061829           2.242214
72   64.22-1-00          2.242214           2.242214
121  35.12-3-00         31.960277          16.405437
99   64.22-1-00          4.819315           2.242214
19   35.12-3-00          0.850597          16.405437

I can get this result with the following loop, which writes to a column named group_median_error:

target['group_median_error']=""
groups=target.groupby('group')

for i in target.index:
    try:
        target['group_median_error'][i]=(groups.get_group(target.group[i])).estimation_error.median()
    except KeyError:
        pass

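However, because this is a large DataFrame, it takes too long. I believe that skipping the for loop would give a considerable performance gain.

To that end, I tried replacing the for loop with the following: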

target['group_median_error']=(groups.get_group(target.group)).estimation_error.median()

But it fails with the following error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed
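
A likely reading of this error, offered as a sketch rather than a definitive diagnosis: get_group expects a single group key and looks it up much like a dictionary key, so passing the whole target.group Series asks Python to hash a Series, which pandas does not allow. The snippet below reproduces the same failure with a plain dict (the lookup table and its values are hypothetical, not from the question) and shows Series.map as the element-wise alternative. The exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical stand-ins: a Series of group keys and a per-group lookup table.
group_keys = pd.Series(["64.22-1-00", "35.12-3-00"])
lookup = {"64.22-1-00": 0, "35.12-3-00": 1}

# Using the whole Series as one dictionary key requires hashing it,
# which raises TypeError because Series objects are mutable.
try:
    lookup[group_keys]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
else:
    error_message = None

# Series.map looks up each element separately, so no Series is hashed.
per_row = group_keys.map(lookup)
print(per_row)
```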

So my questions are:

Is there a way to do the same operation without a for loop?
Would skipping the loop improve performance?

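Best answer

This can be done in a vectorized way, without a loop, using groupby with transform (df here is the question's target DataFrame):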

In [11]: df['median_group_error'] = \
            df.groupby('group')['estimation_error'].transform('median')

In [12]: df
Out[12]:
          group  estimation_error  median_group_error
170  64.22-1-00          0.061829            2.242214
72   64.22-1-00          2.242214            2.242214
121  35.12-3-00         31.960277           16.405437
99   64.22-1-00          4.819315            2.242214
19   35.12-3-00          0.850597           16.405437
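
For readers who want to run this end to end, here is a minimal self-contained sketch that rebuilds the five sample rows from the question and applies the same transform call. The name medians and the map-based variant are illustrative additions, not part of the original answer; they show an equivalent two-step form that computes one median per group and then maps it back onto the rows.

```python
import pandas as pd

# Rebuild the sample data from the question, keeping its original index labels.
target = pd.DataFrame(
    {
        "group": ["64.22-1-00", "64.22-1-00", "35.12-3-00",
                  "64.22-1-00", "35.12-3-00"],
        "estimation_error": [0.061829, 2.242214, 31.960277,
                             4.819315, 0.850597],
    },
    index=[170, 72, 121, 99, 19],
)

# transform returns one value per original row, aligned to the original index,
# so the result can be assigned directly as a new column.
target["median_group_error"] = (
    target.groupby("group")["estimation_error"].transform("median")
)

# Illustrative alternative: one median per group, then map it onto each row.
medians = target.groupby("group")["estimation_error"].median()
mapped = target["group"].map(medians)

print(target)
```

Both forms avoid calling get_group once per row, which is the step the original loop repeats for every index label.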

This answer comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/44808468/
