python - Dask dataframe apply gives unexpected results when passing a local variable as an argument

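When I call the apply method of a dask DataFrame inside a for loop, using the loop variable as an argument to apply, I get unexpected results when the computation is performed later. This example shows the behavior: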

import dask.dataframe as dd
import pandas as pd
import random
import numpy as np

df = pd.DataFrame({'col_1':random.sample(range(10000), 10000), 
                   'col_2': random.sample(range(10000), 10000) })
ddf = dd.from_pandas(df, npartitions=8)

def myfunc(x, channel):
    return channel

for ch in ['ch1','ch2']:
    # np.str_ replaces np.unicode_, an alias that was removed in NumPy 2.0
    ddf[f'df_apply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1, meta=(f'df_apply_{ch}', np.str_))

print(ddf.head(5))

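From the row-wise application of myfunc, I expected to see two additional columns, one containing 'ch1' in every row and one containing 'ch2'. However, this is the output of the script: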

   col_1  col_2 df_apply_ch1 df_apply_ch2
0   5485   2234          ch2          ch2
1   6338   6802          ch2          ch2
2   9408   5760          ch2          ch2
3   8447   1451          ch2          ch2
4   1230   3838          ch2          ch2

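Apparently, the final iteration of the loop overrides the argument passed to apply in the first iteration. In fact, any later change to ch between the loop and the call to head affects the result in the same way, overwriting what I expected to see in both columns.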

This is not what one sees when doing the same exercise in pure pandas. I also found a workaround for dask:

def myapply(ddf, ch):
    # np.str_ replaces np.unicode_, an alias that was removed in NumPy 2.0
    ddf[f'myapply_{ch}'] = ddf.apply(lambda row: myfunc(row, ch), axis=1, meta=(f'myapply_{ch}', np.str_))

for ch in ['ch1','ch2']:
    myapply(ddf, ch)

print(ddf.head(10))

This gives:

   col_1  col_2 myapply_ch1 myapply_ch2
0   7394   3528         ch1         ch2
1   2181   6681         ch1         ch2
2   7945   1063         ch1         ch2
3   5164   8091         ch1         ch2
4   3569   2889         ch1         ch2

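So I figured out that this has to do with the scope of the variable used as the argument to apply, but I don't understand why this happens (only) in dask. Is this the intended/expected behavior?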

Any insights would be appreciated! :)

Best answer

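This turns out to be a duplicate after all; see the question on Stack Overflow, which includes another workaround. A more detailed explanation of the behavior can be found in the corresponding issue on the dask tracker: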

This isn't a bug, this is just how python works. Closures evaluate based on the defining scope, if you change the value of trig in that scope then the closure will evaluate differently. The issue here is that this code would run fine in pandas, since there is an evaluation in each loop, but in dask all the evaluations are delayed until later, and thus all use the same value for trig.

Here, trig is the loop variable used in that discussion.
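The closure behavior described in the quote can be reproduced without dask. The following is a minimal sketch in plain Python that reuses myfunc from the question. The names deferred, eager_results, and late_results are hypothetical and introduced only for illustration, and storing the lambdas for a later call is just a stand-in for delayed evaluation, not a description of dask's internals.

```python
def myfunc(x, channel):
    return channel

deferred = []       # callables stored now and called later (stand-in for lazy evaluation)
eager_results = []  # results computed inside the loop (as pandas would do)

for ch in ['ch1', 'ch2']:
    func = lambda row: myfunc(row, ch)  # ch is looked up when func is called, not here
    eager_results.append(func(None))    # evaluated immediately, sees the current ch
    deferred.append(func)

# By now the loop has finished and ch == 'ch2', so every stored lambda sees 'ch2'.
late_results = [f(None) for f in deferred]

print(eager_results)  # ['ch1', 'ch2']
print(late_results)   # ['ch2', 'ch2']
```

The eager list matches what pandas produces, while the late list matches the dask output shown in the question.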

So this is not a bug but a feature of Python, one that is triggered by dask and not by pandas.
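Given that explanation, the usual remedy is to bind the current value of the loop variable when the function is created. The sketch below shows three common ways to do that in plain Python, again reusing myfunc from the question. The helper make_func and the list names are hypothetical, and these are not necessarily the workaround mentioned in the linked question.

```python
from functools import partial

def myfunc(x, channel):
    return channel

def make_func(channel):
    # Each call creates a new scope, so channel is fixed for the returned lambda.
    # This is why the myapply helper in the question works.
    return lambda row: myfunc(row, channel)

via_default, via_partial, via_factory = [], [], []
for ch in ['ch1', 'ch2']:
    # A default argument is evaluated once, when the lambda is defined.
    via_default.append(lambda row, ch=ch: myfunc(row, ch))
    # partial stores the current value of ch as a keyword argument.
    via_partial.append(partial(myfunc, channel=ch))
    # A factory function captures the value in its own scope.
    via_factory.append(make_func(ch))

# Call everything only after the loop has finished, as a lazy framework would.
for funcs in (via_default, via_partial, via_factory):
    print([f(None) for f in funcs])  # ['ch1', 'ch2'] each time
```

In the question's loop, any of these bound callables could presumably be passed to ddf.apply in place of the original lambda, so that each column keeps its own channel value.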

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/57379734/
