python - How to use pool.starmap() on a pandas DataFrame?

Following the second answer to this post, I tried the code below:

from multiprocessing import Pool
import numpy as np
from itertools import repeat
import pandas as pd

def doubler(number, r):
    result = number * 2 + r
    return result

def f1():
    return np.random.randint(20)

if __name__ == '__main__':
    df = pd.DataFrame({"A": [10,20,30,40,50,60], "B": [-1,-2,-3,-4,-5,-6]})
    num_chunks = 3
    # break df into 3 chunks
    chunks_dict = {i:np.array_split(df, num_chunks)[i] for i in range(num_chunks)}

    arg1 = f1()

    with Pool() as pool:
        results = pool.starmap(doubler, [zip(chunks_dict[i]['B'], repeat(arg1)) for i in range(num_chunks)])

    print(results)

>>> [(-1, 20, -1, 20, -2, 20), (-3, 20, -3, 20, -4, 20), (-5, 20, -5, 20, -6, 20)]

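This is not the result I want. I want each element of column B of df, together with the output of f1, to be passed to the doubler function (which is why I used starmap and repeat), so that I get back a list of the inputs doubled with the random integer added.

For example, if f1 returns 2, I want to get back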

>>> [0,-2,-4,-6,-8,-10] # [2*(-1) + 2, 2*(-2) + 2, ... ]

Can anyone tell me how to get this expected result? Thanks.

Edit: passing in the whole DataFrame does not work either:

with Pool() as pool:
    results = pool.starmap(doubler, [zip(df['B'], repeat(arg1))])

>>> TypeError: doubler() takes 2 positional arguments but 6 were given

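Essentially, I just want to split my DataFrame into chunks and then pass those chunks, along with other variables (arg1), into a function that accepts multiple arguments.

Best answer

Your arguments look wrong. For example, after adding a print of the arguments inside doubler, I see the following (assuming f1() returns 2):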

doubler number (-3, 2) r (-4, 2)
doubler number (-1, 2) r (-2, 2)
doubler number (-5, 2) r (-6, 2)

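This happens because each item passed to starmap is a zip object rather than a single (number, r) tuple. starmap unpacks every item into positional arguments, so each (B, r) pair inside the zip arrives as one whole argument to doubler.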
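The unpacking can be reproduced without any processes. The sketch below uses itertools.starmap as a stand-in, on the assumption that only the argument unpacking matters here: like Pool.starmap, it calls the function with each element of the iterable unpacked as positional arguments. The two-element b_values list is an illustrative chunk, not data from the question.

```python
from itertools import repeat, starmap


def doubler(number, r):
    return number * 2 + r


b_values = [-1, -2]  # illustrative chunk of column B
r = 2                # stands in for the value returned by f1()

# What the question passes: a list whose only element is a zip object.
# starmap unpacks that element, so the call is doubler((-1, 2), (-2, 2)),
# and tuple repetition/concatenation produces one long tuple.
wrong = list(starmap(doubler, [zip(b_values, repeat(r))]))
print(wrong)  # [(-1, 2, -1, 2, -2, 2)]

# What starmap expects: an iterable with one (number, r) tuple per call.
right = list(starmap(doubler, zip(b_values, repeat(r))))
print(right)  # [0, -2]
```

The only difference between the two calls is the extra pair of square brackets around the zip.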

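I think it is much easier to rewrite the chunking and the argument generation. Assuming I understand correctly, you want to generate the following list of tuples as arguments (assuming f1() returns 2):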

[(-1, 2), (-2, 2), (-3, 2), (-4, 2), (-5, 2), (-6, 2)]

starmap then applies doubler to each tuple, so it returns [doubler(-1, 2), doubler(-2, 2), ..., doubler(-6, 2)], i.e. [0, -2, -4, -6, -8, -10]. Try this:

from multiprocessing import Pool
import numpy as np
from itertools import repeat
import pandas as pd


def doubler(number, r):
    result = number * 2 + r
    return result


def f1():
    return np.random.randint(20)


if __name__ == '__main__':
    df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 60], "B": [-1, -2, -3, -4, -5, -6]})
    num_processes = 3

    # the "r" value to use with every "B" value
    random_r = f1()

    # build one (b, r) tuple per B value; repeat() supplies the same r every time
    tuples = list(zip(df["B"].tolist(), repeat(random_r)))
    print(tuples)

    with Pool(num_processes) as pool:
        results = pool.starmap(doubler, tuples)

    print(results)
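If the goal is to hand whole chunks of the DataFrame to each worker, as the question describes, the same idea applies with one (chunk, r) tuple per chunk. The following is a minimal sketch, not part of the original answer: doubler_chunk and make_chunk_args are hypothetical helper names, r is fixed at 2 instead of calling f1(), and the row positions are split with np.array_split and sliced with iloc rather than splitting the DataFrame directly.

```python
from itertools import chain
from multiprocessing import Pool

import numpy as np
import pandas as pd


def doubler_chunk(chunk, r):
    # Hypothetical chunk-level variant of doubler: takes a DataFrame slice
    # and returns the doubled B values plus r as a plain list.
    return (chunk["B"] * 2 + r).tolist()


def make_chunk_args(df, r, num_chunks):
    # Split the row positions into num_chunks groups, then build one
    # (chunk, r) tuple per group for starmap to unpack.
    positions = np.array_split(np.arange(len(df)), num_chunks)
    return [(df.iloc[idx], r) for idx in positions]


if __name__ == "__main__":
    df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 60], "B": [-1, -2, -3, -4, -5, -6]})
    args = make_chunk_args(df, r=2, num_chunks=3)  # r=2 is an assumed value

    with Pool(3) as pool:
        per_chunk = pool.starmap(doubler_chunk, args)  # one list per chunk

    # Flatten the per-chunk lists back into a single list in row order.
    results = list(chain.from_iterable(per_chunk))
    print(results)  # [0, -2, -4, -6, -8, -10]
```

Sending chunks instead of single values means fewer, larger tasks, which may reduce inter-process overhead on bigger frames; for six rows the difference is not meaningful.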

Regarding "python - How to use pool.starmap() on a pandas DataFrame?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47204452/
