python - How do I share a pandas DataFrame object between processes?

Tags: python multithreading pandas multiprocessing object-sharing

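This is the same question as the one I linked to earlier.

( Is there a good way to avoid memory deep copy or to reduce time spent in multiprocessing? )

I have made no progress since I ran into the problem of sharing a "DataFrame" object.

I have simplified the sample code.

If anyone experienced could modify my code so that it shares the "DataFrame" object between processes without Manager.list, Manager.dict, or numpy sharedmem, I would be grateful.

Here is the code.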

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
from multiprocessing import Process
import multiprocessing.sharedctypes as sharedctypes
import ctypes

def add_new_derived_column(shared_df_obj):
    shared_df_obj.value['new_column'] = shared_df_obj.value['A'] + shared_df_obj.value['B'] / 2
    print(shared_df_obj.value.head())
    '''
    "new_column" Generated!!!

          A         B  new_column
0 -0.545815 -0.179209   -0.635419
1  0.654273 -2.015285   -0.353370
2  0.865932 -0.943028    0.394418
3 -0.850136  0.464778   -0.617747
4 -1.077967 -1.127802   -1.641868
    '''

if __name__ == "__main__":

    dataframe = pd.DataFrame(np.random.randn(100000, 2), columns=['A', 'B'])

    # to share the DataFrame object, I use sharedctypes.RawValue
    shared_df_obj = sharedctypes.RawValue(ctypes.py_object, dataframe)

    # then I pass "shared_df_obj" to a multiprocessing.Process object
    process = Process(target=add_new_derived_column, args=(shared_df_obj,))
    process.start()
    process.join()

    print(shared_df_obj.value.head())
    '''
    "new_column" disappeared.
    the DataFrame object isn't shared.

          A         B
0 -0.545815 -0.179209
1  0.654273 -2.015285
2  0.865932 -0.943028
3 -0.850136  0.464778
4 -1.077967 -1.127802
    '''

Best answer

You can use a Manager Namespace; the following code works as expected.

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
from multiprocessing import Manager, Process

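def add_new_derived_column(ns):
    # Reading ns.df returns a copy of the DataFrame held by the manager process
    dataframe2 = ns.df
    dataframe2['new_column'] = dataframe2['A'] + dataframe2['B'] / 2
    print(dataframe2.head())
    # Assign back so the manager process stores the modified DataFrame
    ns.df = dataframe2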

if __name__ == "__main__":

    mgr = Manager()
    ns = mgr.Namespace()

    dataframe = pd.DataFrame(np.random.randn(100000, 2), columns=['A', 'B'])
    ns.df = dataframe
    print (dataframe.head())

    # pass the Namespace proxy "ns" to a multiprocessing.Process object
    process = Process(target=add_new_derived_column, args=(ns,))
    process.start()
    process.join()

    print (ns.df.head())
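
Note that the final `ns.df = dataframe2` in the code above matters. A Namespace proxy moves values to and from the manager process by pickling them, so `ns.df` hands back an independent copy, and changes made to that copy in place are not seen by other processes until it is assigned back. The following is only a minimal sketch of that behaviour: it imitates the proxy with a plain pickle round trip instead of starting a real Manager, and `roundtrip` is a hypothetical helper introduced for illustration.

```python
import pickle

import pandas as pd


def roundtrip(obj):
    """Hypothetical helper: serialize and rebuild an object, which is
    roughly what happens to a value passing through a manager proxy."""
    return pickle.loads(pickle.dumps(obj))


# Stands in for the DataFrame stored inside the manager process.
stored = pd.DataFrame({"A": [1.0, 2.0], "B": [4.0, 6.0]})

# Analogous to `dataframe2 = ns.df`: the caller receives a copy.
fetched = roundtrip(stored)
fetched["new_column"] = fetched["A"] + fetched["B"] / 2

# The stored DataFrame is untouched by the in-place change.
visible_before_reassign = "new_column" in stored.columns

# Analogous to `ns.df = dataframe2`: the modified copy replaces the stored one.
stored = roundtrip(fetched)
visible_after_reassign = "new_column" in stored.columns

print(visible_before_reassign, visible_after_reassign)
```

Because every read and write goes through this kind of copy, the Namespace approach gives all processes a consistent view of the DataFrame but does not avoid copying it, which is worth keeping in mind for large frames.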

For "python - How do I share a pandas DataFrame object between processes?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19887087/
