python - Why does accessing shared memory take longer than loading from a file?

Tags: python multiprocessing shared-memory

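I have a very large file loaded into my main process. My goal is to have several processes read it from memory at the same time, both to stay within memory limits and to make things faster.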

According to this answer, I should use Shared ctypes Objects:

Manager types are built for flexibility not efficiency ... this necessarily means copying whatever object is in question. .... If you want shared physical memory, I suggest using Shared ctypes Objects. These actually do point to a common location in memory, and therefore are much faster, and resource-light.

So I did this:

import time
import pickle
import multiprocessing
from functools import partial

def foo(_, v):
    tp = time.time()
    v = v.value
    print(hex(id(v)))
    print(f'took me {time.time()-tp} in process')

if __name__ == '__main__':
    # creates a file which is about 800 MB
    with open('foo.pkl', 'wb') as file:
        pickle.dump('aaabbbaa'*int(1e8), file, protocol=pickle.HIGHEST_PROTOCOL)

    t1 = time.time()
    with open('foo.pkl', 'rb') as file:
        contract_conversion = pickle.load(file)
    print(f'load took {time.time()-t1}')

    m = multiprocessing.Manager()
    vm = m.Value(str, contract_conversion, lock=False)  # no lock: the workers only read the value
    foo_p = partial(foo, v=vm)

    tpo = time.time()
    with multiprocessing.Pool() as pool:
        pool.map(foo_p, range(4))
    print(f'took me {time.time()-tpo} for pool stuff')

But I can see that each process works on its own copy (RAM usage is very high in every process), and it is much slower than simply reading from disk.


Output:

load took 0.8662333488464355
0x1c736ca0040
took me 2.286606550216675 in process
0x15cc0404040
took me 3.178203582763672 in process
0x1f30f049040
took me 4.179721355438232 in process
0x21d2c8cc040
took me 4.913192510604858 in process
took me 5.251579999923706 for pool stuff

The ids differ as well, though I am not sure whether an id is just a Python identifier or an actual memory location.

Best answer

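You are not using shared memory. That would be multiprocessing.Value, not multiprocessing.Manager().Value. You are storing the string in the manager's server process, and every access to the value sends a pickle of it over an inter-process connection. On top of that, the server process is limited by its own GIL while it services requests.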
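To make the copying concrete, here is a minimal in-process sketch of what one read of the proxy's value amounts to: the object is pickled on one side and unpickled on the other. The helper name fetch_like_proxy is hypothetical, and the sketch deliberately leaves out the connection and the server process, so it only shows the serialization step.

```python
import pickle


def fetch_like_proxy(obj):
    """Mimic one proxy read: pickle the object, then unpickle it again.

    Hypothetical helper for illustration; a real manager proxy also
    moves the payload between two processes.
    """
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(payload), len(payload)


if __name__ == '__main__':
    data = 'aaabbbaa' * 1000
    copy, nbytes = fetch_like_proxy(data)
    # The result is equal to the original but is a separate object,
    # and the payload is at least as large as the string itself.
    print(copy == data, copy is data, nbytes)
```

With four workers each reading the value once, the full string is serialized and deserialized four times, which is consistent with the per-process memory growth reported in the question.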

I don't know how much each of these factors contributes to the overhead, but overall it is more expensive than reading from shared memory.
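For comparison, a hedged sketch of what real shared memory could look like for this case. It uses multiprocessing.sharedctypes.RawArray instead of multiprocessing.Value, because a Value holds a single ctypes object while the data here is a large buffer. It also assumes the text is ASCII and that workers only read. The names make_shared, init_worker and read_chunk are hypothetical.

```python
import multiprocessing
from multiprocessing.sharedctypes import RawArray

_shared = None  # set once per worker by init_worker


def make_shared(text):
    """Copy the text once into a shared, lock-free array of bytes."""
    data = text.encode('ascii')  # assumption: the text is plain ASCII
    buf = RawArray('c', len(data))
    buf.raw = data
    return buf


def init_worker(buf):
    """Pool initializer: remember the shared buffer in this worker."""
    global _shared
    _shared = buf


def read_chunk(bounds):
    """Return one slice of the shared buffer; only the slice is copied."""
    start, stop = bounds
    return _shared[start:stop].decode('ascii')


if __name__ == '__main__':
    shared = make_shared('aaabbbaa' * 1000)
    bounds = [(i * 8, i * 8 + 8) for i in range(4)]
    # The buffer is handed over at worker start-up, not per task.
    with multiprocessing.Pool(4, initializer=init_worker,
                              initargs=(shared,)) as pool:
        print(pool.map(read_chunk, bounds))
```

The buffer is passed through the pool initializer because shared ctypes objects are meant to be inherited when a worker is created, not sent with each task. Each worker then copies only the slices it reads. Turning the whole buffer back into a str inside a worker would create a full per-process copy again.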
