I have a very large file loaded into my main process. My goal is to have several processes read from that memory simultaneously, to avoid memory constraints and to make things faster.
According to this answer, I should use Shared ctypes Objects:
Manager types are built for flexibility not efficiency ... this necessarily means copying whatever object is in question. .... If you want shared physical memory, I suggest using Shared ctypes Objects. These actually do point to a common location in memory, and therefore are much faster, and resource-light.
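For reference, "Shared ctypes Objects" means things like multiprocessing.Value and multiprocessing.Array, which are backed by a single block of shared memory rather than a server process. A minimal sketch (names are illustrative, not from the question):

```python
import multiprocessing

def worker(arr):
    # The child reads straight from the shared buffer; the payload is not copied
    return arr[:3]

if __name__ == '__main__':
    # 'c' creates a shared char (bytes) array in one common memory block;
    # lock=False returns the raw ctypes array without a synchronizing wrapper
    arr = multiprocessing.Array('c', b'abcdef', lock=False)
    p = multiprocessing.Process(target=worker, args=(arr,))
    p.start()
    p.join()
```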
So I did this:
import time
import pickle
import multiprocessing
from functools import partial

def foo(_, v):
    tp = time.time()
    v = v.value
    print(hex(id(v)))
    print(f'took me {time.time()-tp} in process')

if __name__ == '__main__':
    # creates a file which is about 800 MB
    with open('foo.pkl', 'wb') as file:
        pickle.dump('aaabbbaa'*int(1e8), file, protocol=pickle.HIGHEST_PROTOCOL)

    t1 = time.time()
    with open('foo.pkl', 'rb') as file:
        contract_conversion = pickle.load(file)
    print(f'load took {time.time()-t1}')

    m = multiprocessing.Manager()
    vm = m.Value(str, contract_conversion, lock=False)  # not locked because I only read from it, so it's safe
    foo_p = partial(foo, v=vm)

    tpo = time.time()
    with multiprocessing.Pool() as pool:
        pool.map(foo_p, range(4))
    print(f'took me {time.time()-tpo} for pool stuff')
But I can see that each process uses its own copy (the RAM usage of each process is very high), and it is much slower than simply reading from disk.
The output:
load took 0.8662333488464355
0x1c736ca0040
took me 2.286606550216675 in process
0x15cc0404040
took me 3.178203582763672 in process
0x1f30f049040
took me 4.179721355438232 in process
0x21d2c8cc040
took me 4.913192510604858 in process
took me 5.251579999923706 for pool stuff
The ids are not identical either, but I'm not sure whether id is just a Python identifier or an actual memory location.
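As for that last point: in CPython specifically, id() happens to be the object's memory address (an implementation detail, not a language guarantee), so the four differing values above do indicate four separate copies. A quick CPython-only check:

```python
import ctypes

s = 'some string'
# CPython implementation detail: id(x) is x's memory address,
# so casting that address back as a py_object yields the very same object
recovered = ctypes.cast(id(s), ctypes.py_object).value
assert recovered is s
```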
Best Answer
You are not using shared memory. That would be multiprocessing.Value, not multiprocessing.Manager().Value. You are storing the string in the manager's server process and sending pickles over a TLS connection every time you access the value. On top of that, the server process is limited by its own GIL while serving requests.
I don't know how much each of these aspects contributes to the overhead, but collectively it is far more expensive than reading shared memory.
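For a large read-only blob like this, one way to get actual zero-copy sharing (Python 3.8+) is multiprocessing.shared_memory: the parent creates one named block, and each worker maps that same block by name. This is a minimal sketch under the question's setup (payload shrunk for illustration), not the poster's code:

```python
from multiprocessing import Pool, shared_memory

def reader(name):
    # Attach to the existing block by name; the payload itself is not copied
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:8])  # read directly from the mapped memory
    finally:
        shm.close()

if __name__ == '__main__':
    data = b'aaabbbaa' * 1000  # stand-in for the 800 MB payload
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data
    try:
        with Pool(4) as pool:
            # Only the short block name is pickled to the workers
            results = pool.map(reader, [shm.name] * 4)
        assert all(r == b'aaabbbaa' for r in results)
    finally:
        shm.close()
        shm.unlink()  # creator is responsible for freeing the block
```

Note that only the block's name travels through the pool, so the cost per worker is a mmap of the existing segment rather than an 800 MB pickle round-trip through a manager process.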
On "python - Why does accessing shared memory take longer than loading from a file?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52609672/