PyTorch Dataset and shared memory?

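I would like to cache data in a torch.utils.data.Dataset. The simple solution is to keep certain tensors in a member of the dataset. However, since the torch.utils.data.DataLoader class spawns multiple processes, the cache would be local to each instance, and I could end up caching multiple copies of the same tensors. Is there a way to use Python's multiprocessing library to share data between the different loader processes?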

Best answer

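The answer depends on your operating system and settings. If you are on Linux with the default process start method, you do not need to worry about duplication or inter-process communication, because the worker processes share memory. This is implemented efficiently as inter-process communication (IPC) through shared memory (more details here).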
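Since everything here hinges on the start method, it can help to check which one is in effect. A minimal sketch using only the standard library; the helper name describe_start_method is made up for illustration, and the default it reports can differ between platforms and Python versions:

```python
import multiprocessing as mp


def describe_start_method():
    """Return the default start method and whether it is fork."""
    method = mp.get_start_method()
    # Only fork gives workers a cloned view of the parent's address space.
    inherits_parent_memory = method == "fork"
    return method, inherits_parent_memory


if __name__ == "__main__":
    method, inherits = describe_start_method()
    print(f"start method: {method}; workers inherit parent memory: {inherits}")
```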
On Windows, things are considerably more complicated. From the documentation:

Since workers rely on Python multiprocessing, worker launch behavior is different on Windows compared to Unix.

On Unix, fork() is the default multiprocessing start method. Using fork(), child workers typically can access the dataset and Python argument functions directly through the cloned address space.

On Windows, spawn() is the default multiprocessing start method. Using spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives the dataset, collate_fn and other arguments through pickle serialization.

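This means that Dataset members that are already populated when the workers are forked are available in every worker on Linux, with no extra copying or communication. That's great! Note that fork gives each worker a copy-on-write view of the parent's memory, so entries a worker adds to an ordinary Python attribute afterwards stay local to that worker unless they live in shared memory. On Windows, the processes do not receive such updates at all (they only receive the Dataset when they are spawned), so you should use an inter-process communication scheme, for example multiprocessing's Pipe, Queue, or Manager (the last is preferred for broadcasting to multiple processes, but you have to convert tensors to lists). This is not very efficient and is rather cumbersome to implement.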
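The Manager option can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe for DataLoader: ManagerCachedDataset, make_dataset, and _load are hypothetical names, the "expensive load" is a stand-in, and plain lists are stored where a real tensor would first be converted with tensor.tolist(). The class implements only __len__ and __getitem__, which is the protocol a map-style dataset needs.

```python
import multiprocessing as mp


class ManagerCachedDataset:
    """Map-style dataset whose cache lives in a Manager-owned dict."""

    def __init__(self, length, shared_cache):
        self.length = length
        self.cache = shared_cache  # dict proxy; it can be pickled to workers

    def __len__(self):
        return self.length

    def _load(self, index):
        # Stand-in for an expensive load; returns a list, not a tensor.
        return [float(index), float(index) ** 2]

    def __getitem__(self, index):
        if index not in self.cache:
            self.cache[index] = self._load(index)
        return self.cache[index]


def make_dataset(length):
    # The manager must stay alive for as long as the cache is in use.
    manager = mp.Manager()
    return manager, ManagerCachedDataset(length, manager.dict())


def _touch(dataset, index):
    dataset[index]  # runs in a child process and fills the shared cache


if __name__ == "__main__":
    manager, dataset = make_dataset(4)
    worker = mp.Process(target=_touch, args=(dataset, 2))
    worker.start()
    worker.join()
    print(2 in dataset.cache)  # the parent can see the child's entry
    manager.shutdown()
```

Every cache access is a round trip to the manager process, which is the inefficiency mentioned above; the approach trades speed for a single shared copy.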
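There is, however, another approach: memory mapping. Your object is written to virtual memory that every process can access, while a corresponding "shadow copy" is flushed at some point to a file on your hard drive (which can be placed in the /tmp directory). You can do memory mapping with the mmap module, in which case your object must be serialized to binary, or you can use numpy.memmap. You can find more details here.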
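A minimal sketch of the numpy.memmap variant, assuming the cached data fits a fixed shape and dtype known in advance. MemmapDataset and write_cache are hypothetical names introduced for illustration. The mapping is opened lazily so that each worker process maps the file itself instead of receiving a pickled copy of the array.

```python
import os
import tempfile

import numpy as np


class MemmapDataset:
    """Map-style dataset backed by a memory-mapped file."""

    def __init__(self, path, shape, dtype="float32"):
        self.path, self.shape, self.dtype = path, shape, dtype
        self._data = None  # opened on first access, once per process

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, index):
        if self._data is None:
            self._data = np.memmap(
                self.path, dtype=self.dtype, mode="r", shape=self.shape
            )
        # Copy the row out of the mapping; torch.from_numpy could wrap it.
        return np.array(self._data[index])


def write_cache(path, array):
    """Write `array` to `path` once, before any worker process starts."""
    out = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    out[:] = array
    out.flush()
    del out  # release the writable mapping


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "dataset_cache.dat")
    write_cache(path, np.arange(12, dtype="float32").reshape(4, 3))
    dataset = MemmapDataset(path, shape=(4, 3))
    print(len(dataset), dataset[1])
```

Because only the path, shape, and dtype are stored on the object until first access, the same dataset instance should behave the same way under fork and spawn.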

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/60542153/
