python - Unexpected memory footprint differences when spawning a Python multiprocessing pool

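I have been trying to contribute some optimizations to the parallelization in the pystruct module. While explaining in a discussion why I wanted to instantiate pools as early in the execution as possible, keep them around as long as possible, and reuse them, I realized that I know this works best but I don't completely know why.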

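I know the claim on *nix systems is that pool worker subprocesses share all the globals of the parent process copy-on-write. That is definitely the case overall, but I think a caveat should be added: when one of those globals is a particularly dense data structure, such as a numpy or scipy matrix, it appears that whatever references get copied into the worker are actually pretty sizeable even if the whole object is not being copied, so spawning new pools late in the execution can cause memory issues. I have found the best practice is to spawn a pool as early as possible, while any data structures are still small.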

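I have known this for a while and have engineered around it in applications at work, but the best explanation I have is what I posted in the thread here: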

https://github.com/pystruct/pystruct/pull/129#issuecomment-68898032

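Looking at the Python script below, you would expect the free memory at the "pool created" step of the first run and at the "matrix created" step of the second run to be basically equal, as it is at the final "pool terminated" step of both runs. But they never are: there is always more free memory when you create the pool first (unless something else is going on on the machine, of course). The effect grows with the complexity (and size) of the data structures in the global namespace at the time the pool is created (I think). Does anyone have a good explanation for this?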

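I made this little plot with the bash loop and the R script below to illustrate. It shows the overall free memory after both the pool and the matrix have been created, depending on the order: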

[Figure: free memory trend plot, both ways]

pool_memory_test.py:

import numpy as np
import multiprocessing as mp
import logging

def memory():
    """
    Get node total memory and memory usage
    """
    with open('/proc/meminfo', 'r') as mem:
        ret = {}
        tmp = 0
        for i in mem:
            sline = i.split()
            if str(sline[0]) == 'MemTotal:':
                ret['total'] = int(sline[1])
            elif str(sline[0]) in ('MemFree:', 'Buffers:', 'Cached:'):
                tmp += int(sline[1])
        ret['free'] = tmp
        ret['used'] = int(ret['total']) - int(ret['free'])
    return ret

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--pool_first', action='store_true')
    parser.add_argument('--call_map', action='store_true')
    args = parser.parse_args()

    if args.pool_first:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        print(memory()['free'])
    else:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        print(memory()['free'])
    if args.call_map:
        row_sums = p.map(sum, biggish_matrix)
        logging.debug('sum mapped:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        p.terminate()
        p.join()
        logging.debug('pool terminated:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))

pool_memory_test.sh:

#! /bin/bash
rm pool_first_obs.txt > /dev/null 2>&1;
rm matrix_first_obs.txt > /dev/null 2>&1;
for ((n=0;n<100;n++)); do
    python pool_memory_test.py --pool_first >> pool_first_obs.txt;
    python pool_memory_test.py >> matrix_first_obs.txt;
done

pool_memory_test_plot.R:

library(ggplot2)
library(reshape2)
pool_first = as.numeric(readLines('pool_first_obs.txt'))
matrix_first = as.numeric(readLines('matrix_first_obs.txt'))
df = data.frame(i=seq(1,100), pool_first, matrix_first)
ggplot(data=melt(df, id.vars='i'), aes(x=i, y=value, color=variable)) +
    geom_point() + geom_smooth() + xlab('iteration') + 
    ylab('free memory') + ggsave('multiprocessing_pool_memory.png')

EDIT: fixed a small bug in the script caused by an overzealous find/replace, and re-ran it

EDIT2: "-0" slicing? Can you even do that? :)

EDIT3: better Python script, bash loop and visualization; done with this rabbit hole for now :)

Best answer

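Your question touches several loosely coupled mechanisms. It also looks like an easy target for some extra karma points, but then you feel something is off, and 3 hours later it is a completely different question. So, in return for all the fun I had, you may find some of the information below useful.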

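TL;DR: measure used memory, not free memory. For me that gives consistent, (almost) identical results regardless of pool/matrix order and of the size of the big object.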

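def memory():
    """Peak resident set size (ru_maxrss) of this process plus its children."""
    import resource
    # RUSAGE_BOTH is not always available, so query self and children separately
    own = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return own + children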
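
As a rough usage sketch (not part of the original answer), the same measurement can be taken before and after a large allocation. The function is repeated under the hypothetical name peak_rss() so the snippet runs on its own, and the 50 MiB size is an arbitrary choice. Note that ru_maxrss is a peak value, reported in kilobytes on Linux and in bytes on macOS, and that RUSAGE_CHILDREN only covers children that have terminated and been waited for.

```python
import resource


def peak_rss():
    """Peak RSS of this process plus its waited-for children (units are OS-dependent)."""
    own = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return own + children


before = peak_rss()
# Fill 50 MiB with a non-zero byte so the pages are really touched.
block = b"x" * (50 * 1024 * 1024)
after = peak_rss()

# Peak values never decrease, so only growth is meaningful here.
print("peak RSS before:", before, "after:", after, "growth:", after - before)
```

Because this is a high-water mark, it is suited to comparing whole runs (pool first versus matrix first) rather than to tracking memory being released.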

Before answering the questions you didn't ask, but which are closely related, here is some background.

Background

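The most widespread implementation, CPython (both versions 2 and 3), uses reference-counting memory management [1]. Whenever you use a Python object as a value, its reference counter is increased by one, and it is decreased again when the reference is lost. The counter is an integer defined in the C struct that holds the data of each Python object [2]. Takeaway: the reference counter changes all the time, and it is stored along with the rest of the object's data.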
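
A minimal sketch of that behaviour, assuming CPython: sys.getrefcount() reports the counter, and merely storing an existing object in a container changes it. The names payload and refs are made up for illustration, and the absolute numbers vary between interpreter versions, so only the differences are compared.

```python
import sys

payload = [0.0] * 1000            # an ordinary heap-allocated Python object
baseline = sys.getrefcount(payload)

refs = []
refs.append(payload)              # each stored reference bumps the counter
refs.append(payload)
with_refs = sys.getrefcount(payload)

refs.clear()                      # dropping the references lowers it again
after_clear = sys.getrefcount(payload)

print(baseline, with_refs, after_clear)
```

No element of payload was modified, yet the memory holding the object's header was written to twice on the way up and twice on the way down.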

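Most "Unix-inspired" operating systems (the BSD family, Linux, OS X, etc.) support copy-on-write [3] memory access semantics. After fork(), the two processes have distinct memory page tables that point to the same physical pages. The OS has marked those pages as write-protected, so on any memory write the CPU raises a memory access exception, which the OS handles by copying the original page to a new location. It walks and quacks as if each process had isolated memory, but it saves time (on copying) and RAM for as long as parts of the memory stay identical. Takeaway: fork (or mp.Pool) creates new processes, but they use (almost) no extra memory just yet.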
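
The isolation half of that description can be sketched with a bare os.fork() on a POSIX system (mp.Pool is not used here, to keep the example small). The child overwrites a buffer it inherited, and the parent's copy is unaffected. This shows only the semantics; it does not measure how many pages the kernel actually copied.

```python
import os

data = bytearray(b"parent")       # created before fork(), so the child inherits it

read_fd, write_fd = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: writing to the inherited buffer triggers a private page copy.
    os.close(read_fd)
    data[:] = b"child!"
    os.write(write_fd, bytes(data))
    os.close(write_fd)
    os._exit(0)                   # leave immediately, skipping parent-only code

# Parent: read what the child saw, then look at our own buffer.
os.close(write_fd)
child_view = os.read(read_fd, 64)
os.close(read_fd)
os.waitpid(pid, 0)
parent_view = bytes(data)

print(child_view, parent_view)
```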

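CPython stores "small" objects in large pools (arenas) [4]. In the common scenario where you create and destroy a large number of small objects, for example temporary variables inside a function, you don't want to call into the OS memory management too often. Other programming languages (most compiled ones, at least) use the stack for this purpose.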

Source

This question, "python - Unexpected memory footprint differences when spawning a Python multiprocessing pool", comes from Stack Overflow: https://stackoverflow.com/questions/27809586/
