python - How to destroy Python objects and free up memory

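I am trying to iterate over 100,000 images, capture some image features, and store the resulting DataFrames on disk as pickle files.

Unfortunately, due to RAM constraints, I am forced to split the images into chunks of 20,000 and process each chunk before saving its results to disk.

The code below is supposed to save the DataFrame of results for 20,000 images before the loop starts processing the next 20,000 images.

However, this does not seem to solve my problem, because the memory is not released from RAM at the end of the first iteration of the for loop.

So somewhere around the 50,000th record, the program crashes with an out-of-memory error.

I tried deleting the objects after saving them to disk and invoking the garbage collector, but the RAM usage does not seem to go down.

What am I missing?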

#file_list_1 contains 100,000 images
file_list_chunks = list(divide_chunks(file_list_1,20000))
for count,f in enumerate(file_list_chunks):
    # make the Pool of workers
    pool = ThreadPool(64) 
    results = pool.map(get_image_features,f)
    # close the pool and wait for the work to finish 
    list_a, list_b = zip(*results)
    df = pd.DataFrame({'filename':list_a,'image_features':list_b})
    df.to_pickle("PATH_TO_FILE"+str(count)+".pickle")
    del list_a
    del list_b
    del df
    gc.collect()
    pool.close() 
    pool.join()
    print("pool closed")

Best answer

Now, it could be that something around the 50,000th image is very large and that is what causes the OOM, so to test this I'd first try:

# skip the first 30,000 files, then chunk the rest
file_list_chunks = list(divide_chunks(file_list_1[30000:],20000))

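If it fails at 10,000, that suggests a 20k chunk size is too big; if it fails at 50,000 again, there is an issue with the code...

Okay, onto the code...

Firstly, you don't need the explicit list constructor. In Python it is much better to iterate than to build the entire list in memory.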

file_list_chunks = list(divide_chunks(file_list_1,20000))
# becomes
file_list_chunks = divide_chunks(file_list_1,20000)
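
The question never shows `divide_chunks`, so the following is only a guess at what it might look like: a minimal generator sketch that yields one slice at a time, which is what makes dropping the `list(...)` call worthwhile. The sample file names are made up for illustration.

```python
def divide_chunks(items, size):
    """Yield successive slices of `items`, each with at most `size` elements.

    Assumed implementation: the question does not show the real helper.
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical file names, only to make the sketch runnable.
file_list_1 = ["img_%d.png" % i for i in range(50)]

# No list(...) here: each chunk is created only when the loop asks for it.
file_list_chunks = divide_chunks(file_list_1, 20)
for count, f in enumerate(file_list_chunks):
    print(count, len(f))
```

Note that a generator can be consumed only once, so call `divide_chunks` again if you need a second pass.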

I think you might be misusing ThreadPool here. The documentation for close() says:

Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.

This reads as if close might leave some things still running. I guess that is safe, but it feels a little un-Pythonic; it's better to use ThreadPool as a context manager:

with ThreadPool(64) as pool: 
    results = pool.map(get_image_features,f)
    # etc.

Explicit dels in Python aren't actually guaranteed to free memory.
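
To illustrate why, here is a small sketch (the `Features` class is a hypothetical stand-in, not anything from the question): `del` only removes a name, and the object is freed once nothing else references it. In the question's loop, `results` still references the data after `list_a`, `list_b` and `df` are deleted, so those dels alone cannot release it.

```python
import gc
import weakref


class Features:
    """Hypothetical stand-in for a large result object."""

    def __init__(self, payload):
        self.payload = payload


obj = Features(bytearray(10_000_000))
alias = [obj]             # a second reference, like `results` in the question
probe = weakref.ref(obj)  # lets us check whether the object is still alive

del obj                   # removes the name only; `alias` keeps the object alive
still_alive_after_del = probe() is not None

alias.clear()             # drop the last remaining reference
gc.collect()
freed_after_clear = probe() is None

print(still_alive_after_del, freed_after_clear)
```

Even once an object is freed, the interpreter may keep the memory for reuse rather than hand it back to the operating system, so the RAM figure reported by the OS does not necessarily drop.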

You should collect after the join, that is, after the with block:

with ThreadPool(64) as pool:
    ...
    pool.close()  # join() raises ValueError unless close() or terminate() was called first
    pool.join()
gc.collect()

You could also try chunking this into smaller pieces, e.g. 10,000 or even smaller!
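
Putting these suggestions together, the loop might look roughly like the sketch below. It is not the asker's real code: `get_image_features` is a dummy stand-in, `divide_chunks` is an assumed implementation, and plain `pickle.dump` replaces `df.to_pickle` so that the example runs with the standard library only.

```python
import gc
import os
import pickle
import tempfile
from multiprocessing.pool import ThreadPool


def get_image_features(filename):
    # Hypothetical stand-in for the real feature extractor.
    return filename, [len(filename)]


def divide_chunks(items, size):
    # Assumed implementation: yields one slice at a time.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_all(files, chunk_size, out_dir, workers=8):
    """Process files chunk by chunk, writing one pickle per chunk."""
    paths = []
    for count, chunk in enumerate(divide_chunks(files, chunk_size)):
        # map() returns only when every task is done, so leaving the
        # with block afterwards does not cut any work short.
        with ThreadPool(workers) as pool:
            results = pool.map(get_image_features, chunk)
        path = os.path.join(out_dir, "features_%d.pickle" % count)
        with open(path, "wb") as fh:
            pickle.dump(results, fh)
        paths.append(path)
        del results  # drop the last reference to this chunk's data
        gc.collect()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = process_all(["img_%d.png" % i for i in range(25)], 10, tmp)
    print(len(written))
```

The point of the structure is that only one chunk's results are referenced at any time, and `chunk_size` is a single parameter you can lower if 20,000 turns out to be too much.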

Hammer 1

One thing I would consider here, instead of using pandas DataFrames and large lists, is a SQL database. You can do this locally with sqlite3:

import sqlite3
conn = sqlite3.connect(':memory:', check_same_thread=False)  # or, use a file e.g. 'image-features.db'

and use a context manager:

with conn:
    conn.execute('''CREATE TABLE images
                    (filename text, features text)''')

with conn:
    # Insert a row of data
    conn.execute("INSERT INTO images VALUES ('my-image.png','feature1,feature2')")

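That way, we won't have to handle the large list objects or the DataFrame.

You can pass the connection to each of the threads... you might have to do something a little weird like: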

results = pool.map(get_image_features, zip(itertools.repeat(conn), f))

Then, after the calculation is complete, you can select everything from the database into whichever format you like, e.g. using read_sql.
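
As a rough end-to-end sketch of this approach: the code below shares one connection between the worker threads and serializes writes with a lock. The lock, the `store_image_features` wrapper, the dummy `get_image_features` and the sample file names are all my assumptions, not part of the question.

```python
import itertools
import sqlite3
import threading
from multiprocessing.pool import ThreadPool

db_lock = threading.Lock()  # assumption: serialize writes on the shared connection


def get_image_features(filename):
    # Hypothetical stand-in: real code would open the image and compute features.
    return filename, "len=%d" % len(filename)


def store_image_features(args):
    """Compute features for one file and insert them as a single row."""
    conn, filename = args
    name, features = get_image_features(filename)
    with db_lock, conn:  # take the lock, then commit (or roll back) the insert
        conn.execute("INSERT INTO images VALUES (?, ?)", (name, features))


conn = sqlite3.connect(":memory:", check_same_thread=False)
with conn:
    conn.execute("CREATE TABLE images (filename text, features text)")

files = ["a.png", "bb.png", "ccc.png"]  # made-up names for illustration
with ThreadPool(4) as pool:
    pool.map(store_image_features, zip(itertools.repeat(conn), files))

rows = conn.execute(
    "SELECT filename, features FROM images ORDER BY filename"
).fetchall()
print(rows)
```

Each worker writes its row and returns nothing, so no large `results` list builds up. If you want a DataFrame at the end, something like `pd.read_sql("SELECT * FROM images", conn)` should load the table in one go.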

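Hammer 2

Use a subprocess here: rather than running everything in the same Python instance, "shell out" to another one.

Since you can pass the start and end to Python via sys.argv, you can slice on these: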

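# main.py
import subprocess
import sys
# a for loop to iterate over this
subprocess.check_call([sys.executable, "chunk.py", "0", "20000"])

# chunk.py a b
import sys
start, end = int(sys.argv[1]), int(sys.argv[2])
# iterate over the files, since a and b are file indexes
for count, f in enumerate(file_list_1):
    if count < start or count >= end:
        continue  # skip files outside [start, end); a bare `pass` would skip nothing
    # do stuff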

That way, the subprocess will properly clean up Python (there is no way for memory to leak, since the process will be terminated).
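
The "for loop to iterate over this" could be sketched as follows. To keep the example self-contained, the child is an inline `-c` script standing in for chunk.py that just reports how many indexes it was given; `chunk_ranges`, `run_chunks` and `CHILD_CODE` are names I made up.

```python
import subprocess
import sys


def chunk_ranges(total, size):
    """Yield (start, end) index pairs covering range(total) in steps of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


# Stand-in for chunk.py: reads start/end from sys.argv and prints the range size.
CHILD_CODE = "import sys; s, e = int(sys.argv[1]), int(sys.argv[2]); print(e - s)"


def run_chunks(total, size):
    """Run one child interpreter per chunk and add up what each one reports."""
    processed = 0
    for start, end in chunk_ranges(total, size):
        out = subprocess.check_output(
            [sys.executable, "-c", CHILD_CODE, str(start), str(end)]
        )
        processed += int(out)
    return processed


total_processed = run_chunks(100_000, 20_000)
print(total_processed)
```

Each child exits after its chunk, so whatever memory it used goes back to the operating system before the next chunk starts. In real use you would replace the `-c` script with `"chunk.py"` and let each child save its own results to disk.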

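My bet is that Hammer 1 is the way to go. It feels like you are gluing together a lot of data and reading it into Python lists unnecessarily, and using sqlite3 (or some other database) avoids that completely.

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/56126062/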
