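python - Why is reading multiple files at the same time slower than reading sequentially?

I am trying to parse many files found in a directory, but using multiprocessing slows my program down.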

# Calling my parsing function from Client.
# 1000 .txt files found here, combined ~100MB.
L = getParsedFiles('/home/tony/Lab/slicedFiles')

Following this example from the Python documentation:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

I wrote this code:

from multiprocessing import Pool
from api.ttypes import *

import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)

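Following that example, I put the names of all files ending in .txt into a list, then create a pool and map them to my function. I then want to return a list of objects, each holding the parsed data of one file. To my surprise, however, I got the following results: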

#Pool 32  ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)

Machine specs:

62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12

What am I missing here?
Thanks in advance!

Best answer

It looks like you are I/O bound:

In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
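
One way to check this diagnosis is to time the reading and the parsing separately. The sketch below is only an illustration: `time_read_and_parse` is a hypothetical helper, plain tuples stand in for `CoresetPoint`, and it assumes each non-empty line holds whitespace-separated integers as in the question.

```python
import time

def time_read_and_parse(paths):
    """Return (seconds spent reading, seconds spent parsing) for the given files."""
    read_s = 0.0
    parse_s = 0.0
    for path in paths:
        t0 = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()  # I/O: pull the raw bytes into memory
        t1 = time.perf_counter()
        # CPU: decode and parse the integer lines (tuples stand in for CoresetPoint)
        points = [tuple(int(v) for v in line.split())
                  for line in data.decode().splitlines() if line.strip()]
        t2 = time.perf_counter()
        read_s += t1 - t0
        parse_s += t2 - t1
    return read_s, parse_s
```

Comparing the two numbers shows which side dominates. Repeated runs may read faster because the operating system caches recently read files.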

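You may need to have the main thread do the reading and add the data to the pool as subprocesses become available. This would be different from using map.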
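
A minimal sketch of that idea, under some assumptions: the parent process does all the disk reads and only the file contents are sent to the workers. The names `parse_text`, `read_files`, and `parse_directory` are hypothetical, and tuples stand in for `CoresetPoint` so the example is self-contained.

```python
import os
from multiprocessing import Pool

def parse_text(text):
    """Parse whitespace-separated "x y" integer pairs, one pair per line."""
    points = []
    for line in text.splitlines():
        if line.strip():
            x, y = (int(v) for v in line.split())
            points.append((x, y))  # stand-in for CoresetPoint(x, y)
    return points

def read_files(directory):
    """Yield the contents of each .txt file; only the parent touches the disk."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name)) as f:
                yield f.read()

def parse_directory(directory, workers=2):
    with Pool(workers) as pool:
        # imap consumes the generator in the parent process and sends each
        # file's text to a worker; results come back in file order.
        return list(pool.imap(parse_text, read_files(directory)))
```

As with the documentation example, `parse_directory` should be called from under `if __name__ == '__main__':`. Whether this beats plain `map` depends on how much of the time really goes to reading, so it is worth measuring.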

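Since you are processing one line at a time, and the inputs are split up, you can use fileinput to iterate over the lines of multiple files, and map to a function that processes lines instead of files.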
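
For illustration, this is roughly how fileinput presents several files as a single stream of lines. `iter_lines` is a hypothetical helper name, and the `.txt` filter mirrors the one in the question.

```python
import fileinput
import os

def iter_lines(directory):
    """Yield every line of every .txt file in `directory` as one stream."""
    paths = [os.path.join(directory, name)
             for name in sorted(os.listdir(directory))
             if name.endswith(".txt")]
    if not paths:
        return  # with no files, fileinput would fall back to reading stdin
    with fileinput.input(files=paths) as lines:
        for line in lines:
            yield line
```

Each yielded line keeps its trailing newline, which `line.split()` in the parsing function discards anyway.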

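Passing one line at a time might be too slow, so we can pass chunks of lines instead, and adjust the chunk size until we find a sweet spot. Our function parses chunks of lines: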

def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)

And our main function:

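import fileinput
import os
from itertools import islice
from multiprocessing import Pool

def _chunks(lines, size):
    # imap calls the function once per item, so group lines into lists here.
    it = iter(lines)
    return iter(lambda: list(islice(it, size)), [])

def getParsedFiles(directory):
    pool = Pool(2)

    txts = [os.path.join(directory, filename) for filename in os.listdir(directory)
            if filename.endswith(".txt")]

    return pool.imap(_parse_coreset_points, _chunks(fileinput.input(txts), 100))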

Regarding "python - Why is reading multiple files at the same time slower than reading sequentially?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42620323/
