I am generating 100 random integer matrices of size 1000x1000, and I am using the multiprocessing module to compute the eigenvalues of the 100 matrices. The code is as follows:
import timeit
import numpy as np
import multiprocessing as mp

def calEigen():
    S, U = np.linalg.eigh(a)

def multiprocess(processes):
    pool = mp.Pool(processes=processes)
    # Start timing here as I don't want to include time taken to initialize the processes
    start = timeit.default_timer()
    results = [pool.apply_async(calEigen, args=())]
    stop = timeit.default_timer()
    print ("Process:", processes, stop - start)
    results = [p.get() for p in results]
    results.sort() # to sort the results

if __name__ == "__main__":
    global a
    a = []
    for i in range(0, 100):
        a.append(np.random.randint(1, 100, size=(1000,1000)))

    # Print execution time without multiprocessing
    start = timeit.default_timer()
    calEigen()
    stop = timeit.default_timer()
    print (stop - start)

    # With 1 process
    multiprocess(1)
    # With 2 processes
    multiprocess(2)
    # With 3 processes
    multiprocess(3)
    # With 4 processes
    multiprocess(4)
The output is:
0.510247945786
('Process:', 1, 5.1021575927734375e-05)
('Process:', 2, 5.698204040527344e-05)
('Process:', 3, 8.320808410644531e-05)
('Process:', 4, 7.200241088867188e-05)
Another run shows this output:
69.7296020985
('Process:', 1, 0.0009050369262695312)
('Process:', 2, 0.023727893829345703)
('Process:', 3, 0.0003509521484375)
('Process:', 4, 0.057518959045410156)
My questions are:
- Why doesn't the execution time decrease as the number of processes increases? Am I using the multiprocessing module correctly?
- Am I measuring the execution time correctly?
I have edited the code as suggested in the comments below. I want the serial and multiprocessing functions to find the eigenvalues for the same list of 100 matrices. The edited code is:
import numpy as np
import time
from multiprocessing import Pool

a = []
for i in range(0, 100):
    a.append(np.random.randint(1, 100, size=(1000,1000)))

def serial(z):
    result = []
    start_time = time.time()
    for i in range(0, 100):
        result.append(np.linalg.eigh(z[i])) # calculate eigenvalues and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")

def caleigen(c):
    result = []
    result.append(np.linalg.eigh(c)) # calculate eigenvalues and append to result list
    return result

def mp(x, z):
    start_time = time.time()
    with Pool(processes=x) as pool: # start a pool of x workers
        result = pool.map_async(caleigen, z) # distribute work to workers
        result = result.get() # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds")

if __name__ == "__main__":
    serial(a)
    mp(1, a)
    mp(2, a)
    mp(3, a)
    mp(4, a)
The time does not decrease as the number of processes increases. Where am I going wrong? Does multiprocessing divide the list into chunks for the processes, or do I have to do the division myself?
Best answer
You are not using the multiprocessing module correctly. As @dopstar pointed out, you are not dividing your task. There is only one task for the process pool, so no matter how many workers you assign, only one will get the job. As for your second question, I didn't use timeit to measure the processing time precisely; I just used the time module to get a rough sense of how fast things are. It serves the purpose most of the time, though. If I understand what you are trying to do correctly, this should be the single-process version of your code:
import numpy as np
import time

result = []
start_time = time.time()
for i in range(100):
    a = np.random.randint(1, 100, size=(1000,1000)) # generate random matrix
    result.append(np.linalg.eigh(a)) # calculate eigenvalues and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
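If higher-resolution timing is wanted, timeit.default_timer can be dropped in the same way as time.time (a minimal sketch, not part of the original answer, using a smaller matrix so it runs quickly):

```python
import timeit
import numpy as np

# time a single eigh call; default_timer picks the most precise clock available
a = np.random.randint(1, 100, size=(200, 200))
start = timeit.default_timer()
S, U = np.linalg.eigh(a)
stop = timeit.default_timer()
print("one eigh call took", stop - start, "seconds")
```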
The single-process version took 15.27 seconds on my computer. Below is the multiprocess version, which took only 0.46 seconds on my computer. I also included the single-process version for comparison. (The single-process version has to be enclosed in the if block as well, and placed after the multiprocess version.) Because you want to repeat the calculation 100 times, it is much easier to create a pool of workers and let them take on unfinished tasks automatically than to start each process manually and specify what each process should do. In my code, the argument of the caleigen call is merely there to keep track of how many times the task has been executed. Finally, map_async is generally faster than apply_async, at the cost of slightly more memory usage and the restriction that the function call only takes one argument. The reason for using map_async rather than map is that, in this case, the order in which the results are returned does not matter, and map_async is faster than map.
from multiprocessing import Pool
import numpy as np
import time

def caleigen(x): # define work for each worker
    a = np.random.randint(1, 100, size=(1000,1000))
    S, U = np.linalg.eigh(a)
    return S, U

if __name__ == "__main__":
    start_time = time.time()
    with Pool(processes=4) as pool: # start a pool of 4 workers
        result = pool.map_async(caleigen, range(100)) # distribute work to workers
        result = result.get() # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds")

    # Run the single process version for comparison. This has to be within the if block as well.
    result = []
    start_time = time.time()
    for i in range(100):
        a = np.random.randint(1, 100, size=(1000,1000)) # generate random matrix
        result.append(np.linalg.eigh(a)) # calculate eigenvalues and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")
Regarding python - multiprocessing for computing eigenvalues, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33462264/