python - How can I use multiprocessing to iterate over list data faster?

I am trying to determine how long a group of employees worked during their shifts. The data is given to me as a CSV file.

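I populate a matrix with this data and iterate over it with a while loop, applying the necessary conditions (for example, deducting 30 minutes for lunch). The results go into a new list, which is used to build an Excel worksheet.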

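My script does what it is supposed to, but it takes a very long time when it has to loop through a lot of data (about 26 000 rows). My idea is to use multiprocessing to run the following three loops in parallel: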

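Convert the times from hh:mm:ss to minutes.
Loop through and apply the conditions.
Round the values and convert them back to hours, so that this is not done inside the big while loop.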

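Is this a good idea? If so, how do I get the loops to run in parallel when one loop needs data produced by the previous one? My first thought was to use a time function to add a delay, but then I worry that my loops might "catch up" with each other and fail because the list index being accessed does not exist yet.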

Any input from someone more experienced would be great, thanks!

My script:

import pandas as pd

# Function: To round down the time to the next lowest ten minutes --> 77 = 70 ; 32 = 30:

def floor_time(n, decimals=0):

    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier

# Function: Get data from excel spreadsheet:

def get_data():

    df = pd.read_csv('/Users/Chadd/Desktop/dd.csv', sep = ',')
    list_of_rows = [list(row) for row in df.values]
    data = []
    i = 0
    while i < len(list_of_rows):
        data.append(list_of_rows[i][0].split(';'))
        data[i].pop()
        i += 1
    return data

# Function: Convert time index in data to 24 hour scale:

def get_time(time_data):

    return int(time_data.split(':')[0])*60 + int(time_data.split(':')[1])

# Function: Loop through data in CSV applying conditionals:

def get_time_worked():

    i = 0 # Looping through entry data
    j = 1 # Looping through departure data
    list_of_times = []

    while j < len(get_data()):

        start_time = get_time(get_data()[i][3])
        end_time = get_time(get_data()[j][3])

        # Morning shift - start time < end time
        if start_time < end_time:
            time_worked = end_time - start_time # end time - start time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 6*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 06:00:00:
            if start_time < 6*60: # Early
                time_worked = end_time - 6*60

        # Afternoon shift - start time > end time
        elif start_time > end_time:
            time_worked = 24*60 - start_time + end_time # 24*60 - start time + end time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 18*60: # Late
                time_worked = time_worked - 15
            # Need to set the start time to 18:00:00:
            if start_time < 18*60: # Early
                time_worked = 24*60 - 18*60 + end_time

        # If time worked exceeds 5 hours, deduct 30 minutes for lunch:
        if time_worked >= 5*60:
            time_worked = time_worked - 30

        # Set max time worked to 11.5 hours:
        if time_worked > 11.5*60:
            time_worked = 11.5*60

        list_of_times.append([get_data()[i][1], get_data()[i][2], round(floor_time(time_worked, decimals = -1)/60, 2)])

        i += 2
        j += 2

    return list_of_times

# Save the data into Excel worksheet:

def save_data():

    file_heading = '{} to {}'.format(get_data()[0][2], get_data()[len(get_data())-1][2])
    file_heading_2 = file_heading.replace('/', '_')

    df = pd.DataFrame(get_time_worked())
    writer = pd.ExcelWriter('/Users/Chadd/Desktop/{}.xlsx'.format(file_heading_2), engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Hours Worked', index=False)
    writer.close()  # writes the file; ExcelWriter.save() was removed in pandas 2.0

save_data()

Best answer

You could look at multiprocessing.Pool, which lets you run a function many times with different input values. From the docs:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))

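You then need to split your data into chunks (in place of the [1, 2, 3] in the example).
However, my personal preference would be to spend the time learning something that is distributed by default, such as Spark and pyspark. It will help you a lot in the long run.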
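
The chunking step mentioned above could look roughly like the sketch below. It is only an illustration, not the asker's full logic: the helper names (to_minutes, shift_minutes, make_chunks, process_chunk) and the sample data are made up for this example, and the late, early, lunch and cap rules from the question are left out. It assumes each entry/departure pair can be processed independently of the others, which is what lets the chunks run in parallel without one loop having to wait for another.

```python
from multiprocessing import Pool

def to_minutes(hhmmss):
    """Convert an 'hh:mm:ss' string to minutes since midnight (seconds are dropped)."""
    hours, minutes = hhmmss.split(':')[:2]
    return int(hours) * 60 + int(minutes)

def shift_minutes(pair):
    """Raw shift length in minutes for one (entry, departure) pair of time strings."""
    start, end = to_minutes(pair[0]), to_minutes(pair[1])
    if end < start:  # the shift runs past midnight
        end += 24 * 60
    return end - start

def make_chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[k:k + size] for k in range(0, len(items), size)]

def process_chunk(chunk):
    """Worker: handle a whole chunk so each task does a meaningful amount of work."""
    return [shift_minutes(pair) for pair in chunk]

if __name__ == '__main__':
    # Hypothetical sample data standing in for rows that were read from the CSV once.
    pairs = [('06:00:00', '14:30:00'), ('18:00:00', '02:15:00'), ('05:45:00', '12:00:00')]
    with Pool(2) as p:
        results = p.map(process_chunk, make_chunks(pairs, 2))
    # Flatten the per-chunk results back into one list; map keeps the input order.
    print([m for chunk in results for m in chunk])
```

With the sample data this prints [510, 495, 375]. For only a few tens of thousands of rows, the cost of starting worker processes may outweigh the gain, so it is worth timing both versions before committing to one.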

Regarding python - How can I use multiprocessing to iterate over list data faster?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61139022/
