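I am trying to work out how long a group of employees worked during their shifts. The data is given to me as a CSV file.
I populate a matrix with this data and iterate over it with a while loop, applying the necessary conditions (for example, deducting 30 minutes for lunch). The results go into a new list, which is used to create an Excel worksheet.
My script does what it is supposed to do, but it takes a long time when it has to loop through a lot of data (about 26 000 rows). My idea is to use multiprocessing to run the following three loops in parallel: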
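- Convert the times from hh:mm:ss to minutes.
- Loop through the data and apply the conditions.
- Round the values and convert them back to hours, so that this is not done inside the big while loop.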
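Is this a good idea? If so, how do I get the loops to run in parallel when one loop needs data produced by the previous one? My first thought was to use a time function to add a delay, but then I worried that my loops might "catch" each other and fail with an error saying that the list index being accessed does not exist.
Any more experienced opinions would be great, thanks!
My script: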
import pandas as pd

# Function: To round down the time to the next lowest ten minutes --> 77 = 70 ; 32 = 30:
def floor_time(n, decimals=0):
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier

# Function: Get data from the CSV file:
def get_data():
    df = pd.read_csv('/Users/Chadd/Desktop/dd.csv', sep=',')
    list_of_rows = [list(row) for row in df.values]
    data = []
    i = 0
    while i < len(list_of_rows):
        # Each row holds one ';'-separated string; split it and drop the last field
        data.append(list_of_rows[i][0].split(';'))
        data[i].pop()
        i += 1
    return data

# Function: Convert an hh:mm:ss time string to minutes since midnight:
def get_time(time_data):
    return int(time_data.split(':')[0])*60 + int(time_data.split(':')[1])

# Function: Loop through data in CSV applying conditionals:
def get_time_worked():
    i = 0  # Looping through entry data
    j = 1  # Looping through departure data
    list_of_times = []
    while j < len(get_data()):
        start_time = get_time(get_data()[i][3])
        end_time = get_time(get_data()[j][3])
        # Morning shift - start time < end time
        if start_time < end_time:
            time_worked = end_time - start_time  # end time - start time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 6*60:  # Late
                time_worked = time_worked - 15
            # Need to set the start time to 06:00:00:
            if start_time < 6*60:  # Early
                time_worked = end_time - 6*60
        # Afternoon shift - start time > end time
        elif start_time > end_time:
            time_worked = 24*60 - start_time + end_time  # 24*60 - start time + end time (minutes)
            # Need to deduct 15 minutes if late:
            if start_time > 18*60:  # Late
                time_worked = time_worked - 15
            # Need to set the start time to 18:00:00:
            if start_time < 18*60:  # Early
                time_worked = 24*60 - 18*60 + end_time
        # If time worked is 5 hours or more, deduct 30 minutes for lunch:
        if time_worked >= 5*60:
            time_worked = time_worked - 30
        # Set max time worked to 11.5 hours:
        if time_worked > 11.5*60:
            time_worked = 11.5*60
        list_of_times.append([get_data()[i][1], get_data()[i][2], round(floor_time(time_worked, decimals=-1)/60, 2)])
        i += 2
        j += 2
    return list_of_times

# Save the data into Excel worksheet:
def save_data():
    file_heading = '{} to {}'.format(get_data()[0][2], get_data()[len(get_data())-1][2])
    file_heading_2 = file_heading.replace('/', '_')
    df = pd.DataFrame(get_time_worked())
    writer = pd.ExcelWriter('/Users/Chadd/Desktop/{}.xlsx'.format(file_heading_2), engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Hours Worked', index=False)
    writer.close()  # writer.save() was removed in pandas 2.0

save_data()
Best answer
You can look at multiprocessing.Pool, which lets you run a function multiple times with different input values. From the docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
You would then need to split your data into chunks (in place of the [1, 2, 3] in the example).
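The chunking step could look roughly like the sketch below. It is only an illustration and not the asker's full logic: it assumes the rows have already been read once into a list of (entry, departure) time strings, and the names to_minutes, process_chunk, split_into_chunks and time_worked_parallel are hypothetical helpers invented for this example. The worker computes only the raw shift length in minutes, as a simplified stand-in for the rules in get_time_worked().

```python
from multiprocessing import Pool


def to_minutes(hhmmss):
    """Convert 'hh:mm:ss' to minutes since midnight (seconds are ignored)."""
    hours, minutes = hhmmss.split(':')[:2]
    return int(hours) * 60 + int(minutes)


def process_chunk(chunk):
    """Worker: raw shift length in minutes for each (entry, departure) pair.

    Hypothetical, simplified stand-in for the rules in get_time_worked().
    """
    durations = []
    for start, end in chunk:
        minutes = to_minutes(end) - to_minutes(start)
        if minutes < 0:  # the shift crosses midnight
            minutes += 24 * 60
        durations.append(minutes)
    return durations


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def time_worked_parallel(pairs, n_workers=4):
    """Process the chunks in worker processes and flatten the results in order."""
    chunks = split_into_chunks(pairs, n_workers)
    with Pool(n_workers) as pool:
        per_chunk = pool.map(process_chunk, chunks)
    return [minutes for part in per_chunk for minutes in part]


if __name__ == '__main__':
    # Made-up clock times: entry, departure, entry, departure, ...
    times = ['06:00:00', '14:30:00', '18:10:00', '05:55:00']
    pairs = list(zip(times[0::2], times[1::2]))
    print(time_worked_parallel(pairs, n_workers=2))
```

Pairing the entry and departure rows before chunking keeps a pair from being split across two workers, and Pool.map returns the results in the same order as the chunks. For a list of about 26 000 rows, the cost of starting processes and passing data between them may outweigh any gain, so it is worth timing both versions.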
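However, my personal preference would be to spend the time learning something that is distributed by default, such as Spark and pyspark. That will help you a lot in the long run.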
Source: "python - How to use multiprocessing to iterate over list data faster?" on Stack Overflow: https://stackoverflow.com/questions/61139022/