Python: PyCharm runtime

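I have been seeing some strange runtime behaviour in PyCharm, explained below. The code runs on a machine with 20 cores and 256 GB of RAM, and plenty of memory is available. I am not showing any of the actual functions because this is a fairly large project, but I am more than happy to add details on request.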

In short, I have a project of .py files with the following structure:

import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                  'arrival_time': result_list[1::4],
                  'tour_id': result_list[2::4],
                  'trip_id': result_list[3::4]})

if __name__ == '__main__':

    # Run the serial code
    st = starttimes.StartTimesCreate(prng)
    temp_df, two_trips_df, time_dist_arr = st.run()

    # Prepare the dataframe to sample start times. Create groups from the input dataframe
    temp_df1 = st.prepare_two_trips_more_df(temp_df, two_trips_df)
    validation.logger.info("Dataframe prepared for multiprocessing")

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):  ### problem lies here in runtimes
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" %len(grp_list))

################ PARALLEL CODE BELOW #################

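The for loop runs over a dataframe with 10.5 million rows and 18 columns. In its current form, building the list of groups (2.8 million groups) takes about 25 minutes. Once created, the groups are fed to a multiprocessing pool, the code for which is not shown.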
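To narrow this down, the grouping step can be timed on its own and the column dtypes printed alongside. The following is only a minimal sketch: the small synthetic frame is a hypothetical stand-in for temp_df1, and the list comprehension is just a compact equivalent of the for loop with grp_list.append(group).

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for temp_df1; the real frame has 10.5M rows and 18 columns.
rng = np.random.RandomState(123)
df = pd.DataFrame({
    "tour_id": rng.randint(0, 2000, size=10000),
    "start_time": rng.randint(0, 1440, size=10000),
})

t0 = time.perf_counter()
# Equivalent to the explicit for loop that appends each group to grp_list.
grp_list = [group for _, group in df.groupby("tour_id")]
elapsed = time.perf_counter() - t0

# The dtype of the grouping key is worth checking in both variants of the script.
print(df.dtypes)
print("%d groups in %.3f s" % (len(grp_list), elapsed))
```

Running the same measurement in both scripts would show whether the two versions of temp_df1 differ in anything other than how they were produced.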

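Twenty-five minutes is quite long, because I also did the following test run, which takes only 7 minutes. Essentially, I saved temp_df1 to a CSV file, then read the pre-saved file back in and ran the same for loop as before.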

import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                  'arrival_time': result_list[1::4],
                  'tour_id': result_list[2::4],
                  'trip_id': result_list[3::4]})

if __name__ == '__main__':

    # Run the serial code
    st = starttimes.StartTimesCreate(prng)

    temp_df1 = pd.read_csv(r"c:\\...\\temp_df1.csv")
    time_dist = pd.read_csv(r"c:\\...\\start_time_distribution_treso_1.csv")
    time_dist_arr = np.array(time_dist.to_records())

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" %len(grp_list))

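Question: So what causes the code to run about 3 times faster when I simply read the file in than when the dataframe is created by a function further upstream?

Thanks in advance, and please let me know how I can clarify further.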

Best answer

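I am answering my own question because I stumbled upon the answer while doing a lot of testing and, fortunately, when I googled for a solution, someone else had run into the same issue. The link above explains why using categorical columns in a groupby operation is not a good idea, so I will not repeat it here. Thanks.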
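The linked explanation is not reproduced here, so the following is only a sketch of how the two runs can differ. It assumes the upstream code left tour_id as a categorical column, which the question does not show. A CSV file stores no dtype information, so a categorical column comes back from pd.read_csv as a plain column, and the groupby in the test run would then work on a different dtype. The frame and its values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

rng = np.random.RandomState(123)
plain = pd.DataFrame({
    "tour_id": rng.randint(0, 500, size=5000),
    "trip_id": np.arange(5000),
})

# Assumption: the upstream function produced a categorical grouping key.
cat = plain.assign(tour_id=plain["tour_id"].astype("category"))

# A CSV round trip discards the categorical dtype.
buf = io.StringIO()
cat.to_csv(buf, index=False)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(cat["tour_id"].dtype, "->", round_trip["tour_id"].dtype)

# Possible workaround 1: convert the key to plain integers before grouping.
as_int = cat.assign(tour_id=cat["tour_id"].astype("int64"))
groups_int = [g for _, g in as_int.groupby("tour_id")]

# Possible workaround 2: keep the categorical key, but restrict grouping
# to category values that actually appear in the data.
groups_obs = [g for _, g in cat.groupby("tour_id", observed=True)]
```

Whether either workaround recovers the 7-minute runtime on the real data is not something this sketch can show; timing both on the actual temp_df1 would be the way to check.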

Source: a similar question, "Python: PyCharm runtime", on Stack Overflow: https://stackoverflow.com/questions/51129600/
