Can you tell me how to optimize this code? The dataset is large, so it takes tens of minutes to finish...
df['sinistre'] = 0
for index_sin, row_sin in sinistre1.iterrows():
    date_surv = row_sin['DATESURV']
    quit_sin = df.loc[df['id_police'] == row_sin['id_police']]
    for index, row in quit_sin.iterrows():
        if row['DATEEFFE'] < date_surv < row['DATE_FIN']:
            df.loc[index, 'sinistre'] = 1  # .loc avoids chained assignment
Here are sample datasets for the DataFrames sinistre1 and df:
>>> sinistre1
   id_police  id_sinistre    DATESURV
0       p123         s123  30/05/2017
1       p123         s124  30/11/2017
2       p123         s125  29/02/2018
3       b123         s126  28/02/2018
4       b123         s127  30/05/2018
>>> df
   id_police    DATEEFFE    DATE_FIN  prime  prime2
0       p123  24/01/2017  24/02/2017      0       0
1       p123  24/11/2017  24/12/2017      0      30
2       p123  25/02/2018  25/03/2018     10      10
3       b123  24/02/2018  24/03/2018     20      20
4       b123  24/03/2018  24/04/2018     30       0
Here is the expected output (the idea is that when a DATESURV in sinistre1 falls within the interval between DATEEFFE and DATE_FIN, I set the sinistre flag to 1):
   id_police    DATEEFFE    DATE_FIN  prime  prime2  sinistre
0       p123  24/01/2017  24/02/2017      0       0         0
1       p123  24/11/2017  24/12/2017      0      30         1
2       p123  25/02/2018  25/03/2018     10      10         1
3       b123  24/02/2018  24/03/2018     20      20         1
4       b123  24/03/2018  24/04/2018     30       0         0
If I can't avoid the for loop, then please show a better, faster way to loop... Thanks in advance!
Best answer
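As I stated in the comments: the accepted answer and the merge do not make sense as they stand, because I think the OP wants to compare each row across the two DataFrames, which would also require the key id_sinistre in the DataFrame df. Alternatively, you may want to use combine_first as follows: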
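import numpy as np
import pandas as pd
# Pair each policy row in df with the claims that share its id_police
df_merge = df.merge(sinistre1, on='id_police', how='left')
# Sample dates are day-first (dd/mm/yyyy); DATEEFFE and DATE_FIN are assumed to be datetime already
df_merge['DATESURV'] = pd.to_datetime(df_merge['DATESURV'], dayfirst=True)
# Flag rows whose claim date lies within the policy period (between() includes both bounds)
df_merge['sinistre'] = np.where(df_merge['DATESURV'].between(df_merge['DATEEFFE'], df_merge['DATE_FIN']), 1, 0)
df_merge = df_merge.drop(['DATESURV', 'id_sinistre'], axis=1)
print(df_merge)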
     DATEEFFE    DATE_FIN  id_police  prime  prime2  sinistre
0  2017-01-24  2017-02-24       p123      0       0         0
1  2017-11-24  2017-12-24       p123      0      30         1
2  2018-02-25  2018-03-25       p123     10      10         1
3  2018-02-24  2018-03-24       b123     20      20         1
4  2018-03-24  2018-04-24       b123     30       0         0
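One thing to watch with a merge on id_police alone: it yields one row per (policy period, claim) pair, so the merged frame generally has more rows than df. Below is a minimal sketch of one way to collapse those pairs back to a single flag per row of df, keeping the strict inequalities of the original loop. It is not the answer's own method. The names pairs, hit and flagged are made up for illustration, and the sample data is adapted from the question, with the impossible date 29/02/2018 swapped for 28/02/2018 (an assumption) so that parsing succeeds.

```python
import pandas as pd

# Illustrative data modelled on the question's sample; 29/02/2018 does not
# exist (2018 is not a leap year), so 28/02/2018 is used here instead.
sinistre1 = pd.DataFrame({
    "id_police": ["p123", "p123", "p123", "b123", "b123"],
    "id_sinistre": ["s123", "s124", "s125", "s126", "s127"],
    "DATESURV": ["30/05/2017", "30/11/2017", "28/02/2018", "28/02/2018", "30/05/2018"],
})
df = pd.DataFrame({
    "id_police": ["p123", "p123", "p123", "b123", "b123"],
    "DATEEFFE": ["24/01/2017", "24/11/2017", "25/02/2018", "24/02/2018", "24/03/2018"],
    "DATE_FIN": ["24/02/2017", "24/12/2017", "25/03/2018", "24/03/2018", "24/04/2018"],
    "prime": [0, 0, 10, 20, 30],
    "prime2": [0, 30, 10, 20, 0],
})

# Parse the day-first date strings once, up front.
sinistre1["DATESURV"] = pd.to_datetime(sinistre1["DATESURV"], format="%d/%m/%Y")
for col in ("DATEEFFE", "DATE_FIN"):
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# reset_index() keeps df's row label in a column named "index", so every
# merged (policy period, claim) pair can be traced back to its df row.
pairs = df.reset_index().merge(
    sinistre1[["id_police", "DATESURV"]], on="id_police", how="left"
)

# Strict inequalities, as in the original loop. Policies without any claim
# get NaT for DATESURV, and comparisons against NaT evaluate to False.
hit = (pairs["DATEEFFE"] < pairs["DATESURV"]) & (pairs["DATESURV"] < pairs["DATE_FIN"])

# A df row is flagged if at least one of its claims falls inside its period.
flagged = hit.groupby(pairs["index"]).any()
df["sinistre"] = flagged.reindex(df.index, fill_value=False).astype(int)

print(df)
```

The merge is many-to-many within each id_police, so the intermediate frame can become large when a policy has many periods and many claims; in that case it may be worth processing the policies in chunks. If inclusive bounds are acceptable, Series.between can replace the two comparisons.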
For more on "python - How to avoid loops in a Pandas DataFrame for large datasets", see the similar question on Stack Overflow: https://stackoverflow.com/questions/55668500/