Python - Fastest way to populate a dataframe using a condition based on the index of another dataframe

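I have data in an input dataframe (input_df). Based on the index of another, benchmark dataframe (bm_df), I want to create a third dataframe (output_df) that is populated according to a condition on the indices of the original two dataframes.

For each date in the bm_df index, I want to populate my output with the most recent data available in input_df, provided that the data's index date is earlier than or equal to the index date in bm_df. For example, in the case study below, the output for the first index date (2019-01-21) is populated with the input_df data point for 2019-01-21. If no data point existed for 2019-01-21, the one for 2019-01-18 would be used instead.

The use case is mapping and back-filling large datasets to get the most recent available data for a given date. I have written some Python that does this (and it works), but I think there may be a more Pythonic, and therefore faster, way to implement it. The underlying dataset I apply this to is large in both the number of columns and the column length, so I want something as efficient as possible; my current solution is too slow when run on the full dataset.

Any help is much appreciated!

input_df: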

index   data
2019-01-21  0.008
2019-01-18  0.016
2019-01-17  0.006
2019-01-16  0.01
2019-01-15  0.013
2019-01-14  0.017
2019-01-11  0.017
2019-01-10  0.024
2019-01-09  0.032
2019-01-08  0.012

bm_df:

index   
2019-01-21  
2019-01-14  
2019-01-07

output_df:

index   data
2019-01-21  0.008
2019-01-14  0.017
2019-01-07  NaN

Please see the code I am currently using below:

import pandas as pd
import numpy as np

# Import datasets
test_index = ['2019-01-21','2019-01-18','2019-01-17','2019-01-16','2019-01-15','2019-01-14','2019-01-11','2019-01-10','2019-01-09','2019-01-08']    
test_data = [0.008, 0.016,0.006,0.01,0.013,0.017,0.017,0.024,0.032,0.012]
input_df= pd.DataFrame(test_data,columns=['data'], index=test_index)

test_index_2= ['2019-01-21','2019-01-14','2019-01-07']  
bm_df= pd.DataFrame(index=test_index_2)

#Preallocate
data_mat= np.zeros([len(bm_df)])

#Loop over the bm_df index and find the most recent value in input_df dated on or before each index date
for i in range(len(bm_df)):
    #First check to see if there are no dates before the selected date, if true fill with NaN
    if sum(input_df.index <= bm_df.index[i])>0:
        data_mat[i] = input_df['data'][max(input_df.index[input_df.index <= bm_df.index[i]])]
    else:
        data_mat[i] = float('NaN')

output_df= pd.DataFrame(data_mat,columns=['data'],index=bm_df.index)

Best answer

I haven't tested the execution time, but I would rely on join, which the pandas documentation describes as efficient:

... Efficiently join multiple DataFrame objects by index at once...

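I would use shift to get the value for the latest date strictly before the searched date. Because input_df is sorted in descending date order, shift(-1) moves each value up one row, so every date ends up holding the value of the previous available date.

All of which gives: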

output_df = bm_df.join(input_df.shift(-1), how='left')

             data
2019-01-21  0.016
2019-01-14  0.017
2019-01-07    NaN
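
The shifted join above can only match benchmark dates that also appear in input_df. As a rough sketch of the same strictly-earlier lookup without that restriction, one could use NumPy's searchsorted on a sorted copy of the input. This is not part of the original answer: the names sorted_df, pos, and strict_df are made up for illustration, and converting the string dates to datetimes is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the index parsed as datetimes (assumption)
input_df = pd.DataFrame(
    {"data": [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]},
    index=pd.to_datetime([
        "2019-01-21", "2019-01-18", "2019-01-17", "2019-01-16", "2019-01-15",
        "2019-01-14", "2019-01-11", "2019-01-10", "2019-01-09", "2019-01-08",
    ]),
)
# Includes 2019-01-13, which does not exist in input_df
bm_index = pd.to_datetime(["2019-01-21", "2019-01-14", "2019-01-13", "2019-01-07"])

# searchsorted requires ascending order
sorted_df = input_df.sort_index()

# side="left" counts the input dates strictly earlier than each benchmark date
pos = np.searchsorted(sorted_df.index.to_numpy(), bm_index.to_numpy(), side="left")

# Start from NaN and fill only where at least one earlier date exists
values = np.full(len(bm_index), np.nan)
has_prior = pos > 0
values[has_prior] = sorted_df["data"].to_numpy()[pos[has_prior] - 1]

strict_df = pd.DataFrame({"data": values}, index=bm_index)
print(strict_df)
```

Switching to side="right" would turn this into the earlier-than-or-equal lookup that the question describes.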

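This approach is admittedly far less general than the explicit loop. That is the price of vectorization in pandas. For example, for a less-than-or-equal condition the code is slightly different. Here is an example with an additional date in bm_df that is not present in input_df: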

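...
test_index_2= ['2019-01-21','2019-01-14','2019-01-13','2019-01-07']  
...
# An outer join is needed so that bm_df dates missing from input_df get a row;
# it also sorts the index in ascending order
tmp_df = input_df.join(bm_df, how='outer').bfill()
# Keep only the bm_df dates, in bm_df order
output_df = bm_df.join(tmp_df, how='inner')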

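This gives the following. Note that 2019-01-07 picks up 0.012 from the later date 2019-01-08 because of the back-fill, whereas the question asks for NaN in that row: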

             data
2019-01-21  0.008
2019-01-14  0.017
2019-01-13  0.017
2019-01-07  0.012
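
If the goal is exactly the question's rule (the most recent input date earlier than or equal to each benchmark date, and NaN when there is none), one possible alternative is to forward-fill through reindex on an ascending index. This is a minimal sketch and not the original answer's method; it assumes the string dates are converted to datetimes, and asof_df is a hypothetical name.

```python
import pandas as pd

# Sample data from the question, with the index parsed as datetimes (assumption)
input_df = pd.DataFrame(
    {"data": [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]},
    index=pd.to_datetime([
        "2019-01-21", "2019-01-18", "2019-01-17", "2019-01-16", "2019-01-15",
        "2019-01-14", "2019-01-11", "2019-01-10", "2019-01-09", "2019-01-08",
    ]),
)
bm_index = pd.to_datetime(["2019-01-21", "2019-01-14", "2019-01-13", "2019-01-07"])

# reindex with a fill method needs a monotonic index, so sort ascending first.
# method="ffill" takes, for each benchmark date, the last row dated on or before it.
asof_df = input_df.sort_index().reindex(bm_index, method="ffill")
print(asof_df)
```

With the sample data this returns 0.008, 0.017, and 0.017 for the first three dates and NaN for 2019-01-07, since no input row is dated on or before it. The reindex applies to every column at once, which may help with wide datasets, though it has not been timed here.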

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/55088758/
