我正在使用 pandas 数据框,并将趋势应用于最早的数据点,以尝试尽可能地填充缺失的历史数据。我知道遍历 pandas 数据框是错误的,但我还没有找到替代方法来执行此操作,因为新值取决于下一个值。如果有人知道实现此目标的更好方法,那就太好了!
例子 df:
Week no Data Trend
0 1 0.0 1.5
1 2 0.0 1.5
2 3 0.0 1.0
3 4 0.0 0.5
4 5 10.0 0.6
我目前使用的代码:
for wk in range(len(df)-1, 0, -1):
if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
and not math.isnan(df.loc[wk, 'Trend'])):
df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
*df.loc[wk, 'Trend'])
结果:
Week no Data Trend
0 1 4.5 1.5
1 2 3.0 1.5
2 3 3.0 1.0
3 4 6.0 0.5
4 5 10.0 0.6
最佳答案
递归计算不可向量化,使用提高性能 numba :
from numba import jit
@jit(nopython=True)
def f(a, b):
for i in range(a.shape[0]-1, 0, -1):
if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
a[i-1] = a[i] * b[i]
return a
df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
print (df)
Week no Data Trend
0 1 4.5 1.5
1 2 3.0 1.5
2 3 3.0 1.0
3 4 6.0 0.5
4 5 10.0 0.6
第一次测试没有缺失值,如样本中的数据:
df = pd.concat([df] * 40, ignore_index=True)
print (df)
Week no Data Trend
0 0 1 4.5 1.5
1 1 2 3.0 1.5
2 2 3 3.0 1.0
3 3 4 6.0 0.5
4 4 5 10.0 0.6
.. ... .. ... ...
195 0 1 4.5 1.5
196 1 2 3.0 1.5
197 2 3 3.0 1.0
198 3 4 6.0 0.5
199 4 5 10.0 0.6
[200 rows x 4 columns]
In [114]: %%timeit
...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
...:
...:
121 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df = pd.concat([df] * 40, ignore_index=True)
print (df.shape)
(200, 4)
In [115]: %%timeit
...: for wk in range(len(df)-1, 0, -1):
...: if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
...: and not math.isnan(df.loc[wk, 'Trend'])):
...: df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
...: *df.loc[wk, 'Trend'])
...:
3.3 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
我用 2 * 40 个缺失值进行测试,性能相似:
print (df)
Week no Data Trend
0 0 1 0.0 NaN
1 1 2 0.0 NaN
2 2 3 0.0 1.0
3 3 4 0.0 0.5
4 4 5 10.0 0.6
df = pd.concat([df] * 40, ignore_index=True)
print (df.shape)
(200, 4)
In [117]: %%timeit
...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
...:
119 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df = pd.concat([df] * 40, ignore_index=True)
print (df.shape)
(200, 4)
In [121]: %%timeit
...: for wk in range(len(df)-1, 0, -1):
...: if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
...: and not math.isnan(df.loc[wk, 'Trend'])):
...: df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
...: *df.loc[wk, 'Trend'])
...:
3.12 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
关于python - 根据下一个值更新 Pandas 数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/64533816/