我正在尝试计算 pandas 数据框中时间序列的索引值。该索引取决于前一行的结果来计算第一次迭代后的每一行。我尝试在数据帧行的迭代中递归地执行此操作,但我发现计算的前两行是正确的,但第三行和后续行不准确。
我认为这是因为在初始值之后,后续的索引计算出错,然后将所有其他后续计算设置为错误。
是什么导致了这种不准确。有没有比我所采取的方法更好的方法?
输出示例如下所示:
ticket_cat Sector Year factor Incorrect_index_value correct_index_value prev_row
Revenue LSE Jan 2004 100.00 100.00
Revenue LSE Jan 2005 4.323542894 104.3235 104.3235 100.00
Revenue LSE Jan 2006 3.096308080 98.823 107.5537 <--incorrect row
Revenue LSE Jan 2007 6.211666 107.476 114.2345 <--incorrect row
Revenue LD Jan 2004 100.00 100.0000
Revenue LD Jan 2005 3.5218 103.5218 103.5218
Revenue LD Jan 2006 2.7417 99.2464 106.3602 <--- incorrect row
Revenue LD Jan 2007 3.3506 104.1353 109.9239 <--- incorrect row
我的代码片段如下:stpassrev是数据帧
#insert initial value for index
stpassrev['index_value'] = np.where(
(stpassrev['Year'] == 'Jan 2004' ) & (stpassrev['Ticket_cat']=='Revenue'),
100.00,np.nan )
#set up initial values for prec_row column
stpassrev['prev_row'] = np.where(
#only have relevant row impacted
(stpassrev['Year'] == 'Jan 2005' ) & (stpassrev['Ticke_cat']=='Revenue'),
100.00,
np.nan
)
#calculate the index_value
for i in range(1,len(stpassrev)):
stpassrev.loc[i,'passrev'] = np.where(
(stpassrev.loc[i,'Ticket_cat']=='Revenue' ) & (pd.isna(stpassrev.loc[i,'factor'])==False),
((100+stpassrev.loc[i,'factor'] ) /stpassrev.loc[i-1,'index_value'])*100,
stpassrev.loc[i,'index_value'])
stpassrev.loc[i,'prev_row'] = stpassrev.loc[i-1,'index_value']
最佳答案
根据您更新的问题,您只需执行以下操作:
# assign a new temp_factor with initial values and prep for cumprod
stpassrev['temp_factor'] = np.where(stpassrev['factor'].isna(), 1, stpassrev['factor'].add(100).div(100))
# calculate the cumprod based on the temp_factor (grouped by Sector) and multiply by 100 for index_value
stpassrev['index_value'] = stpassrev.groupby('Sector')['temp_factor'].cumprod().mul(100)
结果:
ticket_cat Sector Year factor temp_factor index_value
0 Revenue LSE Jan 2004 NaN 1.000000 100.000000
1 Revenue LSE Jan 2005 4.323543 1.043235 104.323543
2 Revenue LSE Jan 2006 3.096308 1.030963 107.553721
3 Revenue LSE Jan 2007 6.211666 1.062117 114.234599
4 Revenue LD Jan 2004 NaN 1.000000 100.000000
5 Revenue LD Jan 2005 3.521800 1.035218 103.521800
6 Revenue LD Jan 2006 2.741700 1.027417 106.360057
7 Revenue LD Jan 2007 3.350600 1.033506 109.923757
如果需要四舍五入到 4 位精度,请在 .mul(100)
之后添加 .round(4)
:
stpassrev['index_value'] = stpassrev.groupby('Sector')['temp_factor'].cumprod().mul(100).round(4)
ticket_cat Sector Year factor temp_factor index_value
0 Revenue LSE Jan 2004 NaN 1.000000 100.0000
1 Revenue LSE Jan 2005 4.323543 1.043235 104.3235
2 Revenue LSE Jan 2006 3.096308 1.030963 107.5537
3 Revenue LSE Jan 2007 6.211666 1.062117 114.2346
4 Revenue LD Jan 2004 NaN 1.000000 100.0000
5 Revenue LD Jan 2005 3.521800 1.035218 103.5218
6 Revenue LD Jan 2006 2.741700 1.027417 106.3601
7 Revenue LD Jan 2007 3.350600 1.033506 109.9238
关于python - 如何在 pandas 数据框列上最好地执行递归,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58916783/