我有一个包含时间戳列和能源使用情况的数据框。时间戳取一天中的每一分钟,即每天总共 1440 个读数。我在数据框中几乎没有缺失值。
我想用过去两到三周的同一天同一时间的平均值来估算这些缺失值。这样,如果前一周也缺失,我可以使用两周前的值。
下面是一个数据示例:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
现在我有这行代码:df['mains_1'] = (df
.groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
.transform(lambda x: x.fillna(x.mean()))
)
所以它的作用是使用整个数据集一天中同一小时的平均使用量。我希望它更精确,并使用过去两三周的平均值。
最佳答案
您可以 concat
与系列一起 shift
在循环中,因为索引对齐将确保它在前几周与同一小时匹配。然后拿mean
并使用 .fillna
更新原来的
样本数据
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
data = np.random.choice([1,2,3,4, np.NaN], 10),
columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
代码
# range(4) for previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
关于python - 如何使用python中前一周(天)的同一天和时间的值来估算时间序列数据中的缺失值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/67874000/