I'm trying to upsample within a grouped DataFrame, but I'm not sure how to make it upsample only within each group. I have a DataFrame that looks like this:
cat weekstart date
0.0 2016-07-04 00:00:00+00:00 2016-07-04 1
2016-07-06 1
2016-07-07 2
2016-08-15 00:00:00+00:00 2016-08-16 1
2016-08-19 1
2016-09-19 00:00:00+00:00 2016-09-20 1
2016-09-21 1
2016-12-19 00:00:00+00:00 2016-12-19 1
2016-12-21 1
1.0 2016-07-25 00:00:00+00:00 2016-07-26 2
2016-08-01 00:00:00+00:00 2016-08-03 1
2016-08-08 00:00:00+00:00 2016-08-12 1
If I do something like df.unstack().fillna(0).stack(), I get:
cat weekstart date
0.0 2016-07-04 00:00:00+00:00 2016-1-1 0
.
.
.
2016-07-04 1
2016-07-06 1
2016-07-07 2
because the minimum value in the date column is 2016-1-1. What I'm after, though, is to sample business days only within each cat and weekstart, e.g.:
cat weekstart date
0.0 2016-07-04 00:00:00+00:00 2016-07-04 1
2016-07-05 0
2016-07-06 1
2016-07-07 2
2016-07-08 0
2016-08-15 00:00:00+00:00 2016-08-15 0
2016-08-16 1
2016-08-17 0
2016-08-18 0
2016-08-19 1
I tried using:
level_values = df.index.get_level_values
df.groupby(
    [level_values(i) for i in [0, 1]] + [pd.Grouper('B', level=-1)]
).sum()
but it doesn't work as expected.
Best answer
I think you need a custom function that uses reindex with a MultiIndex created by bdate_range:
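def f(x):
    # Every row in a group shares the same cat and weekstart, so read them from the first row.
    lvl0 = x.index.get_level_values(0)[0]
    lvl1 = x.index.get_level_values(1)[0]
    # Five business days starting at weekstart.
    lvl2 = pd.bdate_range(start=lvl1, periods=5)
    mux = pd.MultiIndex.from_product([[lvl0], [lvl1], lvl2], names=x.index.names)
    # Dates missing from the group are filled with 0.
    return x.reindex(mux, fill_value=0)

s1 = s.groupby(['cat','weekstart'], group_keys=False).apply(f)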
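To see what the bdate_range call inside f contributes, here is a minimal sketch for a single weekstart. The variable names are illustrative only. It also shows one thing to watch for: the weekstart values in the question are tz-aware (+00:00), and a tz-aware start gives tz-aware dates, so they may need tz_localize(None) before they can match a tz-naive date level. Whether that step is needed depends on your actual data.

```python
import pandas as pd

# 2016-07-04 is a Monday, so five business days cover Monday to Friday.
week_start = pd.Timestamp('2016-07-04')
days = pd.bdate_range(start=week_start, periods=5)
print(days.strftime('%Y-%m-%d').tolist())

# A tz-aware start produces a tz-aware DatetimeIndex.
aware_start = pd.Timestamp('2016-07-04', tz='UTC')
aware_days = pd.bdate_range(start=aware_start, periods=5)

# Drop the timezone (keeping the wall-clock dates) if the date level is tz-naive.
naive_days = aware_days.tz_localize(None)
```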
print (s1)
cat weekstart date
0.0 2016-07-04 2016-07-04 1
2016-07-05 0
2016-07-06 1
2016-07-07 2
2016-07-08 0
2016-08-15 2016-08-15 0
2016-08-16 1
2016-08-17 0
2016-08-18 0
2016-08-19 1
2016-09-19 2016-09-19 0
2016-09-20 1
2016-09-21 1
2016-09-22 0
2016-09-23 0
2016-12-19 2016-12-19 1
2016-12-20 0
2016-12-21 1
2016-12-22 0
2016-12-23 0
1.0 2016-07-25 2016-07-25 0
2016-07-26 2
2016-07-27 0
2016-07-28 0
2016-07-29 0
2016-08-01 2016-08-01 0
2016-08-02 0
2016-08-03 1
2016-08-04 0
2016-08-05 0
2016-08-08 2016-08-08 0
2016-08-09 0
2016-08-10 0
2016-08-11 0
2016-08-12 1
Name: a, dtype: int64
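As a possible alternative, the same result can probably be reached without groupby.apply by building the full target index once and calling reindex a single time. This is only a sketch under assumptions: upsample_business_days is a hypothetical helper name, the index is assumed to be unique, and the last index level is assumed to hold the dates. The small sample Series is made up for the demonstration and is not the data from the question.

```python
import pandas as pd

def upsample_business_days(s, periods=5):
    """Hypothetical helper: give every (cat, weekstart) pair `periods` business days."""
    # Unique (cat, weekstart) pairs, sorted so the result is ordered by group.
    pairs = s.index.droplevel(-1).unique().sort_values()
    # One tuple per business day of each pair.
    tuples = [
        (cat, week, day)
        for cat, week in pairs
        for day in pd.bdate_range(start=week, periods=periods)
    ]
    full = pd.MultiIndex.from_tuples(tuples, names=s.index.names)
    # Rows that did not exist in s are filled with 0.
    return s.reindex(full, fill_value=0)

# Illustrative sample: two groups with sparse dates.
sample = pd.Series({
    (0.0, pd.Timestamp('2016-07-04'), pd.Timestamp('2016-07-04')): 1,
    (0.0, pd.Timestamp('2016-07-04'), pd.Timestamp('2016-07-07')): 2,
    (1.0, pd.Timestamp('2016-07-25'), pd.Timestamp('2016-07-26')): 2,
}).rename_axis(['cat', 'weekstart', 'date'])

out = upsample_business_days(sample)
print(out)
```

A single reindex avoids calling a Python function once per group, which may matter when there are many groups, though the list comprehension still loops over every pair.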
Setup:
import pandas as pd

d = {
    (0.0, pd.Timestamp('2016-07-04 00:00:00'), pd.Timestamp('2016-07-07 00:00:00')): 2,
    (1.0, pd.Timestamp('2016-07-25 00:00:00'), pd.Timestamp('2016-07-26 00:00:00')): 2,
    (0.0, pd.Timestamp('2016-08-15 00:00:00'), pd.Timestamp('2016-08-16 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-07-04 00:00:00'), pd.Timestamp('2016-07-04 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-09-19 00:00:00'), pd.Timestamp('2016-09-20 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-09-19 00:00:00'), pd.Timestamp('2016-09-21 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-12-19 00:00:00'), pd.Timestamp('2016-12-19 00:00:00')): 1,
    (1.0, pd.Timestamp('2016-08-08 00:00:00'), pd.Timestamp('2016-08-12 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-07-04 00:00:00'), pd.Timestamp('2016-07-06 00:00:00')): 1,
    (1.0, pd.Timestamp('2016-08-01 00:00:00'), pd.Timestamp('2016-08-03 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-12-19 00:00:00'), pd.Timestamp('2016-12-21 00:00:00')): 1,
    (0.0, pd.Timestamp('2016-08-15 00:00:00'), pd.Timestamp('2016-08-19 00:00:00')): 1,
}
# sort_index() keeps the printed order below regardless of dict insertion order.
s = pd.Series(d).rename_axis(['cat','weekstart','date']).sort_index()
print (s)
cat weekstart date
0.0 2016-07-04 2016-07-04 1
2016-07-06 1
2016-07-07 2
2016-08-15 2016-08-16 1
2016-08-19 1
2016-09-19 2016-09-20 1
2016-09-21 1
2016-12-19 2016-12-19 1
2016-12-21 1
1.0 2016-07-25 2016-07-26 2
2016-08-01 2016-08-03 1
2016-08-08 2016-08-12 1
dtype: int64
Regarding python - upsampling in a pandas MultiIndex, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48665651/