My question builds on this thread, where we group the rows of a pandas DataFrame and select the latest one (by date) from each group:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
using the following:
df.loc[df.groupby('id').date.idxmax()]
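For anyone who wants to run this, here is a minimal sketch that rebuilds the sample frame above and applies the same one-liner. The variable name `latest` is mine, and I am assuming the date column should be parsed with `pd.to_datetime`.

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# idxmax gives, per id, the index label of the row with the latest date;
# loc then pulls those rows out of the original frame
latest = df.loc[df.groupby("id")["date"].idxmax()]
print(latest)
```

With this data, one row per id survives (index labels 2, 5 and 8).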
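However, suppose I want to add a condition: the latest row (by date) should only be picked among rows that lie within +/- 5 days of each other. That is, after grouping by id I want to find the latest row within each of the following sub-groups: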
0 220 6647 2014-09-01 #because only these two are within +/- 5 days of each other
1 220 6647 2014-09-03
2 220 6647 2014-10-16 #more than 5 days away from the two records above
3 826 3380 2014-11-11
.....
producing
id product date
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
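One way to read the requirement is "start a new sub-group whenever the gap to the previous row of the same id is more than 5 days". The sketch below follows that reading with `diff` and `cumsum`. This is my assumption about the intended semantics, not the method from the answer further down, and the names `gap`, `new_cluster`, `cluster` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# Sort so that consecutive rows of one id are in date order
df = df.sort_values(["id", "date"])

# Gap to the previous row of the same id (NaT for the first row of each id)
gap = df.groupby("id")["date"].diff()

# Assumption: a new sub-group starts at the first row of an id,
# or when the gap exceeds 5 days
new_cluster = gap.isna() | (gap > pd.Timedelta("5 days"))
df["cluster"] = new_cluster.cumsum()

# Latest row (by date) within each sub-group
result = df.loc[df.groupby("cluster")["date"].idxmax()].drop(columns="cluster")
print(result)
```

On the sample data this keeps index labels 1 through 8 and drops only row 0. Note that gap-based grouping chains: dates 5 days apart in a long run would all land in one sub-group, which may or may not be what you want.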
Dataset with prices:
id product date price
0 220 6647 2014-09-01 100 #group 1
1 220 6647 2014-09-03 120 #group 1 --> pick this
2 220 6647 2014-09-05 0 #group 1
3 826 3380 2014-11-11 150 #group 2 --> pick this
4 826 3380 2014-12-09 23 #group 3 --> pick this
5 826 3380 2015-05-12 88 #group 4 --> pick this
6 901 4555 2015-05-15 32 #group 4
7 901 4555 2015-10-05 542 #group 5 --> pick this
8 901 4555 2015-11-01 98 #group 6 --> pick this
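Best answer
I think you need to create the groups inside apply, using a list comprehension with between, then convert them to numeric group labels with factorize, and finally use your own solution with loc + idxmax: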
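import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.reset_index(drop=True)
td = pd.Timedelta('5 days')
def f(x):
    # for each row, collect the index labels of all rows of the same id
    # whose date lies within +/- 5 days; rows sharing a tuple share a group
    x['g'] = [tuple((x.index[x['date'].between(i - td, i + td)])) for i in x['date']]
    return x
# group_keys=False keeps the original flat index in the result
df2 = df.groupby('id', group_keys=False).apply(f)
# turn the tuples into integer group labels 0, 1, 2, ...
df2['g'] = pd.factorize(df2['g'])[0]
print (df2)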
id product date price g
0 220 6647 2014-09-01 100 0
1 220 6647 2014-09-03 120 0
2 220 6647 2014-09-05 0 0
3 826 3380 2014-11-11 150 1
4 826 3380 2014-12-09 23 2
5 826 3380 2015-05-12 88 3
6 901 4555 2015-05-15 32 4
7 901 4555 2015-10-05 542 5
8 901 4555 2015-11-01 98 6
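To see what the tuple keys look like, and where this grouping can surprise you, here is a small sketch on two hand-made date series. The helper `window_keys` and both series are hypothetical, written only for this illustration.

```python
import pandas as pd

td = pd.Timedelta("5 days")

def window_keys(dates):
    """For each date, the tuple of index labels within +/- 5 days of it."""
    return [tuple(dates.index[dates.between(d - td, d + td)]) for d in dates]

# All three dates are within 5 days of each other: every row gets the same key
tight = pd.Series(pd.to_datetime(["2014-09-01", "2014-09-03", "2014-09-05"]))
keys_tight = window_keys(tight)

# First and last date are 8 days apart: every row gets a different key
chained = pd.Series(pd.to_datetime(["2014-09-01", "2014-09-05", "2014-09-09"]))
keys_chained = window_keys(chained)

# factorize numbers the distinct keys in order of first appearance
codes_tight = pd.factorize(pd.Series(keys_tight))[0]
codes_chained = pd.factorize(pd.Series(keys_chained))[0]

print(keys_tight, codes_tight)
print(keys_chained, codes_chained)
```

Rows end up in the same group only when their +/- 5 day windows contain exactly the same rows. In the second series the middle row sees both neighbours while the outer rows see only one, so the three rows become three separate groups.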
df3 = df2.loc[df2.groupby('g')['price'].idxmax()]
print (df3)
id product date price g
1 220 6647 2014-09-03 120 0
3 826 3380 2014-11-11 150 1
4 826 3380 2014-12-09 23 2
5 826 3380 2015-05-12 88 3
6 901 4555 2015-05-15 32 4
7 901 4555 2015-10-05 542 5
8 901 4555 2015-11-01 98 6
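Since the behaviour of groupby.apply with respect to the grouping column and the result index has changed across pandas releases, here is a sketch of the same idea with a plain loop over the groups instead of apply. It is a variant of the answer under my own assumptions (a default 0..n-1 index, and the hypothetical list `keys`), not the answer's exact code.

```python
import pandas as pd

# Price dataset copied from the question
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-09-05",
        "2014-11-11", "2014-12-09", "2015-05-12",
        "2015-05-15", "2015-10-05", "2015-11-01",
    ]),
    "price": [100, 120, 0, 150, 23, 88, 32, 542, 98],
})

td = pd.Timedelta("5 days")

# One key per row, stored by position (assumes the default 0..n-1 index)
keys = [None] * len(df)
for _, grp in df.groupby("id"):
    for label, d in zip(grp.index, grp["date"]):
        # index labels of same-id rows within +/- 5 days of this row
        keys[label] = tuple(grp.index[grp["date"].between(d - td, d + td)])

# Integer group labels in order of first appearance
df["g"] = pd.factorize(pd.Series(keys, index=df.index))[0]

# Row with the highest price in each group
df3 = df.loc[df.groupby("g")["price"].idxmax()]
print(df3)
```

On this data the group labels come out as 0, 0, 0, 1, 2, 3, 4, 5, 6 and the selected rows are 1, 3, 4, 5, 6, 7, 8, the same as the output above.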
Regarding "python - group by in a pandas DataFrame with a condition", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53775950/