python - Group by in a Pandas DataFrame with a condition

My question builds on this thread, where we group the values of a pandas DataFrame and select the latest row (by date) from each group:

    id     product   date
0   220    6647     2014-09-01 
1   220    6647     2014-09-03 
2   220    6647     2014-10-16
3   826    3380     2014-11-11
4   826    3380     2014-12-09
5   826    3380     2015-05-19
6   901    4555     2014-09-01
7   901    4555     2014-10-05
8   901    4555     2014-11-01

using the following:

df.loc[df.groupby('id').date.idxmax()]
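
A minimal, self-contained sketch of that one-liner, assuming the table above is loaded into a frame named `df` with `date` parsed as datetime (the construction code and the name `latest` are illustrative, not from the original thread):

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# idxmax returns, per id, the index label of the row with the latest date;
# .loc then pulls those rows out of the original frame.
latest = df.loc[df.groupby("id")["date"].idxmax()]
print(latest)
```

On the sample data this keeps exactly one row per id (index labels 2, 5 and 8), which is why an extra grouping step is needed for the +/- 5 day condition.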

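However, suppose I want to add a condition: within each id, only rows whose dates are within +/- 5 days of each other should count as one group, and I want the latest row (by date) from each such group. That is, after grouping I want to find the latest row in each of the following groups: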

0   220    6647     2014-09-01 #because only these two are within +/- 5 days of each other
1   220    6647     2014-09-03 

2   220    6647     2014-10-16 #spaced more than 5 days apart from the above two records

3   826    3380     2014-11-11

.....

which yields:

    id  product       date
1  220     6647 2014-09-03 
2  220     6647 2014-10-16
3  826     3380 2014-11-11
4  826     3380 2014-12-09
5  826     3380 2015-05-19
6  901     4555 2014-09-01
7  901     4555 2014-10-05
8  901     4555 2014-11-01

The dataset with prices:

    id     product   date           price
0   220    6647     2014-09-01      100   #group 1
1   220    6647     2014-09-03      120   #group 1   --> pick this
2   220    6647     2014-09-05      0     #group 1
3   826    3380     2014-11-11      150   #group 2   --> pick this
4   826    3380     2014-12-09      23    #group 3   --> pick this
5   826    3380     2015-05-12      88    #group 4   --> pick this
6   901    4555     2015-05-15      32    #group 4   
7   901    4555     2015-10-05      542   #group 5   --> pick this
8   901    4555     2015-11-01      98    #group 6   --> pick this

Best answer

I think you need to create the groups with apply, using a list comprehension and between, then convert them to numeric groups with factorize, and finally use your solution with loc + idxmax:

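import pandas as pd

df['date'] = pd.to_datetime(df['date'])

df = df.reset_index(drop=True)
td = pd.Timedelta('5 days')

def f(x):
    # For each row, collect the index labels of all rows with the same id
    # whose date lies within +/- 5 days; identical tuples mean the same group.
    x['g'] = [tuple(x.index[x['date'].between(i - td, i + td)]) for i in x['date']]
    return x

# group_keys=False keeps the original flat index on pandas 2.x as well
df2 = df.groupby('id', group_keys=False).apply(f)
# Map each distinct tuple to a consecutive integer group number
df2['g'] = pd.factorize(df2['g'])[0]
print(df2)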
    id  product       date  price  g
0  220     6647 2014-09-01    100  0
1  220     6647 2014-09-03    120  0
2  220     6647 2014-09-05      0  0
3  826     3380 2014-11-11    150  1
4  826     3380 2014-12-09     23  2
5  826     3380 2015-05-12     88  3
6  901     4555 2015-05-15     32  4
7  901     4555 2015-10-05    542  5
8  901     4555 2015-11-01     98  6
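
To see where the group numbers come from, the sketch below isolates the two building blocks on a toy series (the names `dates`, `mask`, `windows` and `codes` are illustrative): `between` marks the rows inside one row's +/- 5 day window, and `pd.factorize` assigns one integer to each distinct tuple of index labels.

```python
import pandas as pd

td = pd.Timedelta("5 days")
dates = pd.Series(pd.to_datetime(["2014-09-01", "2014-09-03", "2014-10-16"]))

# Boolean mask: which rows fall within +/- 5 days of the first date?
first = dates.iloc[0]
mask = dates.between(first - td, first + td)

# One tuple of index labels per row, describing that row's window.
windows = [tuple(dates.index[dates.between(d - td, d + td)]) for d in dates]

# factorize maps each distinct tuple to an integer code, in order of appearance.
codes, uniques = pd.factorize(pd.Series(windows))

print(mask.tolist())
print(windows)
print(codes.tolist())
```

Rows only share a group number when their windows contain exactly the same rows, so the first two dates end up together and the third stands alone.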

df3 = df2.loc[df2.groupby('g')['price'].idxmax()]
print(df3)
    id  product       date  price  g
1  220     6647 2014-09-03    120  0
3  826     3380 2014-11-11    150  1
4  826     3380 2014-12-09     23  2
5  826     3380 2015-05-12     88  3
6  901     4555 2015-05-15     32  4
7  901     4555 2015-10-05    542  5
8  901     4555 2015-11-01     98  6
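
One caveat with window tuples: they only coincide when every row in a cluster sees exactly the same neighbours. Three dates spaced a few days apart (say 09-01, 09-04, 09-08) produce three different tuples and therefore three groups. If chained clusters are wanted instead, one alternative (not the answer's method, just a sketch under that assumption) is to sort by date within each id and start a new group whenever the gap to the previous row exceeds 5 days. The names `gap`, `new_group` and `picked` are illustrative.

```python
import pandas as pd

# Sample data copied from the priced table in the question.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-09-05",
        "2014-11-11", "2014-12-09", "2015-05-12",
        "2015-05-15", "2015-10-05", "2015-11-01",
    ]),
    "price": [100, 120, 0, 150, 23, 88, 32, 542, 98],
})

td = pd.Timedelta("5 days")
df = df.sort_values(["id", "date"]).reset_index(drop=True)

# Gap to the previous row of the same id (NaT for the first row of each id).
gap = df.groupby("id")["date"].diff()

# A new group starts at the first row of an id or after a gap larger than td.
new_group = gap.isna() | (gap > td)
df["g"] = new_group.cumsum() - 1

# Highest price per group, mirroring the loc + idxmax step above.
picked = df.loc[df.groupby("g")["price"].idxmax()]
print(picked)
```

On this particular dataset the gap-based numbering agrees with the `g` column shown above; the two approaches only diverge when a cluster spans more than 5 days end to end.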

For more on python - Group by in a Pandas DataFrame with a condition, see the similar question on Stack Overflow: https://stackoverflow.com/questions/53775950/
