Python Pandas: incorrect transaction count by date

Using the following Python pandas DataFrame "df":

Customer_ID | Transaction_ID  | Item_ID
ABC           2017-04-12-333    X8973
ABC           2017-04-12-333    X2468
ABC           2017-05-22-658    X2906
ABC           2017-05-22-757    X8790
ABC           2017-07-13-864    X8790     
BCD           2017-08-11-879    X2346
BCD           2017-08-11-879    X2468

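I want to add a column that numbers each customer's transactions by date (first transaction, second transaction, and so on). If two transactions fall on the same day, I count them as the same transaction: I have no time of day, so I can't tell which one came first.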

#get the date out of the Transaction_ID string
df['date'] = pd.to_datetime(df.Transaction_ID.str[:10])

#calculate the transaction number
df['trans_nr'] = df.groupby(['Customer_ID',"Transaction_ID", df['date'].dt.year]).cumcount()+1

Unfortunately, this is the output of my code above:

Customer_ID | Transaction_ID  | Item_ID | date        | trans_nr
ABC           2017-04-12-333    X8973     2017-04-12     1
ABC           2017-04-12-333    X2468     2017-04-12     2
ABC           2017-05-22-658    X2906     2017-05-22     1
ABC           2017-05-22-757    X8790     2017-05-22     1
ABC           2017-07-13-864    X8790     2017-07-13     1
BCD           2017-08-11-879    X2346     2017-08-11     1
BCD           2017-08-11-879    X2468     2017-08-11     2

This is incorrect. This is the output I am looking for:

Customer_ID | Transaction_ID  | Item_ID | date        | trans_nr
ABC           2017-04-12-333    X8973     2017-04-12     1
ABC           2017-04-12-333    X2468     2017-04-12     1
ABC           2017-05-22-658    X2906     2017-05-22     2
ABC           2017-05-22-757    X8790     2017-05-22     2
ABC           2017-07-13-864    X8790     2017-07-13     3
BCD           2017-08-11-879    X2346     2017-08-11     1
BCD           2017-08-11-879    X2468     2017-08-11     1

Perhaps the logic should be based only on Customer_ID and date (without Transaction_ID)?

I tried this:

df['trans_nr'] = df.groupby(['Customer_ID','date']).cumcount()+1

But it also counts incorrectly.

Best answer

Let's try:

df['trans_nr'] = df.groupby(['Customer_ID', df['date'].dt.year])['date']\
                   .transform(lambda x: (x.diff() != pd.Timedelta('0 days')).cumsum())

Output:

 Customer_ID  Transaction_ID Item_ID       date  trans_nr
0         ABC  2017-04-12-333   X8973 2017-04-12         1
1         ABC  2017-04-12-333   X2468 2017-04-12         1
2         ABC  2017-05-22-658   X2906 2017-05-22         2
3         ABC  2017-05-22-757   X8790 2017-05-22         2
4         ABC  2017-07-13-864   X8790 2017-07-13         3
5         BCD  2017-08-11-879   X2346 2017-08-11         1
6         BCD  2017-08-11-879   X2468 2017-08-11         1
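The diff/cumsum approach above compares each row's date with the previous row in its group, so it relies on the rows already being sorted by date within each customer. As a possible alternative (not part of the original answer), a dense rank of the date within each customer numbers distinct dates directly and should not depend on row order. The sketch below rebuilds the sample frame from the question and, as an assumption, groups by Customer_ID only, so the numbering would not restart each year as it does in the answer's version.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer_ID": ["ABC"] * 5 + ["BCD"] * 2,
    "Transaction_ID": [
        "2017-04-12-333", "2017-04-12-333", "2017-05-22-658",
        "2017-05-22-757", "2017-07-13-864", "2017-08-11-879",
        "2017-08-11-879",
    ],
    "Item_ID": ["X8973", "X2468", "X2906", "X8790", "X8790", "X2346", "X2468"],
})

# The first 10 characters of Transaction_ID hold the date (YYYY-MM-DD).
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10])

# Dense rank: rows sharing a date get the same number, and the next
# distinct date gets the next integer, with no gaps.
# Assumption: numbering is per customer across all years.
df["trans_nr"] = (
    df.groupby("Customer_ID")["date"].rank(method="dense").astype(int)
)

print(df)
```

By contrast, cumcount() numbers rows within each group, which is why grouping by Customer_ID and date (or Transaction_ID) restarts the count at 1 for every date instead of numbering the dates themselves.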

A similar question about incorrect date counting in Python pandas can be found on Stack Overflow: https://stackoverflow.com/questions/46792634/
