I have a dataset like this:
user_id communication_type
7 Newsletter
7 Newsletter
7 Newsletter
7 Newsletter
7 Conference
7 Upcoming Events
7 Upcoming Events
7 Upcoming Events
7 Conference
7 Conference
7 Webinar
7 Hackathon
Expected output:
user_id communication_type sent_past
7 Newsletter 0
7 Newsletter 1
7 Newsletter 2
7 Newsletter 3
7 Conference 0
7 Upcoming Events 0
7 Upcoming Events 1
7 Upcoming Events 2
7 Conference 1
7 Conference 2
7 Webinar 0
7 Hackathon 0
Basically, I want a counter for each communication_type level within a given user_id.
A solution that works for me:
train['sent_past'] = train.groupby(['user_id','communication_type']).apply(lambda x: x.reset_index()).index.get_level_values(1)
But it is very slow on ~1m rows. How can I optimize it?
Best answer
You can use groupby.cumcount:
df['sent_past'] = df.groupby('communication_type').cumcount()
print(df)
user_id communication_type sent_past
0 7 Newsletter 0
1 7 Newsletter 1
2 7 Newsletter 2
3 7 Newsletter 3
4 7 Conference 0
5 7 Upcoming Events 0
6 7 Upcoming Events 1
7 7 Upcoming Events 2
8 7 Conference 1
9 7 Conference 2
10 7 Webinar 0
11 7 Hackathon 0
Or, for the combination of user_id and communication_type, use:
df['sent_past'] = df.groupby(['user_id', 'communication_type']).cumcount()
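As a self-contained sketch (rebuilding the sample data from the question), the vectorized cumcount reproduces the desired sent_past column without any apply/reset_index round trip:

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'user_id': [7] * 12,
    'communication_type': [
        'Newsletter', 'Newsletter', 'Newsletter', 'Newsletter',
        'Conference', 'Upcoming Events', 'Upcoming Events',
        'Upcoming Events', 'Conference', 'Conference',
        'Webinar', 'Hackathon',
    ],
})

# cumcount numbers the rows within each (user_id, communication_type)
# group in their original order, starting at 0 -- a vectorized
# replacement for the groupby().apply() + reset_index() trick.
df['sent_past'] = df.groupby(['user_id', 'communication_type']).cumcount()

print(df['sent_past'].tolist())
# -> [0, 1, 2, 3, 0, 0, 1, 2, 1, 2, 0, 0]
```

Because cumcount is implemented as a single vectorized groupby operation, it avoids the per-group Python-level calls that make the apply-based version slow on ~1m rows.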
Regarding "python - Create a counter by level in a pandas DataFrame, optimization", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50542981/