I have a dataset like this:
user_id communication_type
7 Newsletter
7 Newsletter
7 Newsletter
7 Newsletter
7 Conference
7 Upcoming Events
7 Upcoming Events
7 Upcoming Events
7 Conference
7 Conference
7 Webinar
7 Hackathon
Expected output:
user_id communication_type sent_past
7 Newsletter 0
7 Newsletter 1
7 Newsletter 2
7 Newsletter 3
7 Conference 0
7 Upcoming Events 0
7 Upcoming Events 1
7 Upcoming Events 2
7 Conference 1
7 Conference 2
7 Webinar 0
7 Hackathon 0
Basically, I want a counter for each communication_type level within a given user_id.
A solution that works for me:
train['sent_past'] = train.groupby(['user_id','communication_type']).apply(lambda x: x.reset_index()).index.get_level_values(1)
But it is very slow on ~1m rows. How can I optimize it?
Best answer
You can use groupby.cumcount:
df['sent_past'] = df.groupby('communication_type').cumcount()
print(df)
user_id communication_type sent_past
0 7 Newsletter 0
1 7 Newsletter 1
2 7 Newsletter 2
3 7 Newsletter 3
4 7 Conference 0
5 7 Upcoming Events 0
6 7 Upcoming Events 1
7 7 Upcoming Events 2
8 7 Conference 1
9 7 Conference 2
10 7 Webinar 0
11 7 Hackathon 0
Or, for the combination of user_id and communication_type, use:
df['sent_past'] = df.groupby(['user_id', 'communication_type']).cumcount()
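As a self-contained sketch (rebuilding the sample data from the question), the vectorized cumcount reproduces the desired sent_past column without any apply/reset_index round trip:

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'user_id': [7] * 12,
    'communication_type': [
        'Newsletter', 'Newsletter', 'Newsletter', 'Newsletter',
        'Conference', 'Upcoming Events', 'Upcoming Events',
        'Upcoming Events', 'Conference', 'Conference',
        'Webinar', 'Hackathon',
    ],
})

# cumcount numbers the rows within each (user_id, communication_type)
# group in their original order, starting at 0 -- a vectorized
# replacement for the groupby().apply() + reset_index() trick.
df['sent_past'] = df.groupby(['user_id', 'communication_type']).cumcount()

print(df['sent_past'].tolist())
# -> [0, 1, 2, 3, 0, 0, 1, 2, 1, 2, 0, 0]
```

Because cumcount is implemented as a single vectorized groupby operation, it avoids the per-group Python-level calls that make the apply-based version slow on ~1m rows.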
Regarding "python - Create a counter by level in a pandas DataFrame, optimization", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50542981/