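python - Is there a faster way to compute a historical ratio on a pandas groupby object, conditional on a value?

Tags: python pandas pandas-groupby

Here is a sample dataframe with two players, annotated with the expected output: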

+--------+------------+-----------------------------------+
| Player |   Result   |    Winning ratio (historical)     |
+--------+------------+-----------------------------------+
| K2000  | Lose       | 0% #first game, so no history     |
| K2000  | Lose       | 0% #0 games won out of 1 played   |
| K2000  | Win        | 0% #0 games won out of 2 played   |
| K2000  | Not ranked | 33% #1 game won out of 3 played   |
| K2000  | Lose       | 25% #and so on.                   |
| K2000  | Win        | 20%                               |
| K2000  | Win        | 33%                               |
| Kssis  | Win        | 0%                                |
| Kssis  | Win        | 100%                              |
| Kssis  | Not ranked | 100%                              |
| Kssis  | Lose       | 66%                               |
| Kssis  | Win        | 50%                               |
+--------+------------+-----------------------------------+
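For anyone who wants to reproduce this, here is a minimal sketch that builds the sample as a DataFrame. The column names Player and Result come from the table header; the variable name df matches the code in the question, and the final print is only a sanity check.

```python
import pandas as pd

# Sample data copied from the table above (column names follow the table header).
df = pd.DataFrame({
    'Player': ['K2000'] * 7 + ['Kssis'] * 5,
    'Result': ['Lose', 'Lose', 'Win', 'Not ranked', 'Lose', 'Win', 'Win',
               'Win', 'Win', 'Not ranked', 'Lose', 'Win'],
})

# Number of games per player, just to confirm the frame was built as intended.
print(df.groupby('Player').size())
```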

To get it, I did the following:

df['sucess'] = df.apply(lambda row: 1 if row['Result'] == 'Win' else 0, axis = 1)
df['nb_of_contests'] = df.apply(lambda row: 1 , axis = 1)
#gives
+--------+------------+--------+----------------+
| Player |   Result   | Sucess | Nb_of_contests |
+--------+------------+--------+----------------+
| K2000  | Lose       |      0 |              1 |
| K2000  | Lose       |      0 |              1 |
| K2000  | Win        |      1 |              1 |
| K2000  | Not ranked |      0 |              1 |
| K2000  | Lose       |      0 |              1 |
| K2000  | Win        |      1 |              1 |
| K2000  | Win        |      1 |              1 |
| Kssis  | Win        |      1 |              1 |
| Kssis  | Win        |      1 |              1 |
| Kssis  | Not ranked |      0 |              1 |
| Kssis  | Lose       |      0 |              1 |
| Kssis  | Win        |      1 |              1 |
+--------+------------+--------+----------------+

#then the cumulative sums (a list of columns is needed; a bare tuple no longer works in recent pandas)
cumul = df.groupby('Player')[['sucess','nb_of_contests']].cumsum()
#cumul gives
+--------+------------+--------+----------------+
| Player |   Result   | Sucess | Nb_of_contests |
+--------+------------+--------+----------------+
| K2000  | Lose       |      0 |              1 |
| K2000  | Lose       |      0 |              2 |
| K2000  | Win        |      1 |              3 |
| K2000  | Not ranked |      1 |              4 |
| K2000  | Lose       |      1 |              5 |
| K2000  | Win        |      2 |              6 |
| K2000  | Win        |      3 |              7 |
| Kssis  | Win        |      1 |              1 |
| Kssis  | Win        |      2 |              2 |
| Kssis  | Not ranked |      2 |              3 |
| Kssis  | Lose       |      2 |              4 |
| Kssis  | Win        |      3 |              5 |
+--------+------------+--------+----------------+

#then compute the ratio
winning_ratio = cumul['sucess']/cumul['nb_of_contests']
#finally shift
gb_winning_ratio = winning_ratio.groupby(df['Player']) #regroup in order to shift within each group, because cumul is a DataFrame, not a groupby object.
winning_ratio_shifted = gb_winning_ratio.shift(1)

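So, is there a simpler way? I think this can be simplified, but I don't have the skills to improve it myself. Please don't hesitate to give an in-depth explanation; above all, I want to understand it.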

Pandas version: 0.23.4, Python version: 3.7.4

Best answer

Notice:

To avoid:

ValueError: cannot reindex from a duplicate axis

create a default RangeIndex:

df = df.reset_index(drop=True)
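To illustrate what the reset does, here is a small sketch with a made-up frame whose index contains duplicate labels (the data and index values are only for illustration). After reset_index(drop=True) the index is a unique 0..n-1 RangeIndex, so the results of group-wise operations can be aligned back to df without ambiguity.

```python
import pandas as pd

# Hypothetical frame with duplicated index labels, e.g. after concatenating two frames.
df = pd.DataFrame(
    {'Player': ['K2000', 'K2000', 'Kssis'], 'Result': ['Lose', 'Win', 'Win']},
    index=[0, 0, 1],
)
print(df.index.is_unique)   # False: label 0 appears twice

# drop=True discards the old index instead of keeping it as a column.
df = df.reset_index(drop=True)
print(df.index.is_unique)   # True
print(list(df.index))       # [0, 1, 2]
```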

Then use:

df['sucess'] = (df['Result'] == 'Win').astype(int)
df['nb_of_contests'] = 1

cumul = df.groupby('Player')[['sucess','nb_of_contests']].cumsum()
winning_ratio = cumul['sucess'].div(cumul['nb_of_contests'])

winning_ratio_shifted = winning_ratio.groupby(df['Player']).shift().fillna(0)

print (winning_ratio_shifted)
0     0.000000
1     0.000000
2     0.000000
3     0.333333
4     0.250000
5     0.200000
6     0.333333
7     0.000000
8     1.000000
9     1.000000
10    0.666667
11    0.500000
dtype: float64
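A possible variation, not part of the answer above: the helper columns can be skipped by counting each player's previous games with GroupBy.cumcount and subtracting the current game from the cumulative win count. This is only a sketch on the sample data from the question; the names wins, prev_wins and prev_games are mine.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Player': ['K2000'] * 7 + ['Kssis'] * 5,
    'Result': ['Lose', 'Lose', 'Win', 'Not ranked', 'Lose', 'Win', 'Win',
               'Win', 'Win', 'Not ranked', 'Lose', 'Win'],
})

wins = df['Result'].eq('Win').astype(int)

# Wins strictly before the current game: cumulative wins minus the current result.
prev_wins = wins.groupby(df['Player']).cumsum() - wins
# Games strictly before the current one: 0 for each player's first game.
prev_games = df.groupby('Player').cumcount()

# 0 / 0 gives NaN for the first game of each player; replace it with 0.
winning_ratio_shifted = prev_wins.div(prev_games).fillna(0)
print(winning_ratio_shifted)
```

Both cumsum and cumcount are built-in group-wise operations, so no Python-level lambda runs per group.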

Alternatively, you can use a one-line solution with DataFrame.assign, chaining cumsum and shift per group:

winning_ratio_shifted = (df.assign(sucess = (df['Result'] == 'Win').astype(int),
                                   nb_of_contests = 1)
                          # group_keys=False keeps the original index in the apply result
                          .groupby('Player', group_keys=False)[['sucess','nb_of_contests']]
                          .apply(lambda x: x.cumsum().shift())
                          .assign(new=lambda x: x['sucess'] / x['nb_of_contests'])['new']
                          .fillna(0)
                        )

print (winning_ratio_shifted)
0     0.000000
1     0.000000
2     0.000000
3     0.333333
4     0.250000
5     0.200000
6     0.333333
7     0.000000
8     1.000000
9     1.000000
10    0.666667
11    0.500000
Name: new, dtype: float64
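Another way to read the same computation, offered as a sketch rather than as the answerer's method: the cumulative win ratio is the expanding mean of the 0/1 win flag, so each group can be transformed with expanding().mean() and then shifted by one game. The variable name win is illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Player': ['K2000'] * 7 + ['Kssis'] * 5,
    'Result': ['Lose', 'Lose', 'Win', 'Not ranked', 'Lose', 'Win', 'Win',
               'Win', 'Win', 'Not ranked', 'Lose', 'Win'],
})

win = df['Result'].eq('Win').astype(int)

# Per player: running mean of the win flag, shifted so each row only sees earlier games.
winning_ratio_shifted = (
    win.groupby(df['Player'])
       .transform(lambda s: s.expanding().mean().shift())
       .fillna(0)
)
print(winning_ratio_shifted)
```

This is short to write, but the lambda is called once per group, so with many players the cumsum-based versions are likely to be faster.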

For "python - Is there a faster way to compute a historical ratio on a pandas groupby object, conditional on a value?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58913240/
