python - pandas: compute the ratio of two categories per group and append it to the DataFrame as a new column using .pipe()

Tags: python pandas pipe pandas-groupby

I have a pandas DataFrame that looks like this:

import pandas as pd

# Assign the frame to df so that the later snippets can refer to it
df = pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
                   "BBB":["y1","y1","y2","y2","y2","y1"],
                   "CCC":["t1","t2","t3","t1","t1","t1"],
                   "DDD":[10,11,18,17,21,30]})
df

Out[1]:
  AAA BBB CCC  DDD
0  x1  y1  t1   10
1  x1  y1  t2   11
2  x1  y2  t3   18
3  x2  y2  t1   17
4  x2  y2  t1   21
5  x2  y1  t1   30

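Problem

I want to group by column AAA, which gives me 2 groups: x1 and x2.

For each group, I want to compute the ratio of y1 to y2 in column BBB,

and assign the result to a new column, Ratio of BBB.

Desired output

So this is the output I am after: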

pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
              "BBB":["y1","y1","y2","y2","y2","y1"],
              "CCC":["t1","t2","t3","t1","t1","t1"],
              "DDD":[10,11,18,17,21,30],
              "Ratio of BBB":[0.33,0.33,0.33,0.66,0.66,0.66]})

Out[2]:
  AAA BBB CCC  DDD  Ratio of BBB
0  x1  y1  t1   10          0.33
1  x1  y1  t2   11          0.33
2  x1  y2  t3   18          0.33
3  x2  y2  t1   17          0.66
4  x2  y2  t1   21          0.66
5  x2  y1  t1   30          0.66

Current status

So far I have achieved it like this:

def f(df):
  df["y1"] = sum(df["BBB"] == "y1")
  df["y2"] = sum(df["BBB"] == "y2")
  df["Ratio of BBB"] = df["y2"] / df["y1"]
  return df

df.groupby(df.AAA).apply(f)

What I want to achieve

Is it possible to do this with the .pipe() function?

I was thinking of something like this:

df = (df
 .groupby(df.AAA) # groupby a column not included in the current series (df.colname)
 .BBB
 .value_counts()
 .pipe(lambda series: series["BBB"] == "y2" / series["BBB"] == "y1")
 )

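Edit: one solution using pipe()

Note: user jpp made a clear point in a comment below:

unstack / merge / reset_index operations are unnecessary and expensive

However, this is the approach I originally had in mind, so I thought I would share it here!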

df = (df
      .groupby(df.AAA)                     # groupby the column
      .BBB                                 # select the column with values to calculate ('BBB' with y1 & y2)
      .value_counts()                      # calculate the values (# of y1 per group, # of y2 per group)
      .unstack()                           # turn the rows into columns (y1, y2)
      .pipe(lambda df: df["y1"]/df["y2"])  # calculate the ratio of y1:y2 (outputs a Series)
      .rename("ratio")                     # rename the series 'ratio' so it will be ratio column in output df
      .reset_index()                       # turn the groupby series into a dataframe
      .merge(df)                           # merge with the original dataframe filling in the columns with the key (AAA)
      )
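
As a minimal sketch of a .pipe() variant that skips the reset_index / merge steps mentioned in the comment: the per-group counts are reduced to one ratio per AAA value inside .pipe(), and the result is mapped back onto the rows. It assumes the ratio wanted is y2 divided by the group total (which is what the desired output above shows); the names ratio and out are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1"],
                   "CCC": ["t1", "t2", "t3", "t1", "t1", "t1"],
                   "DDD": [10, 11, 18, 17, 21, 30]})

# One ratio per AAA group: count of y2 divided by the total count in the group
ratio = (
    df.groupby("AAA")["BBB"]
      .value_counts()                 # counts indexed by (AAA, BBB)
      .unstack(fill_value=0)          # rows: AAA, columns: y1 / y2
      .pipe(lambda counts: counts["y2"] / counts.sum(axis=1))
)

# Map each row's AAA value to its group ratio; the original df is left untouched
out = df.assign(**{"Ratio of BBB": df["AAA"].map(ratio)})
print(out)
```

Because ratio is a Series indexed by AAA, Series.map can look up each row's group directly, so no join is needed.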

Best answer

It looks like what you want is the ratio of y2 to the total. Use groupby + value_counts:

v = df.groupby('AAA').BBB.value_counts().unstack()
df['RATIO'] = df.AAA.map(v.y2 / (v.y2 + v.y1))

  AAA BBB CCC  DDD     RATIO
0  x1  y1  t1   10  0.333333
1  x1  y1  t2   11  0.333333
2  x1  y2  t3   18  0.333333
3  x2  y2  t1   17  0.666667
4  x2  y2  t1   21  0.666667
5  x2  y1  t1   30  0.666667
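
To make the arithmetic behind the result above easier to follow, here is a small sketch that inspects the intermediate table v and the per-group ratio before it is mapped back onto the rows. The name share_y2 is only illustrative and is not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1"],
                   "CCC": ["t1", "t2", "t3", "t1", "t1", "t1"],
                   "DDD": [10, 11, 18, 17, 21, 30]})

# v has one row per AAA group and one column per BBB category:
#   x1 -> y1: 2, y2: 1
#   x2 -> y1: 1, y2: 2
v = df.groupby("AAA")["BBB"].value_counts().unstack()
print(v)

# Share of y2 within each group: 1/3 for x1 and 2/3 for x2
share_y2 = v["y2"] / (v["y1"] + v["y2"])
print(share_y2)

# map() looks up each row's AAA value in the index of share_y2
df["RATIO"] = df["AAA"].map(share_y2)
```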

To generalize to many groups, you can use:

df['RATIO'] = df.AAA.map(v.y2 / v.sum(axis=1))
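
One possible alternative, sketched here rather than taken from the answer, is to skip the intermediate table altogether: flag the y2 rows and take the per-group mean of that flag with transform. A side effect is that a group with no y2 rows gets 0.0, whereas a plain unstack() would leave a NaN count for it. The group x3 below is a hypothetical extra row added only to show that case.

```python
import pandas as pd

df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2", "x3"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1", "y1"]})
# "x3" is a hypothetical group containing only y1, not part of the question's data

# 1.0 where BBB is y2, 0.0 otherwise
is_y2 = df["BBB"].eq("y2").astype(float)

# The mean of the flag within each AAA group is the share of y2 in that group;
# transform broadcasts the group value back to every row of the group
df["RATIO"] = is_y2.groupby(df["AAA"]).transform("mean")
print(df)
```

Since transform returns a result aligned with the original index, no map or merge step is needed to attach the column.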

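Regarding "python - pandas: compute the ratio of two categories per group and append it to the DataFrame as a new column using .pipe()", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50892309/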
