I have a pandas DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
                   "BBB":["y1","y1","y2","y2","y2","y1"],
                   "CCC":["t1","t2","t3","t1","t1","t1"],
                   "DDD":[10,11,18,17,21,30]})
df

Out[1]:
  AAA BBB CCC  DDD
0  x1  y1  t1   10
1  x1  y1  t2   11
2  x1  y2  t3   18
3  x2  y2  t1   17
4  x2  y2  t1   21
5  x2  y1  t1   30

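Problem

I want to group by column AAA, which gives me two groups: x1 and x2.

For each group, I want to compute the ratio of y1 to y2 in column BBB.

Then I want to assign the result to a new column, Ratio of BBB.

Desired output

This is the output I am after.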

pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
              "BBB":["y1","y1","y2","y2","y2","y1"],
              "CCC":["t1","t2","t3","t1","t1","t1"],
              "DDD":[10,11,18,17,21,30],
              "Ratio of BBB":[0.33,0.33,0.33,0.66,0.66,0.66]})

Out[2]:
  AAA BBB CCC  DDD  Ratio of BBB
0  x1  y1  t1   10          0.33
1  x1  y1  t2   11          0.33
2  x1  y2  t3   18          0.33
3  x2  y2  t1   17          0.66
4  x2  y2  t1   21          0.66
5  x2  y1  t1   30          0.66

Current state

So far I have achieved this as follows:

def f(df):
  df["y1"] = sum(df["BBB"] == "y1")
  df["y2"] = sum(df["BBB"] == "y2")
  df["Ratio of BBB"] = df["y2"] / df["y1"]
  return df

df.groupby(df.AAA).apply(f)

What I want to achieve

Is it possible to do this with the .pipe() function?

I was thinking of something like this:

df = (df
 .groupby(df.AAA) # groupby a column not included in the current series (df.colname)
 .BBB
 .value_counts()
 .pipe(lambda series: series["BBB"] == "y2" / series["BBB"] == "y1")
 )

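Edit: one solution using `pipe()`

Note: user jpp made a clear point in a comment below:

unstack / merge / reset_index operations are unnecessary and expensive

However, this is the approach I originally had in mind, so I thought I would share it here!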

df = (df
      .groupby(df.AAA)                     # groupby the column
      .BBB                                 # select the column with values to calculate ('BBB' with y1 & y2)
      .value_counts()                      # calculate the values (# of y1 per group, # of y2 per group)
      .unstack()                           # turn the rows into columns (y1, y2)
      .pipe(lambda df: df["y1"]/df["y2"])  # calculate the ratio of y1:y2 (outputs a Series)
      .rename("ratio")                     # rename the series 'ratio' so it will be ratio column in output df
      .reset_index()                       # turn the groupby series into a dataframe
      .merge(df)                           # merge with the original dataframe filling in the columns with the key (AAA)
      )
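
As a rough sketch of how the comment above could be acted on while still using `.pipe()`, the variant below maps the per-group ratio back onto `AAA` instead of calling `reset_index()` and `merge()`. The helper name `y1_to_y2` and the variable `result` are illustrative and not from the original post; the ratio is y1/y2, as in the snippet above.

```python
import pandas as pd

df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1"],
                   "CCC": ["t1", "t2", "t3", "t1", "t1", "t1"],
                   "DDD": [10, 11, 18, 17, 21, 30]})

def y1_to_y2(counts):
    # counts: one row per AAA group, one column per BBB value
    return counts["y1"] / counts["y2"]

ratio = (df
         .groupby("AAA")["BBB"]
         .value_counts()          # counts per (AAA, BBB) pair
         .unstack(fill_value=0)   # BBB values become columns
         .pipe(y1_to_y2))         # Series indexed by AAA

# Look up each row's group ratio by its AAA value; no merge needed.
result = df.assign(ratio=df["AAA"].map(ratio))
```

`Series.map` with a Series argument looks each `AAA` value up in the index of `ratio`, so the original row order and index are preserved.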

Best answer

It looks like what you actually want is the ratio of y2 to the total. Use groupby + value_counts:

v = df.groupby('AAA').BBB.value_counts().unstack()
df['RATIO'] = df.AAA.map(v.y2 / (v.y2 + v.y1))

  AAA BBB CCC  DDD     RATIO
0  x1  y1  t1   10  0.333333
1  x1  y1  t2   11  0.333333
2  x1  y2  t3   18  0.333333
3  x2  y2  t1   17  0.666667
4  x2  y2  t1   21  0.666667
5  x2  y1  t1   30  0.666667
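
To see why this works, it may help to look at the intermediate table `v`. The following is a minimal self-contained sketch, assuming the DataFrame from the question; the name `share_y2` is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1"],
                   "CCC": ["t1", "t2", "t3", "t1", "t1", "t1"],
                   "DDD": [10, 11, 18, 17, 21, 30]})

v = df.groupby("AAA")["BBB"].value_counts().unstack()
# v holds the counts, one row per AAA group and one column per BBB value:
# BBB  y1  y2
# AAA
# x1    2   1
# x2    1   2

# Share of y2 within each group: 1/3 for x1, 2/3 for x2.
share_y2 = v["y2"] / (v["y2"] + v["y1"])

# Broadcast the per-group value back to every row via its AAA key.
df["RATIO"] = df["AAA"].map(share_y2)
```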

To generalize to many groups (for example, more than two distinct values in BBB), you can use

df['RATIO'] = df.AAA.map(v.y2 / v.sum(axis=1))
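
As an alternative sketch that skips the intermediate table: the per-group mean of a boolean mask is the share of rows equal to `y2`. This is not part of the original answer, and the extra `y3` row below is made-up data, added only to show a third category.

```python
import pandas as pd

# Hypothetical data: same AAA/BBB columns plus one extra row with a third category.
df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2", "x2"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1", "y3"]})

# True where the row is y2; the mean of booleans is the fraction of True values.
is_y2 = df["BBB"].eq("y2")

# transform returns one value per original row, aligned with df's index.
df["RATIO"] = is_y2.groupby(df["AAA"]).transform("mean")
```

Because `transform` already returns a row-aligned Series, no `map` or `merge` step is needed to attach the result to the DataFrame.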

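This question, "python - pandas: compute the ratio of two categories per group and append it to the DataFrame as a new column using .pipe()", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/50892309/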
