python - Aggregate all DataFrame row pair combinations using pandas

Tags: python pandas aggregate combinations python-itertools

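I have been using python pandas to group and aggregate across data frames, but I now want to aggregate specific rows pairwise (n choose 2, i.e. statistical combinations). Here is example data, where I would like to look at all pairs of genes in [mygenes]: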

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The end product should look like this (applying np.sum by default is fine):

                 case1    case2    control1    control2
'ABC1', 'ABC2'    1         2         0            1
'ABC1', 'ABC3'    1         2         1            1
'ABC1', 'ABC4'    0         1         1            2
'ABC2', 'ABC3'    2         2         1            0
'ABC2', 'ABC4'    1         1         1            1
'ABC3', 'ABC4'    1         1         2            1

The set of gene pairs is easy to get with itertools (itertools.combinations(mygenes, 2)), but I can't figure out how to aggregate specific rows based on their values. Can anyone advise? Thanks.

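Best answer

I can't think of a clever vectorized way to do this, but unless performance is a real bottleneck I tend to use the simplest approach that makes sense. In this case, I would probably set_index("Gene") and then use loc to pick out the rows: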

>>> import pandas as pd
>>> from itertools import combinations
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes,2))
>>> out = pd.DataFrame([df.loc[c,:].sum() for c in cc], index=cc)
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1
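If the same pairwise aggregation is needed more than once, the loop above could be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name aggregate_pairs, its key and func parameters, and the index level names gene_a and gene_b are all made up for illustration and are not part of pandas or of the original answer. It selects each pair with list(pair) so the key is always treated as a list of row labels, and builds an explicit MultiIndex for the pairs.

```python
from itertools import combinations

import pandas as pd


def aggregate_pairs(frame, genes, key="Gene", func="sum"):
    """Aggregate the rows of every 2-combination of `genes`.

    Hypothetical helper: `key` names the label column and `func` is
    anything DataFrame.agg accepts (for example "sum", "max", "mean").
    """
    indexed = frame.set_index(key)
    pairs = list(combinations(genes, 2))
    # list(pair) selects both rows; agg collapses them into one Series.
    rows = [indexed.loc[list(pair)].agg(func) for pair in pairs]
    # The level names below are illustrative, not required by pandas.
    index = pd.MultiIndex.from_tuples(pairs, names=["gene_a", "gene_b"])
    return pd.DataFrame(rows, index=index)


# Sample data copied from the question.
mygenes = ["ABC1", "ABC2", "ABC3", "ABC4"]
df = pd.DataFrame({"Gene": ["ABC1", "ABC2", "ABC3", "ABC4", "ABC5"],
                   "case1": [0, 1, 1, 0, 0],
                   "case2": [1, 1, 1, 0, 1],
                   "control1": [0, 0, 1, 1, 1],
                   "control2": [1, 0, 0, 1, 0]})

out = aggregate_pairs(df, mygenes)
print(out)
```

Passing a different func, for example aggregate_pairs(df, mygenes, func="max"), should swap the aggregation without touching the pairing logic. This is still a Python-level loop over the pairs, so it is no faster than the version above; it only packages it.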

Regarding "python - Aggregate all DataFrame row pair combinations using pandas", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29777702/
