python - Fastest way to count value occurrences in a multi-index pandas DataFrame

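I have two multi-index DataFrames with several levels and columns. I am looking for the fastest way to go through the DataFrames and count, for each row, how many cells in each DataFrame are above a specific value, and then find the intersection of the rows that get at least one count in both DataFrames.

Right now I combine a for loop with groupby to iterate over the DataFrames, but it takes far too long to get the right answer (my real DataFrames contain thousands of levels and hundreds of columns), so I need another way to do this.

For example: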

import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']


values_list_a=[[1,2],[2,2],[2,1],[-8,1],[2,0],[2,1]]
DFA = pd.DataFrame(values_list_a, idx, col)

DFA:
                   column_1  column_2
index_1 index_2
  0       0            1        2
          1            2        2
          2            2        1
  1       0            -8       1
          1            2        0
          2            2        1

values_list_b=[[2,2],[0,1],[2,2],[2,2],[1,0],[1,2]]
DFB = pd.DataFrame(values_list_b, idx, col)

DFB:
                   column_1  column_2
index_1 index_2
  0       0            2        2
          1            0        1
          2            2        2
  1       0            2        2
          1            1        0
          2            1        2

My expected output is:

Step 1: count the occurrences (in this example, cells greater than 1 in each row):

DFA:
                   column_1  column_2 counts
index_1 index_2
  0       0            1        2       1
          1            2        2       2
          2            2        1       1
  1       0            -8       1       0
          1            2        0       1
          2            2        1       1

DFB:
                   column_1  column_2 counts
index_1 index_2
  0       0            2        2        2
          1            0        1        0
          2            2        2        2
  1       0            2        2        2
          1            1        0        0
          2            1        2        1
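The step 1 counts can be reproduced without an explicit loop. The following is a minimal sketch, assuming the threshold is 1 (which matches the counts shown) and rebuilding the example frames; the name `thresh` is only illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

thresh = 1  # assumed threshold, inferred from the expected counts

# Build a boolean frame of cells above the threshold, then sum across columns
counts_a = DFA.gt(thresh).sum(axis=1)
counts_b = DFB.gt(thresh).sum(axis=1)

print(counts_a.tolist())  # [1, 2, 1, 0, 1, 1]
print(counts_b.tolist())  # [2, 0, 2, 2, 0, 1]
```

Because the comparison and the sum are both vectorized, this should scale far better than a Python-level loop over rows.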

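Step 2: the intersection of the two DataFrames where counts > 0 should produce a new DataFrame like the one below. It keeps the rows of both DataFrames whose index gets at least one count in each DataFrame, and adds a new index_0 level: index_0 = 0 should refer to DFA and index_0 = 1 to DFB: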

DFC:
                            column_1  column_2 counts
  index_0 index_1 index_2
     0       0       0            1        2       1
                     2            2        1       1
             1       2            2        1       1

     1       0       0            2        2       2
                     2            2        2       2
             1       2            1        2       1
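One way to get exactly this layout, with integer keys 0 and 1 for index_0, is sketched below. It assumes both frames share the same index (as in the example) and a threshold of 1; the helper `with_counts` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)


def with_counts(df, thresh=1):
    """Return a copy of `df` with a `counts` column of cells above `thresh`."""
    return df.assign(counts=df.gt(thresh).sum(axis=1))


a = with_counts(DFA)
b = with_counts(DFB)

# Rows that have at least one count in BOTH frames (relies on a shared index)
mask = (a['counts'] > 0) & (b['counts'] > 0)

# Stack the surviving rows; keys 0 and 1 become the new outer level index_0
DFC = pd.concat([a[mask], b[mask]], keys=[0, 1], names=['index_0'])
print(DFC)
```

If the two frames did not share an index, the mask would need to be built on the intersection of their indexes first.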

Best answer

`pd.concat` then `magic`

def f(d, thresh=1):
    c = d.gt(thresh).sum(1)
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    return d.assign(counts=c)[mask]

pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)

                         column_1  column_2  counts
index_0 index_1 index_2                            
bar     0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
foo     0       0               2         2       2
                2               2         2       2
        1       2               1         2       1


With comments

def f(d, thresh=1):
    # count how many are greater than a threshold `thresh` per row
    c = d.gt(thresh).sum(1)

    # find where `counts` are > `0` for both dataframes
    # conveniently dropped into one dataframe so we can do
    # this nifty `groupby` trick
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    #                                    \-------/
    #                         This is key to broadcasting over 
    #                         original index rather than collapsing
    #                         over the index levels we grouped by

    #     create a new column named `counts`
    #         /------------\ 
    return d.assign(counts=c)[mask]
    #                         \--/
    #                    filter with boolean mask

# Use concat to smash two dataframes together into one
pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)
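The comment about `transform` broadcasting rather than collapsing can be seen on a small example. This is an illustrative sketch on made-up boolean flags, not part of the original answer; the index labels and the `flags` Series are assumptions.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'foo'], [0, 1]],
                                 names=['index_0', 'index_1'])
# Hypothetical flags: True where a row has at least one count
flags = pd.Series([True, True, True, False], index=idx)

# Plain aggregation collapses to one value per group key
collapsed = flags.groupby(level='index_1').all()
print(collapsed.tolist())   # [True, False]

# transform repeats each group's result on every original row,
# so the result can be used directly as a boolean mask
broadcast = flags.groupby(level='index_1').transform('all')
print(broadcast.tolist())   # [True, False, True, False]
```

Since `broadcast` keeps the original index, `flags[broadcast]` (or a DataFrame with the same index) can be filtered with it without any re-alignment step.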

Regarding "python - Fastest way to count value occurrences in a multi-index pandas DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56482811/
