python - Pandas sum above all possible thresholds

I have a dataset containing two risk model scores and observations that each carry a certain value. Something like this:

import pandas as pd
df = pd.DataFrame(data={'segment':['A','A','A','A','A','A','A','B','B','B','B','B'],
                      'model1':[9,4,5,2,9,7,7,8,8,5,6,3],
                      'model2':[9,8,2,4,6,8,8,7,7,7,4,4],
                      'dollars':[15,10,-5,-7,6,7,-2,5,7,3,-1,-3]},
                      columns=['segment','model1','model2','dollars'])
print(df)

   segment  model1  model2  dollars
0        A       9       9       15
1        A       4       8       10
2        A       5       2       -5
3        A       2       4       -7
4        A       9       6        6
5        A       7       8        7
6        A       7       8       -2
7        B       8       7        5
8        B       8       7        7
9        B       5       7        3
10       B       6       4       -1
11       B       3       4       -3

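My goal is to determine the simultaneous risk model thresholds at which value is maximized, i.e. a cutoff of the form (model1 >= X) & (model2 >= Y). Both risk models are rank ordered, so higher numbers mean lower risk and generally higher value.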
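To make the target quantity concrete, here is a minimal sketch that evaluates a single candidate cutoff pair on the sample data. The helper name threshold_sum and its arguments are hypothetical, introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
    'model1':  [9,4,5,2,9,7,7,8,8,5,6,3],
    'model2':  [9,8,2,4,6,8,8,7,7,7,4,4],
    'dollars': [15,10,-5,-7,6,7,-2,5,7,3,-1,-3],
})

def threshold_sum(frame, segment, x, y):
    """Sum dollars over rows of `segment` with model1 >= x and model2 >= y.

    Hypothetical helper for illustration; not part of pandas.
    """
    keep = (
        (frame['segment'] == segment)
        & (frame['model1'] >= x)
        & (frame['model2'] >= y)
    )
    return int(frame.loc[keep, 'dollars'].sum())

# Evaluate one candidate cutoff pair per segment.
example_a = threshold_sum(df, 'A', 4, 8)
example_b = threshold_sum(df, 'B', 5, 7)
print(example_a, example_b)
```

The problem is then to compute this quantity for every candidate (X, Y) pair within each segment and keep the pair with the largest sum.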

I was able to get the desired output using a loop approach:

df_sum = df.groupby(by=['segment','model1','model2'])['dollars'].agg(['sum']).rename(columns={'sum':'dollar_sum'}).reset_index()
df_sum.loc[:,'threshold_sum'] = 0

#this loop works but runs very slowly on my large dataframe
#calculate the sum of dollars for each combination of possible model score thresholds
for row in df_sum.itertuples():
    #subset the original df down to just the observations above the given model scores
    df_temp = df[((df['model1'] >= getattr(row,'model1')) & (df['model2'] >= getattr(row,'model2')) & (df['segment'] == getattr(row,'segment')))]
    #calculate the sum and add it back to the dataframe
    df_sum.loc[row.Index,'threshold_sum'] = df_temp['dollars'].sum()

#show the max value for each segment
print(df_sum.loc[df_sum.groupby(by=['segment'])['threshold_sum'].idxmax()])

  segment  model1  model2  dollar_sum  threshold_sum
1       A       4       8          10             30
7       B       5       7           3             15

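As the dataframe grows, the loop runs extremely slowly. I'm sure there is a faster way to do this (maybe using cumsum() or numpy), but I'm stumped on what it is. Does anyone have a better approach? Ideally, any code would extend easily to n risk models and output all possible threshold_sum combinations, in case I add other optimization criteria in the future.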

Best answer

Using the same approach, you will get some speedup from apply(), but I agree with your intuition that there is probably a faster way.
Here is an apply() solution:

With df_sum as:

df_sum = (df.groupby(['segment','model1','model2'])
            .dollars
            .sum()
            .reset_index()
         )

print(df_sum)
  segment  model1  model2  dollars
0       A       2       4       -7
1       A       4       8       10
2       A       5       2       -5
3       A       7       8        5
4       A       9       6        6
5       A       9       9       15
6       B       3       4       -3
7       B       5       7        3
8       B       6       4       -1
9       B       8       7       12

apply can be combined with groupby:

def get_threshold_sum(row):
    return (df.loc[(df.segment == row.segment) & 
                   (df.model1 >= row.model1) & 
                   (df.model2 >= row.model2), 
                   ["segment","dollars"]]
              .groupby('segment')
              .sum()
              .dollars
           )

thresholds = df_sum.apply(get_threshold_sum, axis=1)
mask = thresholds.idxmax()

df_sum.loc[mask]
  segment  model1  model2  dollars
1       A       4       8       10
7       B       5       7        3

To see all possible thresholds, just print the thresholds DataFrame.
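The apply() version still filters df once per candidate row. As a possible alternative, here is a sketch that uses NumPy broadcasting to compare every observation against every candidate cutoff at once within each segment. This is not the answer's method; the names model_cols, threshold_sums and all_thresholds are assumptions made for this example. It builds a boolean array of size candidates x observations x models per segment, so very large segments may need to be processed in chunks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
    'model1':  [9,4,5,2,9,7,7,8,8,5,6,3],
    'model2':  [9,8,2,4,6,8,8,7,7,7,4,4],
    'dollars': [15,10,-5,-7,6,7,-2,5,7,3,-1,-3],
})

# Assumed name: extend this list to handle n risk models.
model_cols = ['model1', 'model2']

def threshold_sums(group):
    """Return every distinct score combination in `group` with its threshold_sum."""
    # One candidate cutoff per distinct combination of model scores.
    cand = group.groupby(model_cols, as_index=False)['dollars'].sum()
    scores = group[model_cols].to_numpy()   # shape (n_obs, n_models)
    cuts = cand[model_cols].to_numpy()      # shape (n_cand, n_models)
    # passes[i, j] is True when observation j meets every cutoff of candidate i.
    passes = (scores[None, :, :] >= cuts[:, None, :]).all(axis=2)
    # Row i of the matrix product sums dollars over the passing observations.
    cand['threshold_sum'] = passes @ group['dollars'].to_numpy()
    return cand

parts = []
for seg, grp in df.groupby('segment'):
    part = threshold_sums(grp)
    part.insert(0, 'segment', seg)
    parts.append(part)

# All candidate thresholds for all segments, then the best one per segment.
all_thresholds = pd.concat(parts, ignore_index=True)
best = all_thresholds.loc[all_thresholds.groupby('segment')['threshold_sum'].idxmax()]
print(best)
```

Because all_thresholds keeps every candidate combination, other optimization criteria can be applied to it later without recomputing the sums.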

Regarding python - Pandas sum above all possible thresholds, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45640532/
