python - Applying calculations to filtered values in a Pandas DataFrame

Tags: python pandas dataframe

I am new to Pandas.

Consider this as my DataFrame:

df

Search              Impressions     Clicks      Transactions    ContainsBest       ContainsFree         Country
Best phone          10              5           1               True               False                UK
Best free phone     15              4           2               True               True                 UK
free phone          20              3           4               False              True                 UK
good phone          13              1           5               False              False                US
just a free phone   12              3           4               False              True                 US

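I have the columns ContainsBest and ContainsFree. I want to sum Impressions, Clicks and Transactions where ContainsBest is True, then sum Impressions, Clicks and Transactions where ContainsFree is True, and do the same for each unique value in the Country column. So the new DataFrame would look like this: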

output_df

Country             Impressions     Clicks      Transactions
UK                  45              12          7
ContainsBest        25              9           3
ContainsFree        35              7           6

US                  25              4           9
ContainsBest        0               0           0
ContainsFree        12              3           4

To do this, I understand I need to use something like the following:

uk_toal_impressions = df['Impressions'].sum().where(df['Country']=='UK')

uk_best_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsBest'])

uk_free_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsFree'])

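Then I would apply the same logic to Clicks and Transactions, and repeat the same code for Country US.

The second thing I am trying to achieve is to add TopCategoriesForxx columns for Impressions, Clicks and Transactions for each Country, so that my final_output_df would look like this: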

final_output_df

Country             Impressions     Clicks      Transactions        TopCategoriesForImpressions     TopCategoriesForClicks          TopCategoriesForTransactions     
UK                  45              12          7                   ContainsFree                    ContainsBest                    ContainsFree
ContainsBest        25              9           3                   ContainsBest                    ContainsFree                    ContainsBest
ContainsFree        35              7           6

US                  25              4           9                   ContainsFree                    ContainsFree                    ContainsFree
ContainsBest        0               0           0
ContainsFree        12              3           4

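The TopCategoriesForxx column logic is a simple sort of the ContainsBest and ContainsFree rows under each Country. So TopCategoriesForImpressions for the UK is:

  1. ContainsFree
  2. ContainsBest

While TopCategoriesForClicks for the UK is:

  1. ContainsBest
  2. ContainsFree

I understand I need to use something like this: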

TopCategoriesForImpressions = output_df['Impressions'].sort_values(by='Impressions', ascending=False).where(output_df['Country']=='UK')

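I am finding it difficult to make everything look like the final_output_df above. Also, I assume I do not need to create output_df; I only added it to make the steps toward final_output_df easier to follow.

So my questions are:

  1. How do I apply a calculation based on one or more conditions? See the ContainsBest and ContainsFree rows.
  2. How do I sort column values based on a condition? See TopCategoriesForImpressions.
  3. In reality I have 70 countries and 20 Containsxxx columns. Is there a way to achieve this without writing a condition for each of the 70 countries and 20 Containsxxx columns?

Your suggestions are highly appreciated.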

Best answer

The first part of the solution should be:

import pandas as pd

# remove the unnecessary Search column and add a ContainAll column filled with True
df1 = df.drop(columns='Search').assign(ContainAll=True)

#columns for tests
cols1 = ['Impressions','Clicks','Transactions']
cols2 = ['ContainsBest','ContainsFree','ContainAll']

print (df1[cols2].dtypes)
ContainsBest    bool
ContainsFree    bool
ContainAll      bool
dtype: object

print (df1[cols1].dtypes)
Impressions     int64
Clicks          int64
Transactions    int64
dtype: object
print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask'))
   Country  Impressions  Clicks  Transactions          Type   mask
0       UK           10       5             1  ContainsBest   True
1       UK           15       4             2  ContainsBest   True
2       UK           20       3             4  ContainsBest  False
3       US           13       1             5  ContainsBest  False
4       US           12       3             4  ContainsBest  False
5       UK           10       5             1  ContainsFree  False
6       UK           15       4             2  ContainsFree   True
7       UK           20       3             4  ContainsFree   True
8       US           13       1             5  ContainsFree  False
9       US           12       3             4  ContainsFree   True
10      UK           10       5             1    ContainAll   True
11      UK           15       4             2    ContainAll   True
12      UK           20       3             4    ContainAll   True
13      US           13       1             5    ContainAll   True
14      US           12       3             4    ContainAll   True

print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask').query('mask'))
   Country  Impressions  Clicks  Transactions          Type  mask
0       UK           10       5             1  ContainsBest  True
1       UK           15       4             2  ContainsBest  True
6       UK           15       4             2  ContainsFree  True
7       UK           20       3             4  ContainsFree  True
9       US           12       3             4  ContainsFree  True
10      UK           10       5             1    ContainAll  True
11      UK           15       4             2    ContainAll  True
12      UK           20       3             4    ContainAll  True
13      US           13       1             5    ContainAll  True
14      US           12       3             4    ContainAll  True
#all possible combinations of Country and boolean columns
mux = pd.MultiIndex.from_product([df['Country'].unique(), cols2], 
                                  names=['Country','Type'])

#reshape with melt so all boolean columns become one mask column
#keep only the True rows with query and aggregate with sum
#add the missing all-zero rows with reindex
df1 = (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask')
          .query('mask')
          .drop('mask', axis=1)
          .groupby(['Country','Type'])
          .sum()
          .reindex(mux, fill_value=0)
          .reset_index())
print (df1)
  Country          Type  Impressions  Clicks  Transactions
0      UK  ContainsBest           25       9             3
1      UK  ContainsFree           35       7             6
2      UK    ContainAll           45      12             7
3      US  ContainsBest            0       0             0
4      US  ContainsFree           12       3             4
5      US    ContainAll           25       4             9
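
For a single Country and flag pair, the totals above can be cross-checked with a plain boolean mask. This is only a minimal sketch: the DataFrame is rebuilt from the question's sample data, and the names uk_best and uk_best_totals are made up for illustration. It also shows why the attempt in the question fails: the rows have to be filtered before calling sum, and each comparison needs its own parentheses when combined with &.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Search': ['Best phone', 'Best free phone', 'free phone',
               'good phone', 'just a free phone'],
    'Impressions': [10, 15, 20, 13, 12],
    'Clicks': [5, 4, 3, 1, 3],
    'Transactions': [1, 2, 4, 5, 4],
    'ContainsBest': [True, True, False, False, False],
    'ContainsFree': [False, True, True, False, True],
    'Country': ['UK', 'UK', 'UK', 'US', 'US'],
})

cols1 = ['Impressions', 'Clicks', 'Transactions']

# The comparison is wrapped in parentheses because & binds tighter than ==.
uk_best = (df['Country'] == 'UK') & df['ContainsBest']

# Filter the rows first, then sum the metric columns.
uk_best_totals = df.loc[uk_best, cols1].sum()
print(uk_best_totals)
```

The printed values should agree with the UK / ContainsBest row of df1 (25, 9 and 3).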

Second, you can filter the rows to be ranked and sort them in descending order per group with numpy.argsort:

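def f(x):
    # x holds one Country's rows, indexed by Type
    i = x.index.to_numpy()
    # negate the values so argsort ranks each column in descending order
    a = i[(-x.to_numpy()).argsort(axis=0)]
    return pd.DataFrame(a, columns=x.columns)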


df2 = (df1[df1['Type'].isin(['ContainsBest','ContainsFree']) &
          ~df1[cols1].eq(0).all(axis=1)]
           .set_index('Type')
           .groupby('Country')[cols1]
           .apply(f)
           .add_prefix('TopCategoriesFor')
           .rename_axis(['Country','Type'])
           .rename({0:'ContainsBest', 1:'ContainsFree'})
)
print (df2)
                     TopCategoriesForImpressions TopCategoriesForClicks  \
Country Type                                                              
UK      ContainsBest                ContainsFree           ContainsBest   
        ContainsFree                ContainsBest           ContainsFree   
US      ContainsBest                ContainsFree           ContainsFree   

                     TopCategoriesForTransactions  
Country Type                                       
UK      ContainsBest                 ContainsFree  
        ContainsFree                 ContainsBest  
US      ContainsBest                 ContainsFree  
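
The ranking step inside f can be looked at in isolation. The sketch below uses hand-typed arrays that mirror the UK rows of df1 (the names labels, values, order and ranked are illustrative, not part of the solution). Negating the values before calling argsort gives positions in descending order for each column, and indexing the label array with those positions turns them into category names.

```python
import numpy as np

# Hand-typed copy of the UK rows of df1, for illustration only.
labels = np.array(['ContainsBest', 'ContainsFree'])
values = np.array([[25, 9, 3],    # ContainsBest: Impressions, Clicks, Transactions
                   [35, 7, 6]])   # ContainsFree: Impressions, Clicks, Transactions

# argsort sorts ascending, so sorting the negated values column by column
# (axis=0) yields row positions from largest to smallest.
order = (-values).argsort(axis=0)

# Fancy indexing maps each position back to its category label.
ranked = labels[order]
print(ranked)
```

The first row of ranked holds the top category for each metric and the second row holds the runner-up, which is the content of the UK block in df2.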
df3 = df1.join(df2, on=['Country','Type'])
print (df3)
  Country          Type  Impressions  Clicks  Transactions  \
0      UK  ContainsBest           25       9             3   
1      UK  ContainsFree           35       7             6   
2      UK    ContainAll           45      12             7   
3      US  ContainsBest            0       0             0   
4      US  ContainsFree           12       3             4   
5      US    ContainAll           25       4             9   

  TopCategoriesForImpressions TopCategoriesForClicks  \
0                ContainsFree           ContainsBest   
1                ContainsBest           ContainsFree   
2                         NaN                    NaN   
3                ContainsFree           ContainsFree   
4                         NaN                    NaN   
5                         NaN                    NaN   

  TopCategoriesForTransactions  
0                 ContainsFree  
1                 ContainsBest  
2                          NaN  
3                 ContainsFree  
4                          NaN  
5                          NaN  
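
For the third question (many countries and many Containsxxx columns), nothing in the approach depends on listing the countries, and the flag columns can probably be discovered rather than typed out. The sketch below assumes that every flag column name starts with 'Contains' and that the metric columns are known; flag_cols, metric_cols, long_df and totals are names chosen for this example only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Search': ['Best phone', 'Best free phone', 'free phone',
               'good phone', 'just a free phone'],
    'Impressions': [10, 15, 20, 13, 12],
    'Clicks': [5, 4, 3, 1, 3],
    'Transactions': [1, 2, 4, 5, 4],
    'ContainsBest': [True, True, False, False, False],
    'ContainsFree': [False, True, True, False, True],
    'Country': ['UK', 'UK', 'UK', 'US', 'US'],
})

metric_cols = ['Impressions', 'Clicks', 'Transactions']

# Assumption: all flag columns share the 'Contains' prefix.
flag_cols = [c for c in df.columns if c.startswith('Contains')]

# One row per (original row, flag column); Search is dropped because it is
# neither an id column nor a value column.
long_df = df.melt(id_vars=['Country'] + metric_cols, value_vars=flag_cols,
                  var_name='Type', value_name='mask')

# Keep the rows where the flag is True and sum per Country and Type.
totals = (long_df[long_df['mask'].astype(bool)]
          .groupby(['Country', 'Type'])[metric_cols]
          .sum())

# Add all-zero rows for Country/Type pairs with no matching rows.
mux = pd.MultiIndex.from_product([df['Country'].unique(), flag_cols],
                                 names=['Country', 'Type'])
totals = totals.reindex(mux, fill_value=0)
print(totals)
```

With this layout, adding another country or another Containsxxx column to df needs no change to the code, as long as the naming assumption holds.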

Regarding python - Applying calculations to filtered values in a Pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59682482/
