python - Creating summary rows in a pivot table

Tags: python pandas

I have this DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'State': {0: "AZ", 1: "AZ", 2: "AZ", 4: "AZ", 5: "AK", 6: "AK", 7: "AK", 8: "AK"},
                   'City': {0: "A", 1: "A", 2: "B", 4: "B", 5: "C", 6: "C", 7: "D", 8: "D"},
                   'Area': {0: "North", 1: "South", 2: "North", 4: "South", 5: "North", 6: "South", 7: "North", 8: "South"},
                   'Restaurant': {0: "Rest1", 1: "Rest2", 2: "Rest3", 4: "Rest4", 5: "Rest5", 6: "Rest6", 7: "Rest7", 8: "Rest8"},
                   'Price': {0: 2343, 1: 23445, 2: 34536, 4: 7456, 5: 6584, 6: 64563, 7: 54745, 8: 436345}},
                  columns=['State', 'City', 'Area', 'Restaurant', 'Price'])

print(df)
  State City   Area Restaurant   Price
0    AZ    A  North      Rest1    2343
1    AZ    A  South      Rest2   23445
2    AZ    B  North      Rest3   34536
...

I also have the following pivot table:

pivo=pd.pivot_table(df,values=["Price"],
                columns=['State',"City", 'Area'],
                margins=True,
                aggfunc=[len, np.mean])
print(pivo)
                        len        mean
     State City Area                  
Price AK    C    North    1    6584.000
                 South    1   64563.000
            D    North    1   54745.000
                 South    1  436345.000
      AZ    A    North    1    2343.000
                 South    1   23445.000
            B    North    1   34536.000
                 South    1    7456.000
      All                 8   78752.125

I would like to compute "All" rows that aggregate each state and each city, so that the result looks like this:

                        len        mean
     State City Area                  
Price AK    All           4     281118.5
            C    All      2     35573.5
                 North    1    6584.000
                 South    1   64563.000
            D    All      2     245545
                 North    1   54745.000
                 South    1  436345.000
      ...

I have been playing with unstack/stack, but I haven't produced anything close.

Thanks!

Edit: this is the closest I have gotten:

pivo=pd.pivot_table(df,values=["Price"],
                index=['State'],
                columns=["City", 'Area'],
                margins=True,
                aggfunc=[len, np.mean])

                   len        mean
                 Price       Price
State City Area                   
AK    All          4.0  140559.000
      C    North   1.0    6584.000
           South   1.0   64563.000
      D    North   1.0   54745.000
           South   1.0  436345.000
AZ    A    North   1.0    2343.000
           South   1.0   23445.000
      All          4.0   16945.000
      B    North   1.0   34536.000
           South   1.0    7456.000
All   A    North   1.0    2343.000
           South   1.0   23445.000
      All          8.0   78752.125
      B    North   1.0   34536.000
           South   1.0    7456.000
      C    North   1.0    6584.000
           South   1.0   64563.000
      D    North   1.0   54745.000
           South   1.0  436345.000

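Best answer

Edit: I missed the fact that you also want the state margins in there. I'm keeping the original answer in case it is still useful. Scroll down for some hacky pandas.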

Does this help?

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'State': {0: "AZ", 1: "AZ", 2:"AZ", 4: "AZ", 5: "AK", 6: "AK", 7 : "AK", 8: "AK"},
   ...:                'City': {0: "A", 1: "A", 2:"B", 4: "B", 5: "C", 6: "C", 7 : "D", 8: "D"},
   ...:                'Area': {0: "North", 1: "South", 2:"North", 4: "South", 5: "North", 6: "South", 7 : "North", 8: "South"},
   ...:                'Restaurant': {0: "Rest1", 1: "Rest2", 2:"Rest3", 4: "Rest4", 5: "Rest5", 6: "Rest6", 7 : "Rest7", 8: "Rest8"},
   ...:                'Price': {0: 2343, 1: 23445, 2:34536, 4: 7456, 5: 6584, 6: 64563, 7 : 54745, 8: 436345}},
   ...:                columns=['State','City','Area','Restaurant','Price'])

In [4]: pv = (df.pivot_table(index=['State', 'City'],
   ...:                    columns=['Area'],
   ...:                    values=['Price'],
   ...:                    margins=True,
   ...:                    aggfunc=[len, np.mean]))

In [5]: pv
Out[5]:
             len                mean
           Price               Price
Area       North South  All    North     South         All
State City
AK    C      1.0   1.0  2.0   6584.0   64563.0   35573.500
      D      1.0   1.0  2.0  54745.0  436345.0  245545.000
AZ    A      1.0   1.0  2.0   2343.0   23445.0   12894.000
      B      1.0   1.0  2.0  34536.0    7456.0   20996.000
All          4.0   4.0  8.0  24552.0  132952.0   78752.125

In [6]: pv.stack()
Out[6]:
                   len        mean
                 Price       Price
State City Area
AK    C    All     2.0   35573.500
           North   1.0    6584.000
           South   1.0   64563.000
      D    All     2.0  245545.000
           North   1.0   54745.000
           South   1.0  436345.000
AZ    A    All     2.0   12894.000
           North   1.0    2343.000
           South   1.0   23445.000
      B    All     2.0   20996.000
           North   1.0   34536.000
           South   1.0    7456.000
All        All     8.0   78752.125
           North   4.0   24552.000
           South   4.0  132952.000

As a one-liner:

In [7]: pv = (df.pivot_table(index=['State', 'City'],
   ...:                    columns=['Area'],
   ...:                    values=['Price'],
   ...:                    margins=True,
   ...:                    aggfunc=[len, np.mean])
   ...:       .stack())

In [8]: pv
Out[8]:
                   len        mean
                 Price       Price
State City Area
AK    C    All     2.0   35573.500
           North   1.0    6584.000
           South   1.0   64563.000
      D    All     2.0  245545.000
           North   1.0   54745.000
           South   1.0  436345.000
AZ    A    All     2.0   12894.000
           North   1.0    2343.000
           South   1.0   23445.000
      B    All     2.0   20996.000
           North   1.0   34536.000
           South   1.0    7456.000
All        All     8.0   78752.125
           North   4.0   24552.000
           South   4.0  132952.000
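
The one-liner above can also be sketched as a plain script. This is a hedged variant rather than the answer's exact code: it assumes the string aggregations "count" and "mean" in place of `len` and `np.mean` (recent pandas releases warn when NumPy callables are passed as aggregations, as far as I can tell), it rebuilds the frame from plain lists, and it drops the unused Restaurant column. "count" skips missing values, so it only matches `len` here because no Price is missing.

```python
import pandas as pd

# Same data as in the question, rebuilt from lists (Restaurant omitted).
df = pd.DataFrame({
    "State": ["AZ"] * 4 + ["AK"] * 4,
    "City": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Area": ["North", "South"] * 4,
    "Price": [2343, 23445, 34536, 7456, 6584, 64563, 54745, 436345],
})

# Pivot with Area in the columns so that margins=True adds an "All" column
# per (State, City) row, then stack Area back into the row index.
city_margins = (
    df.pivot_table(index=["State", "City"],
                   columns="Area",
                   values="Price",
                   aggfunc=["count", "mean"],
                   margins=True)
      .stack()
)

print(city_margins)
```

Because `values` is a single column name here, the result has flat "count" and "mean" columns instead of the extra "Price" level shown in the transcript.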

Adding the state margins is a bit of a hassle and not elegant at all. I would love to see improvements on this.

In [9]: pv = (df.pivot_table(index=['State', 'City'],
   ...:                    columns=['Area'],
   ...:                    values=['Price'],
   ...:                    margins=True,
   ...:                    aggfunc=[len, np.mean]))

In [10]: state_agg = (df[['Price', 'State']]
    ...:              .pivot_table(index='State', aggfunc=[len, np.mean], margins=True)
    ...:              .assign(City= 'state_margin').assign(Area="")
    ...:              )
    ...: state_agg.loc['All', 'City'] = 'total'
    ...:

In [11]: state_agg
Out[11]:
        len        mean          City Area
      Price       Price
State
AK      4.0  140559.000  state_margin
AZ      4.0   16945.000  state_margin
All     8.0   78752.125         total

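The iloc[0:-1] below drops the margin row (the last row, labelled "All") from the first pivot table, so that the grand total is not added twice when the two tables are concatenated.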

In [12]: results = (pd.concat([pv.iloc[0:-1].stack().reset_index(),
    ...:            state_agg.reset_index()
    ...:            ])
    ...:  ).set_index(['State', 'City', 'Area']).sort_index()

In [13]: results
Out[13]:
                           len        mean
                         Price       Price
State City         Area
AK    C            All     2.0   35573.500
                   North   1.0    6584.000
                   South   1.0   64563.000
      D            All     2.0  245545.000
                   North   1.0   54745.000
                   South   1.0  436345.000
      state_margin         4.0  140559.000
AZ    A            All     2.0   12894.000
                   North   1.0    2343.000
                   South   1.0   23445.000
      B            All     2.0   20996.000
                   North   1.0   34536.000
                   South   1.0    7456.000
      state_margin         4.0   16945.000
All   total                8.0   78752.125

In [14]: idx = pd.IndexSlice
    ...: results.loc[idx[:, 'state_margin'], :]
    ...:
Out[14]:
                          len      mean
                        Price     Price
State City         Area
AK    state_margin        4.0  140559.0
AZ    state_margin        4.0   16945.0
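
A possibly simpler route to the same kind of table is to aggregate once per grouping level with groupby and concatenate the pieces. The sketch below is an assumption-laden alternative, not the method used above: the helper name `summarize`, the "All" placeholder label, and the "count"/"mean" string aggregations are all choices made for illustration, and the grand-total row is left out.

```python
import pandas as pd

# Same data as in the question, rebuilt from lists (Restaurant omitted).
df = pd.DataFrame({
    "State": ["AZ"] * 4 + ["AK"] * 4,
    "City": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Area": ["North", "South"] * 4,
    "Price": [2343, 23445, 34536, 7456, 6584, 64563, 54745, 436345],
})

ALL_KEYS = ["State", "City", "Area"]


def summarize(frame, keys):
    """Aggregate Price by `keys` and label the collapsed levels 'All'.

    Hypothetical helper written for this sketch.
    """
    out = frame.groupby(keys)["Price"].agg(["count", "mean"]).reset_index()
    for col in ALL_KEYS:
        if col not in out.columns:
            out[col] = "All"
    return out


# One aggregation per level of detail: area rows, city subtotals, state subtotals.
levels = [["State", "City", "Area"], ["State", "City"], ["State"]]
summary = (
    pd.concat([summarize(df, keys) for keys in levels], ignore_index=True)
      .set_index(ALL_KEYS)
      .sort_index()
)

print(summary)
```

Since the rows are sorted lexically, the "All" label sorts between cities "A" and "B" for AZ. Picking a different placeholder, or ordering the rows explicitly, would be needed to force the subtotal rows to the top of each group.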

Source: this question, python - Creating summary rows in a pivot table, was found on Stack Overflow: https://stackoverflow.com/questions/39520069/
