python - Subtracting values and calculating percentages in a multi-index DataFrame

Tags: python pandas multi-index

I have a multi-index DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'B',
  8: 'B',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B'},
 ('category', ''): {0: 'Books',
  1: 'Candy',
  2: 'Pencil',
  3: 'Table',
  4: 'PC',
  5: 'Printer',
  6: 'Lamp',
  7: 'Books',
  8: 'Candy',
  9: 'Pencil',
  10: 'Table',
  11: 'PC',
  12: 'Printer',
  13: 'Lamp'},
 (pd.Timestamp('2021-06-28 00:00:00'),
  'Sales_1'): {0: 9.937449997200002, 1: 30.71300000639998, 2: 58.81199999639999, 3: 25.661999978399994, 4: 3.657999996, 5: 12.0879999972, 6: 61.16600000040001, 7: 6.319439989199998, 8: 12.333119997600003, 9: 24.0544100028, 10: 24.384659998799997, 11: 1.9992000012000002, 12: 0.324, 13: 40.69122000000001},
 (pd.Timestamp('2021-06-28 00:00:00'),
  'Sales_2'): {0: 21.890370397789923, 1: 28.300470581874837, 2: 53.52039700062155, 3: 52.425508769690694, 4: 6.384936971649232, 5: 6.807138946302334, 6: 52.172, 7: 5.916852561, 8: 5.810764652, 9: 12.1243325, 10: 17.88071596, 11: 0.913782413, 12: 0.869207661, 13: 20.9447844},
 (pd.Timestamp('2021-06-28 00:00:00'), 'last_week_sales'): {0: np.nan,
  1: np.nan,
  2: np.nan,
  3: np.nan,
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan,
  8: np.nan,
  9: np.nan,
  10: np.nan,
  11: np.nan,
  12: np.nan,
  13: np.nan},
 (pd.Timestamp('2021-06-28 00:00:00'), 'total_orders'): {0: 86.0,
  1: 66.0,
  2: 188.0,
  3: 556.0,
  4: 12.0,
  5: 4.0,
  6: 56.0,
  7: 90.0,
  8: 26.0,
  9: 49.0,
  10: 250.0,
  11: 7.0,
  12: 2.0,
  13: 44.0},
 (pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 4390.11,
  1: 24825.059999999998,
  2: 48592.39999999998,
  3: 60629.77,
  4: 831.22,
  5: 1545.71,
  6: 34584.99,
  7: 5641.54,
  8: 6798.75,
  9: 13290.13,
  10: 42692.68000000001,
  11: 947.65,
  12: 329.0,
  13: 29889.65},
 (pd.Timestamp('2021-07-05 00:00:00'),
  'Sales_1'): {0: 13.690399997999998, 1: 38.723000005199985, 2: 72.4443400032, 3: 36.75802000560001, 4: 5.691999996, 5: 7.206999998399999, 6: 66.55265999039996, 7: 6.4613199911999954, 8: 12.845630001599998, 9: 26.032340003999998, 10: 30.1634600016, 11: 1.0203399996, 12: 1.4089999991999997, 13: 43.67116000320002},
 (pd.Timestamp('2021-07-05 00:00:00'),
  'Sales_2'): {0: 22.874363860953647, 1: 29.5726042895728, 2: 55.926190956481534, 3: 54.7820864335212, 4: 6.671946105284065, 5: 7.113126469779095, 6: 54.517, 7: 6.194107518, 8: 6.083562133, 9: 12.69221484, 10: 18.71872129, 11: 0.956574175, 12: 0.910216433, 13: 21.92632044},
 (pd.Timestamp('2021-07-05 00:00:00'), 'last_week_sales'): {0: 4390.11,
  1: 24825.059999999998,
  2: 48592.39999999998,
  3: 60629.77,
  4: 831.22,
  5: 1545.71,
  6: 34584.99,
  7: 5641.54,
  8: 6798.75,
  9: 13290.13,
  10: 42692.68000000001,
  11: 947.65,
  12: 329.0,
  13: 29889.65},
 (pd.Timestamp('2021-07-05 00:00:00'), 'total_orders'): {0: 109.0,
  1: 48.0,
  2: 174.0,
  3: 587.0,
  4: 13.0,
  5: 5.0,
  6: 43.0,
  7: 62.0,
  8: 13.0,
  9: 37.0,
  10: 196.0,
  11: 8.0,
  12: 1.0,
  13: 33.0},
 (pd.Timestamp('2021-07-05 00:00:00'), 'total_sales'): {0: 3453.02,
  1: 17868.730000000003,
  2: 44707.82999999999,
  3: 60558.97999999999,
  4: 1261.0,
  5: 1914.6000000000001,
  6: 24146.09,
  7: 6201.489999999999,
  8: 5513.960000000001,
  9: 9645.87,
  10: 25086.785,
  11: 663.0,
  12: 448.61,
  13: 26332.7}}).set_index(['group','category'])

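For each date, I am trying to get a column equal to total_sales - Sales_2*1000, and to calculate how total_sales splits across the categories in percent, i.e. each category's total_sales divided by the sum for that week.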

What I have tried:

df['diff'] = df.loc[:,(slice(None),'total_sales')] - df.loc[:,(slice(None),'Sales_2')]

But I get

ValueError: Wrong number of items passed 4, placement implies 1

because this tries to put 4 columns into 1, instead of producing one result per date column. For the percentage of total_sales per category for each date, I tried:

df.loc[:,(slice(None),'total_sales')].groupby(level=['group','category']).apply(lambda x: 100 * x / x.sum())

But all the values are 100, so I'm not sure how to get a column next to each date, as shown below:

                2021-06-28 00:00:00                           2021-07-05 00:00:00
                total_sales      %_split     difference        total_sales          %_split     difference
group   category                            
A       Books   4,390.110        9%          ...                   3,453.020         ...        ...
        Candy   24,825.060       11%         ...                   17,868.730        ...        ...
        Pencil  48,592.400       10%         ...                   44,707.830        ...        ...
        Table   60,629.770       40%         ...                   60,558.980        ...        ...
        PC      831.220          3%          ...                   1,261.000         ...        ...
        Printer 1,545.710        7%          ...                   1,914.600         ...        ...
        Lamp    34,584.990       30%         ...                   24,146.090        ...        ...
B       Books   5,641.540        ...         ...                   6,201.490         ...        ...
        Candy   6,798.750        ...         ...                   5,513.960         ...        ...
        Pencil  13,290.130       ...         ...                   9,645.870         ...        ...
        Table   42,692.680       ...         ...                   25,086.785        ...        ...
        PC      947.650          ...         ...                   663.000           ...        ...
        Printer 329.000          ...         ...                   448.610           ...        ...
        Lamp    29,889.650       ...         ...                   26,332.700        ...        ...

Here difference is total_sales - Sales_2*1000. For readability I only included 2 columns; in reality I need all the columns that exist in df, plus the 2 additional columns for each date.
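For readers who hit the same two problems, here is a minimal sketch of why both attempts behave that way. It uses a hypothetical three-row frame named toy (not the data above) with the same column layout of (date, measure); the variable names are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical toy frame: 2 dates x 2 measures, indexed by (group, category).
cols = pd.MultiIndex.from_product(
    [pd.to_datetime(["2021-06-28", "2021-07-05"]), ["Sales_2", "total_sales"]]
)
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books")], names=["group", "category"]
)
toy = pd.DataFrame(
    [[1.0, 1500.0, 2.0, 2500.0],
     [3.0, 3500.0, 4.0, 4500.0],
     [5.0, 5500.0, 6.0, 6500.0]],
    index=idx, columns=cols,
)

# Attempt 1: both operands keep the measure level, so pandas aligns on the
# full (date, measure) label. No labels match, which yields the union of the
# columns (4 of them), all NaN. Assigning that to one column raises the error.
bad_diff = toy.loc[:, (slice(None), "total_sales")] - toy.loc[:, (slice(None), "Sales_2")]
print(bad_diff.shape)               # (3, 4)
print(bad_diff.isna().all().all())  # True

# xs drops the measure level, so the two operands align on the date alone.
total = toy.xs("total_sales", axis=1, level=1)
good_diff = total - toy.xs("Sales_2", axis=1, level=1) * 1000

# Attempt 2: grouping by both index levels puts every row in its own group,
# so each value is divided by itself and the result is always 100.
per_row = total.groupby(level=["group", "category"]).transform(lambda x: 100 * x / x.sum())

# Grouping by "group" only gives each category's share within its group.
share = 100 * total / total.groupby(level="group").transform("sum")
```

On this toy data, good_diff has one column per date, per_row is 100 everywhere, and share splits group A into 30% and 70% for the first date.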

Best answer

We can try

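# Move the date level from the columns into the index: (group, category, date)
s = df.stack(level=0)
s['diff'] = s.eval('total_sales - Sales_2 * 1000')

# Total sales per (group, date), broadcast back to every row
sales_per_group = s['total_sales'].groupby(level=[0, 2]).transform('sum')
s['split %']    = s['total_sales'] / sales_per_group * 100

# Move the dates and column names back into the columns, keeping NaN values
s = s.stack(dropna=False).unstack([2, 3])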

print(s)
                           2021-06-28 00:00:00                                                                              2021-07-05 00:00:00                                                                             
                           Sales_1    Sales_2 last_week_sales total_orders total_sales          diff    split %             Sales_1    Sales_2 last_week_sales total_orders total_sales          diff    split %
group category                                                                                                                                                                                                  
A     Books                9.93745  21.890370             NaN         86.0     4390.11 -17500.260398   2.502924            13.69040  22.874364         4390.11        109.0    3453.020 -19421.343861   2.243528
      Candy               30.71300  28.300471             NaN         66.0    24825.06  -3475.410582  14.153458            38.72300  29.572604        24825.06         48.0   17868.730 -11703.874290  11.609838
      Lamp                61.16600  52.172000             NaN         56.0    34584.99 -17587.010000  19.717865            66.55266  54.517000        34584.99         43.0   24146.090 -30370.910000  15.688422
      PC                   3.65800   6.384937             NaN         12.0      831.22  -5553.716972   0.473902             5.69200   6.671946          831.22         13.0    1261.000  -5410.946105   0.819309
      Pencil              58.81200  53.520397             NaN        188.0    48592.40  -4927.997001  27.703880            72.44434  55.926191        48592.40        174.0   44707.830 -11218.360956  29.047987
      Printer             12.08800   6.807139             NaN          4.0     1545.71  -5261.428946   0.881252             7.20700   7.113126         1545.71          5.0    1914.600  -5198.526470   1.243972
      Table               25.66200  52.425509             NaN        556.0    60629.77   8204.261230  34.566719            36.75802  54.782086        60629.77        587.0   60558.980   5776.893566  39.346944
B     Books                6.31944   5.916853             NaN         90.0     5641.54   -275.312561   5.664800             6.46132   6.194108         5641.54         62.0    6201.490      7.382482   8.392593
      Candy               12.33312   5.810765             NaN         26.0     6798.75    987.985348   6.826781            12.84563   6.083562         6798.75         13.0    5513.960   -569.602133   7.462146
      Lamp                40.69122  20.944784             NaN         44.0    29889.65   8944.865600  30.012883            43.67116  21.926320        29889.65         33.0   26332.700   4406.379560  35.636540
      PC                   1.99920   0.913782             NaN          7.0      947.65     33.867587   0.951557             1.02034   0.956574          947.65          8.0     663.000   -293.574175   0.897250
      Pencil              24.05441  12.124332             NaN         49.0    13290.13   1165.797500  13.344924            26.03234  12.692215        13290.13         37.0    9645.870  -3046.344840  13.053938
      Printer              0.32400   0.869208             NaN          2.0      329.00   -540.207661   0.330356             1.40900   0.910216          329.00          1.0     448.610   -461.606433   0.607112
      Table               24.38466  17.880716             NaN        250.0    42692.68  24811.964040  42.868699            30.16346  18.718721        42692.68        196.0   25086.785   6368.063710  33.950420
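As an alternative to the stack/unstack round trip, the two derived columns can also be built per date and concatenated back onto the frame. This is only a sketch and not the method from the answer above: it runs on a hypothetical toy frame with the same (date, measure) column layout, and names such as toy, extra and out are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for df: 2 dates x 2 measures.
cols = pd.MultiIndex.from_product(
    [pd.to_datetime(["2021-06-28", "2021-07-05"]), ["Sales_2", "total_sales"]]
)
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books")], names=["group", "category"]
)
toy = pd.DataFrame(
    [[1.0, 1500.0, 2.0, 2500.0],
     [3.0, 3500.0, 4.0, 4500.0],
     [5.0, 5500.0, 6.0, 6500.0]],
    index=idx, columns=cols,
)

# One column per date for each measure (the measure level is dropped).
total = toy.xs("total_sales", axis=1, level=1)
sales2 = toy.xs("Sales_2", axis=1, level=1)

# Both results keep one column per date.
diff = total - sales2 * 1000
split = 100 * total / total.groupby(level="group").transform("sum")

# Label the new frames, then swap levels so the columns read (date, measure).
extra = pd.concat({"diff": diff, "split %": split}, axis=1).swaplevel(axis=1)

# Append to the original frame and sort so each date's columns sit together.
out = pd.concat([toy, extra], axis=1).sort_index(axis=1)
print(out)
```

Because nothing is moved through the row index, the rows stay in their original order, whereas the stack/unstack output above comes back with the categories sorted alphabetically.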

Regarding "python - Subtracting values and calculating percentages in a multi-index DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68438326/
