I have a multi-index DataFrame df:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'B',
8: 'B',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B'},
('category', ''): {0: 'Books',
1: 'Candy',
2: 'Pencil',
3: 'Table',
4: 'PC',
5: 'Printer',
6: 'Lamp',
7: 'Books',
8: 'Candy',
9: 'Pencil',
10: 'Table',
11: 'PC',
12: 'Printer',
13: 'Lamp'},
(pd.Timestamp('2021-06-28 00:00:00'),
'Sales_1'): {0: 9.937449997200002, 1: 30.71300000639998, 2: 58.81199999639999, 3: 25.661999978399994, 4: 3.657999996, 5: 12.0879999972, 6: 61.16600000040001, 7: 6.319439989199998, 8: 12.333119997600003, 9: 24.0544100028, 10: 24.384659998799997, 11: 1.9992000012000002, 12: 0.324, 13: 40.69122000000001},
(pd.Timestamp('2021-06-28 00:00:00'),
'Sales_2'): {0: 21.890370397789923, 1: 28.300470581874837, 2: 53.52039700062155, 3: 52.425508769690694, 4: 6.384936971649232, 5: 6.807138946302334, 6: 52.172, 7: 5.916852561, 8: 5.810764652, 9: 12.1243325, 10: 17.88071596, 11: 0.913782413, 12: 0.869207661, 13: 20.9447844},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_week_sales'): {0: np.nan,
1: np.nan,
2: np.nan,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: np.nan},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_orders'): {0: 86.0,
1: 66.0,
2: 188.0,
3: 556.0,
4: 12.0,
5: 4.0,
6: 56.0,
7: 90.0,
8: 26.0,
9: 49.0,
10: 250.0,
11: 7.0,
12: 2.0,
13: 44.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 4390.11,
1: 24825.059999999998,
2: 48592.39999999998,
3: 60629.77,
4: 831.22,
5: 1545.71,
6: 34584.99,
7: 5641.54,
8: 6798.75,
9: 13290.13,
10: 42692.68000000001,
11: 947.65,
12: 329.0,
13: 29889.65},
(pd.Timestamp('2021-07-05 00:00:00'),
'Sales_1'): {0: 13.690399997999998, 1: 38.723000005199985, 2: 72.4443400032, 3: 36.75802000560001, 4: 5.691999996, 5: 7.206999998399999, 6: 66.55265999039996, 7: 6.4613199911999954, 8: 12.845630001599998, 9: 26.032340003999998, 10: 30.1634600016, 11: 1.0203399996, 12: 1.4089999991999997, 13: 43.67116000320002},
(pd.Timestamp('2021-07-05 00:00:00'),
'Sales_2'): {0: 22.874363860953647, 1: 29.5726042895728, 2: 55.926190956481534, 3: 54.7820864335212, 4: 6.671946105284065, 5: 7.113126469779095, 6: 54.517, 7: 6.194107518, 8: 6.083562133, 9: 12.69221484, 10: 18.71872129, 11: 0.956574175, 12: 0.910216433, 13: 21.92632044},
(pd.Timestamp('2021-07-05 00:00:00'), 'last_week_sales'): {0: 4390.11,
1: 24825.059999999998,
2: 48592.39999999998,
3: 60629.77,
4: 831.22,
5: 1545.71,
6: 34584.99,
7: 5641.54,
8: 6798.75,
9: 13290.13,
10: 42692.68000000001,
11: 947.65,
12: 329.0,
13: 29889.65},
(pd.Timestamp('2021-07-05 00:00:00'), 'total_orders'): {0: 109.0,
1: 48.0,
2: 174.0,
3: 587.0,
4: 13.0,
5: 5.0,
6: 43.0,
7: 62.0,
8: 13.0,
9: 37.0,
10: 196.0,
11: 8.0,
12: 1.0,
13: 33.0},
(pd.Timestamp('2021-07-05 00:00:00'), 'total_sales'): {0: 3453.02,
1: 17868.730000000003,
2: 44707.82999999999,
3: 60558.97999999999,
4: 1261.0,
5: 1914.6000000000001,
6: 24146.09,
7: 6201.489999999999,
8: 5513.960000000001,
9: 9645.87,
10: 25086.785,
11: 663.0,
12: 448.61,
13: 26332.7}}).set_index(['group','category'])
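For each date, I am trying to get a column equal to total_sales - Sales_2*1000, and to calculate how the categories split by total_sales in percent, i.e. each category's total_sales divided by the sum for that week.
What I tried: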
df['diff'] = df.loc[:,(slice(None),'total_sales')] - df.loc[:,(slice(None),'Sales_2')]
but I get
ValueError: Wrong number of items passed 4, placement implies 1
because this tries to put 4 columns into 1 column instead of producing one result column for each date. For the percentage of total_sales per category I tried:
df.loc[:,(slice(None),'total_sales')].groupby(level=['group','category']).apply(lambda x: 100 * x / x.sum())
but all the values are 100, so I am not sure how to get a column next to total_sales, as shown below:
                2021-06-28 00:00:00               2021-07-05 00:00:00
                total_sales  %_split  difference  total_sales  %_split  difference
group category
A     Books       4,390.110       9%         ...    3,453.020      ...         ...
      Candy      24,825.060      11%         ...   17,868.730      ...         ...
      Pencil     48,592.400      10%         ...   44,707.830      ...         ...
      Table      60,629.770      40%         ...   60,558.980      ...         ...
      PC            831.220       3%         ...    1,261.000      ...         ...
      Printer     1,545.710       7%         ...    1,914.600      ...         ...
      Lamp       34,584.990      30%         ...   24,146.090      ...         ...
B     Books       5,641.540      ...         ...    6,201.490      ...         ...
      Candy       6,798.750      ...         ...    5,513.960      ...         ...
      Pencil     13,290.130      ...         ...    9,645.870      ...         ...
      Table      42,692.680      ...         ...   25,086.785      ...         ...
      PC            947.650      ...         ...      663.000      ...         ...
      Printer       329.000      ...         ...      448.610      ...         ...
      Lamp       29,889.650      ...         ...   26,332.700      ...         ...
difference is total_sales - Sales_2*1000. For visibility I only included 2 columns; in reality I need all the columns present in df plus 2 additional columns for each date.
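A note on why both attempts behave this way, as a minimal sketch on a small hypothetical frame (the names toy, lhs, rhs and all numbers are made up for illustration, not taken from the question). pandas aligns on the full column label when subtracting two DataFrames, so the (date, total_sales) and (date, Sales_2) columns never pair up and the result holds the union of both label sets, filled with NaN; with two dates that is presumably where the 4 in the error message comes from. Likewise, grouping by both group and category puts every row in its own group, so each value divided by its own sum is 100. The percentage step below uses transform instead of apply to keep the sketch independent of version-specific apply behaviour.

```python
import pandas as pd

# Hypothetical toy data: 2 dates x 2 metrics, 3 rows.
dates = pd.to_datetime(["2021-06-28", "2021-07-05"])
cols = pd.MultiIndex.from_product([dates, ["Sales_2", "total_sales"]])
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books")],
    names=["group", "category"],
)
toy = pd.DataFrame(
    [[1.0, 1500.0, 2.0, 2500.0],
     [3.0, 2500.0, 1.0, 500.0],
     [2.0, 1000.0, 4.0, 5000.0]],
    index=idx,
    columns=cols,
)

lhs = toy.loc[:, (slice(None), "total_sales")]
rhs = toy.loc[:, (slice(None), "Sales_2")]

# Subtraction aligns on the full column label (date, metric); no labels match.
misaligned = lhs - rhs
n_cols = misaligned.shape[1]
all_nan = bool(misaligned.isna().all().all())

# (group, category) is unique per row, so every group has exactly one row.
group_sizes = lhs.groupby(level=["group", "category"]).size()
pct = lhs.groupby(level=["group", "category"]).transform(lambda x: 100 * x / x.sum())

print(n_cols, all_nan)
print(group_sizes.max())
print(pct)
```

On this toy frame n_cols is 4, every entry of misaligned is NaN, each group has a single row, and pct is 100 everywhere.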
Best answer
We can try
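# Move the date level of the columns into the row index: (group, category, date)
s = df.stack(level=0)
# Difference per row, i.e. per (group, category, date)
s['diff'] = s.eval('total_sales - Sales_2 * 1000')
# Total sales per group (level 0) and date (level 2)
sales_per_group = s['total_sales'].groupby(level=[0, 2]).transform('sum')
s['split %'] = s['total_sales'] / sales_per_group * 100
# Move the metric names into the index, then pivot date and metric back into the columns
s = s.stack(dropna=False).unstack([2, 3])
print(s)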
                2021-06-28 00:00:00                                                                        2021-07-05 00:00:00
                 Sales_1    Sales_2  last_week_sales  total_orders  total_sales           diff    split %   Sales_1    Sales_2  last_week_sales  total_orders  total_sales           diff    split %
group category
A     Books      9.93745  21.890370              NaN          86.0      4390.11  -17500.260398   2.502924  13.69040  22.874364          4390.11         109.0     3453.020  -19421.343861   2.243528
      Candy     30.71300  28.300471              NaN          66.0     24825.06   -3475.410582  14.153458  38.72300  29.572604         24825.06          48.0    17868.730  -11703.874290  11.609838
      Lamp      61.16600  52.172000              NaN          56.0     34584.99  -17587.010000  19.717865  66.55266  54.517000         34584.99          43.0    24146.090  -30370.910000  15.688422
      PC         3.65800   6.384937              NaN          12.0       831.22   -5553.716972   0.473902   5.69200   6.671946           831.22          13.0     1261.000   -5410.946105   0.819309
      Pencil    58.81200  53.520397              NaN         188.0     48592.40   -4927.997001  27.703880  72.44434  55.926191         48592.40         174.0    44707.830  -11218.360956  29.047987
      Printer   12.08800   6.807139              NaN           4.0      1545.71   -5261.428946   0.881252   7.20700   7.113126          1545.71           5.0     1914.600   -5198.526470   1.243972
      Table     25.66200  52.425509              NaN         556.0     60629.77    8204.261230  34.566719  36.75802  54.782086         60629.77         587.0    60558.980    5776.893566  39.346944
B     Books      6.31944   5.916853              NaN          90.0      5641.54    -275.312561   5.664800   6.46132   6.194108          5641.54          62.0     6201.490       7.382482   8.392593
      Candy     12.33312   5.810765              NaN          26.0      6798.75     987.985348   6.826781  12.84563   6.083562          6798.75          13.0     5513.960    -569.602133   7.462146
      Lamp      40.69122  20.944784              NaN          44.0     29889.65    8944.865600  30.012883  43.67116  21.926320         29889.65          33.0    26332.700    4406.379560  35.636540
      PC         1.99920   0.913782              NaN           7.0       947.65      33.867587   0.951557   1.02034   0.956574           947.65           8.0      663.000    -293.574175   0.897250
      Pencil    24.05441  12.124332              NaN          49.0     13290.13    1165.797500  13.344924  26.03234  12.692215         13290.13          37.0     9645.870   -3046.344840  13.053938
      Printer    0.32400   0.869208              NaN           2.0       329.00    -540.207661   0.330356   1.40900   0.910216           329.00           1.0      448.610    -461.606433   0.607112
      Table     24.38466  17.880716              NaN         250.0     42692.68   24811.964040  42.868699  30.16346  18.718721         42692.68         196.0    25086.785    6368.063710  33.950420
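If reshaping with stack/unstack is undesirable, the same two columns could also be computed without it. The following is a sketch of an alternative, not the method used in the answer above: it selects one metric across all dates with DataFrame.xs, computes the difference and the per-group share on those date-only frames, and attaches the results under each date with pd.concat. The frame toy and its numbers are hypothetical; the percentage is taken within each group and date, matching the groupby(level=[0, 2]) choice in the answer.

```python
import pandas as pd

# Hypothetical toy data: 2 dates x 2 metrics, 3 rows.
dates = pd.to_datetime(["2021-06-28", "2021-07-05"])
cols = pd.MultiIndex.from_product([dates, ["Sales_2", "total_sales"]])
idx = pd.MultiIndex.from_tuples(
    [("A", "Books"), ("A", "Candy"), ("B", "Books")],
    names=["group", "category"],
)
toy = pd.DataFrame(
    [[1.0, 1500.0, 2.0, 2500.0],
     [3.0, 2500.0, 1.0, 500.0],
     [2.0, 1000.0, 4.0, 5000.0]],
    index=idx,
    columns=cols,
)

# Select one metric for all dates; the metric level is dropped,
# so both frames share the same date-only column labels and align.
total = toy.xs("total_sales", axis=1, level=1)
sales2 = toy.xs("Sales_2", axis=1, level=1)

diff = total - sales2 * 1000
# Share of each category within its group, computed separately per date column.
split = total / total.groupby(level="group").transform("sum") * 100

# Rebuild (date, metric) labels for the new columns and append them.
extra = pd.concat({"diff": diff, "split %": split}, axis=1).swaplevel(axis=1)
out = pd.concat([toy, extra], axis=1).sort_index(axis=1)

print(out)
```

Because sort_index orders the columns lexicographically, the new columns end up grouped under their date rather than appended at the far right.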
Regarding python - subtracting values and calculating percentages in a multi-index DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68438326/