My data looks like this:
import random
import pandas as pd

Region = [random.choice([1,2,3,4,5]) for x in range(100)]
Gender = [random.choice(['Male', 'Female']) for x in range(100)]
Balance = [random.random()*1000 for x in range(100)]
df = pd.DataFrame({'Region':Region, 'Gender':Gender, 'Balance':Balance})
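I want a multi-indexed DataFrame with the index (Region, Gender), so that I can call df.plot.box(vert=False) and get something like a plot I generated in R.
This seems like it should be simple, but I can't get the reshaping/indexing right.
Best answer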
import numpy as np
import pandas as pd

np.random.seed(23)
Region = np.random.choice([1,2,3,4,5], size=100)
Gender = np.random.choice(['Male', 'Female'], size=100)
Balance = np.random.rand(100)*1000
df = pd.DataFrame({'Region':Region, 'Gender':Gender, 'Balance':Balance})
print(df.head())
      Balance  Gender  Region
0  384.491355  Female       4
1  328.787350  Female       1
2  529.003182    Male       2
3   96.884964  Female       1
4   23.379931    Male       5
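I think you first need to concatenate Region and Gender into one key, then use cumcount to number the rows within each group so that pivot has a unique row position for every value: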
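idx = df['Region'].astype(str) + '.' + df['Gender']
cols = idx.groupby(idx).cumcount()
# current pandas expects column names here, so attach both helper Series to a copy first
df1 = df.assign(idx=idx, cols=cols).pivot(index='cols', columns='idx', values='Balance')
# drop the helper axis names so the printed frame has plain row and column labels
df1.index.name = None
df1.columns.name = None
print(df1.head())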
     1.Female      1.Male    2.Female      2.Male    3.Female      3.Male  \
0  328.787350  298.232904  888.262152  529.003182  959.644810  962.342645
1   96.884964  780.852785  738.040024  760.956146  119.652522  601.118950
2  910.707827  611.333680  116.517822  155.214746  140.653479  688.654958
3   50.119030  205.932674  148.848025  794.379306  380.307363  194.257663
4  263.554386  605.087006  953.241083  113.801236  778.912082  170.791317

     4.Female      4.Male    5.Female      5.Male
0  384.491355  122.347230  400.107360   23.379931
1  190.038651  564.785449  330.269653  998.586681
2  521.390446  757.714947  512.813561  185.192917
3  566.314099  939.538858  480.686727   80.862220
4  927.260017  175.496721  342.465179  287.932951
df1.plot.box(vert=False)
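To see why the cumcount step is needed, here is a minimal sketch on a five-row toy frame. The frame `toy` and the names `key`, `pos` and `wide` are made up for illustration and are not part of the original answer. Without a within-group counter, several rows would share the same column label and pivot would have no unique cell to put them in.

```python
import pandas as pd

# Toy data (hypothetical, not the data from the question).
toy = pd.DataFrame({
    'Region':  [1, 1, 1, 2, 2],
    'Gender':  ['Male', 'Female', 'Male', 'Male', 'Male'],
    'Balance': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One label per (Region, Gender) group, e.g. '1.Male'.
key = toy['Region'].astype(str) + '.' + toy['Gender']

# cumcount numbers the rows inside each group: 0, 1, 2, ...
pos = key.groupby(key).cumcount()

# Every (pos, key) pair is now unique, so pivot can give each Balance its own cell.
wide = toy.assign(key=key, pos=pos).pivot(index='pos', columns='key', values='Balance')

print(pos.tolist())
print(wide)
```

Groups with fewer rows than the largest group are padded with NaN, which the box plot should simply skip.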
Old solution:
It seems you need to reshape with groupby and aggregate the mean, or use pivot_table:
a = df.groupby(['Gender','Region'])['Balance'].mean().unstack()
#alternatively
#a = df.pivot_table(index='Gender', columns='Region', values='Balance', aggfunc='mean')
print (a)
Region           1           2           3           4           5
Gender
Female  357.970914  679.143664  442.473514  498.600391  618.475656
Male    531.211030  462.071729  470.280364  623.540595  362.917609
a.plot.box(vert=False)
b = df.groupby(['Region','Gender'])['Balance'].mean().unstack()
print (b)
Gender      Female        Male
Region
1       357.970914  531.211030
2       679.143664  462.071729
3       442.473514  470.280364
4       498.600391  623.540595
5       618.475656  362.917609
b.plot.box(vert=False)
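Note that averaging first leaves one number per (Region, Gender) pair, so these boxes summarise group means rather than individual balances. If the goal is the (Region, Gender) MultiIndex the question asks for, one possible sketch is to keep every row and move both keys into the columns. The names `pos` and `wide` are assumptions for illustration, and this is a variation on the cumcount idea, not code from the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(23)
df = pd.DataFrame({
    'Region': np.random.choice([1, 2, 3, 4, 5], size=100),
    'Gender': np.random.choice(['Male', 'Female'], size=100),
    'Balance': np.random.rand(100) * 1000,
})

# Position of each row inside its (Region, Gender) group.
pos = df.groupby(['Region', 'Gender']).cumcount()

# Rows become the within-group positions; Region and Gender move into
# a two-level column index, so no Balance value is aggregated away.
wide = (df.set_index([pos, 'Region', 'Gender'])['Balance']
          .unstack(['Region', 'Gender'])
          .sort_index(axis=1))

print(wide.columns.names)
print(wide.shape)
```

Calling `wide.plot.box(vert=False)` on this frame should then draw one horizontal box per (Region, Gender) column.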
Regarding "python - How to group and index a pandas DataFrame to get the desired box plot", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45648071/