I have the following DataFrame:
Country Crop Harvest Year Area (ha)
Afghanistan Maize 2019 94910
Afghanistan Maize 2020 140498
Afghanistan Maize 2021 92144
Afghanistan Winter Wheat 2019 2334000
Afghanistan Winter Wheat 2020 2668000
Afghanistan Winter Wheat 2021 1833357
Argentina Maize 2019 7232761
Argentina Maize 2020 7730506
Argentina Maize 2021 8146596
Argentina Winter Wheat 2019 6050953
Argentina Winter Wheat 2020 6729838
Argentina Winter Wheat 2021 6394102
China Maize 2019 41309740
China Maize 2020 41292000
China Maize 2021 43355859
China Winter Wheat 2019 23732560
China Winter Wheat 2020 23383000
China Winter Wheat 2021 23571400
Ethiopia Maize 2019 2274306
Ethiopia Maize 2020 2363507
Ethiopia Maize 2021 2530000
Ethiopia Winter Wheat 2019 1789372
Ethiopia Winter Wheat 2020 1829051
Ethiopia Winter Wheat 2021 1950000
France Maize 2019 1506100
France Maize 2020 1691130
France Maize 2021 1549520
France Winter Wheat 2019 5244250
France Winter Wheat 2020 4512420
France Winter Wheat 2021 5276730
India Maize 2019 9027130
India Maize 2020 9569060
India Maize 2021 9860000
India Winter Wheat 2019 29318780
India Winter Wheat 2020 31357020
India Winter Wheat 2021 31610000
Namibia Maize 2019 21123
Namibia Maize 2020 35000
Namibia Maize 2021 46070
Namibia Winter Wheat 2019 1079
Namibia Winter Wheat 2020 2000
Namibia Winter Wheat 2021 3026
For each crop, I want to select the top 2 countries by the mean of the Area (ha) column across harvest years. I tried this, but it doesn't work:
df = df.groupby("Crop", dropna=False).apply(lambda x: x.nlargest(2, "Area (ha)"))
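The output should look like this. Here China and India are the countries with the largest mean Area (ha) for both Maize and Winter Wheat, but in the full dataset different crops will have different top countries: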
Country Crop Harvest Year Area (ha)
China Maize 2019 41309740
China Maize 2020 41292000
China Maize 2021 43355859
China Winter Wheat 2019 23732560
China Winter Wheat 2020 23383000
China Winter Wheat 2021 23571400
India Maize 2019 9027130
India Maize 2020 9569060
India Maize 2021 9860000
India Winter Wheat 2019 29318780
India Winter Wheat 2020 31357020
India Winter Wheat 2021 31610000
Best answer
IIUC, you can call .groupby twice:
x = (
    df.groupby("Crop")
    .apply(lambda x: x.groupby("Country")["Area (ha)"].mean())
    .stack()
    .groupby(level=0, group_keys=False)
    .nlargest(2)
)
print(x)
This prints the top 2 crop/country pairs by mean area:
Crop          Country
Maize         China      4.198587e+07
              India      9.485397e+06
Winter Wheat  India      3.076193e+07
              China      2.356232e+07
dtype: float64
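As an alternative to the nested groupby, the same ranking can probably be computed with a single groupby on both keys. The following is only a sketch, not the answerer's method: it rebuilds a subset of the question's data (China, France, India) so that it runs on its own, and the names `areas`, `rows`, `means` and `top2` are made up for illustration. It sorts and takes `head(2)` per crop instead of calling `nlargest`.

```python
import pandas as pd

# Subset of the question's data: {(country, crop): areas for 2019, 2020, 2021}
areas = {
    ("China", "Maize"): [41309740, 41292000, 43355859],
    ("China", "Winter Wheat"): [23732560, 23383000, 23571400],
    ("France", "Maize"): [1506100, 1691130, 1549520],
    ("France", "Winter Wheat"): [5244250, 4512420, 5276730],
    ("India", "Maize"): [9027130, 9569060, 9860000],
    ("India", "Winter Wheat"): [29318780, 31357020, 31610000],
}
rows = [
    (country, crop, year, area)
    for (country, crop), values in areas.items()
    for year, area in zip((2019, 2020, 2021), values)
]
df = pd.DataFrame(rows, columns=["Country", "Crop", "Harvest Year", "Area (ha)"])

# Mean area per (Crop, Country) pair, computed in one groupby
means = df.groupby(["Crop", "Country"])["Area (ha)"].mean()

# Sort all means in descending order, then keep the first 2 entries of each crop
top2 = means.sort_values(ascending=False).groupby(level="Crop").head(2)
print(top2)
```

The result is indexed by (Crop, Country), so `top2.index` can be used for filtering in the same way as `x.index`. Note that the rows come out in overall descending order of mean area rather than grouped by crop.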
You can then use this index to filter the original DataFrame:
out = df.set_index(["Crop", "Country"]).loc[x.index].reset_index()
print(out)
Prints:
Crop Country Harvest Year Area (ha)
0 Maize China 2019 41309740
1 Maize China 2020 41292000
2 Maize China 2021 43355859
3 Maize India 2019 9027130
4 Maize India 2020 9569060
5 Maize India 2021 9860000
6 Winter Wheat India 2019 29318780
7 Winter Wheat India 2020 31357020
8 Winter Wheat India 2021 31610000
9 Winter Wheat China 2019 23732560
10 Winter Wheat China 2020 23383000
11 Winter Wheat China 2021 23571400
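Note that the set_index/loc approach moves Crop in front of Country and reorders the rows to follow the ranking. If the original column order and row order should be kept, as in the output requested in the question, a boolean mask built with `MultiIndex.isin` is one possible option. This is a sketch under assumptions: `top_pairs` is a hard-coded stand-in for `x.index` (copied from the printed ranking), the data is a subset of the question's (China, France, India), and all variable names are illustrative.

```python
import pandas as pd

# Subset of the question's data: {(country, crop): areas for 2019, 2020, 2021}
areas = {
    ("China", "Maize"): [41309740, 41292000, 43355859],
    ("China", "Winter Wheat"): [23732560, 23383000, 23571400],
    ("France", "Maize"): [1506100, 1691130, 1549520],
    ("France", "Winter Wheat"): [5244250, 4512420, 5276730],
    ("India", "Maize"): [9027130, 9569060, 9860000],
    ("India", "Winter Wheat"): [29318780, 31357020, 31610000],
}
rows = [
    (country, crop, year, area)
    for (country, crop), values in areas.items()
    for year, area in zip((2019, 2020, 2021), values)
]
df = pd.DataFrame(rows, columns=["Country", "Crop", "Harvest Year", "Area (ha)"])

# Stand-in for x.index: the (Crop, Country) pairs from the printed ranking
top_pairs = [
    ("Maize", "China"),
    ("Maize", "India"),
    ("Winter Wheat", "India"),
    ("Winter Wheat", "China"),
]

# Build the (Crop, Country) pair of every row and test membership in top_pairs
pair_of_each_row = pd.MultiIndex.from_frame(df[["Crop", "Country"]])
out = df[pair_of_each_row.isin(top_pairs)]
print(out)
```

Because this only masks rows, `out` keeps the original index labels; append `.reset_index(drop=True)` if a fresh 0..n-1 index is preferred.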
Source: a similar question on Stack Overflow, "python - Select top n groups in a pandas DataFrame": https://stackoverflow.com/questions/77392339/