I have a time series dataset that looks like this:
Date Newspaper City1 City2 Region1Total City3 City4 Region2Total
2017-12-01 NewsPaper1 231563 8696 240259 21072 8998 30070
2017-12-01 NewsPaper2 173009 12180 185189 28910 5550 34460
2017-12-01 NewsPaper3 40511 4600 45111 5040 3330 8370
2017-12-01 NewsPaper4 37770 2980 40750 6520 1880 8400
2017-12-01 NewsPaper5 5176 900 6076 1790 5000 6790
2017-12-01 NewsPaper6 137650 8025 145675 25300 11000 36300
2017-12-01 Total 637547 38201 675748 91032 36558 127590
2018-01-01 NewsPaper1 231295 8391 239686 8790 21176 29966
2018-01-01 NewsPaper2 169937 12130 182067 7890 28850 36740
2018-01-01 NewsPaper3 40453 4570 45023 4750 5055 9800
2018-01-01 NewsPaper4 37766 2970 40736 2500 6540 9040
2018-01-01 NewsPaper5 5136 900 6036 5600 1795 7365
2018-01-01 NewsPaper6 137990 8010 146000 14500 25330 39830
2018-01-01 Total 633919 37786 671705 44980 91141 136121
I am trying to find the n largest values in each column of this DataFrame. I tried the following:
import pandas as pd

somelist = []
data = pd.read_csv('newspaper.csv')
data.index = pd.to_datetime(data['Date'], errors='coerce')
last_month = data.loc[data.index[-1]]  # consider only the latest month in the DataFrame
last_month = last_month.set_index('Newspaper')
for city in last_month.iloc[:, 1:]:  # skip the Date column
    top_3 = last_month[city].nlargest(4)[1:]  # the largest value is the Total row, so skip it
    somelist.append(top_3)
print(somelist)
This produces a list of pandas Series, each labelled with its column name, as shown below:
[Newspaper
Newspaper1 231295
Newspaper2 169937
Newspaper6 137990
Name: City1, dtype: float64, Newspaper
Newspaper2 12130.0
Newspaper1 8391.0
Newspaper6 8010.0
Name: City2, dtype: float64, Newspaper
Newspaper1 240259
Newspaper2 185189
Newspaper6 145675
Name: Region1Total, dtype: float64, Newspaper
Newspaper6 14500.0
Newspaper1 8790.0
Newspaper2 7890.0
Name: City3, dtype: float64, Newspaper
Newspaper2 28850.0
Newspaper6 25330.0
Newspaper1 21176.0
Name: City4, dtype: float64, Newspaper
Newspaper6 36300
Newspaper2 34460
Newspaper1 34460
Name: Region2Total, dtype: float64]
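What I want is the top three newspapers by sales for each city and region, with the figures sorted from largest to smallest. I would also like the name of the city/region printed before its top 3 results.
The expected output is a list or a Series like this: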
Newspaper City1
Newspaper1 231295
Newspaper2 169937
Newspaper6 137990
Newspaper City2
Newspaper2 12130.0
Newspaper1 8391.0
Newspaper6 8010.0
Newspaper Region1Total
Newspaper1 240259
Newspaper2 185189
Newspaper6 145675
Newspaper City3
Newspaper6 14500.0
Newspaper1 8790.0
Newspaper2 7890.0
Newspaper City4
Newspaper2 28850.0
Newspaper6 25330.0
Newspaper1 21176.0
Newspaper Region2Total
Newspaper6 36300
Newspaper2 34460
Newspaper1 34460
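Also, if I want to skip the regions and consider only the cities, how would I do that? Any help would be greatly appreciated. Thanks in advance.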
Best answer
First, you need a DataFrame that contains only the individual newspapers, without the Total rows.
dff = df.loc[df['Newspaper']!='Total']
Then, for City1, you can do this:
dff[['Newspaper', 'City1']].sort_values(['City1'], ascending=False).head(3)
Output:
Newspaper City1
0 NewsPaper1 231563
1 NewsPaper2 173009
5 NewsPaper6 137650
Similarly, you can get the results for every column of interest.
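To repeat this for every column at once, one possible approach is sketched below. It is not part of the original answer: the helper name `top_n_per_column` and its `skip_regions` flag are hypothetical, the sample rows are copied from the question, and region columns are assumed to be recognisable by the `Region` prefix used in that sample. The sketch keeps only the latest month, drops the Total row, and uses `Series.nlargest` instead of `sort_values(...).head(3)`.

```python
import io
import pandas as pd

# Sample rows copied from the question (whitespace-separated).
raw = """Date Newspaper City1 City2 Region1Total City3 City4 Region2Total
2017-12-01 NewsPaper1 231563 8696 240259 21072 8998 30070
2017-12-01 NewsPaper2 173009 12180 185189 28910 5550 34460
2017-12-01 NewsPaper3 40511 4600 45111 5040 3330 8370
2017-12-01 NewsPaper4 37770 2980 40750 6520 1880 8400
2017-12-01 NewsPaper5 5176 900 6076 1790 5000 6790
2017-12-01 NewsPaper6 137650 8025 145675 25300 11000 36300
2017-12-01 Total 637547 38201 675748 91032 36558 127590
2018-01-01 NewsPaper1 231295 8391 239686 8790 21176 29966
2018-01-01 NewsPaper2 169937 12130 182067 7890 28850 36740
2018-01-01 NewsPaper3 40453 4570 45023 4750 5055 9800
2018-01-01 NewsPaper4 37766 2970 40736 2500 6540 9040
2018-01-01 NewsPaper5 5136 900 6036 5600 1795 7365
2018-01-01 NewsPaper6 137990 8010 146000 14500 25330 39830
2018-01-01 Total 633919 37786 671705 44980 91141 136121
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", parse_dates=["Date"])


def top_n_per_column(df, n=3, skip_regions=False):
    """Return {column name: Series of the n largest values} for the latest month.

    Hypothetical helper: assumes a 'Date' column, a 'Newspaper' column with a
    'Total' row, and region columns whose names start with 'Region'.
    """
    # Keep only the most recent month.
    latest = df[df["Date"] == df["Date"].max()]
    # Drop the Total row and label the remaining rows by newspaper.
    latest = latest[latest["Newspaper"] != "Total"].set_index("Newspaper")
    columns = list(latest.columns.drop("Date"))
    if skip_regions:
        # Assumption: region totals are the columns prefixed with 'Region'.
        columns = [c for c in columns if not c.startswith("Region")]
    # nlargest returns the values already sorted from largest to smallest.
    return {c: latest[c].nlargest(n) for c in columns}


# Print the city/region name before its top 3 newspapers.
for name, top in top_n_per_column(df).items():
    print(name)
    print(top.to_string(header=False))
    print()
```

Calling `top_n_per_column(df, skip_regions=True)` would restrict the result to the city columns. If the real column names do not follow the `Region` prefix, an explicit list of city columns could be passed around instead.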
Regarding "python - Finding the top 3 in a DataFrame column using pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50863251/