python - Find the top 3 in each column of a dataframe using Pandas

Tags: python pandas

I have a time series dataset that looks like this:

Date        Newspaper   City1    City2   Region1Total   City3   City4  Region2Total
2017-12-01  NewsPaper1  231563   8696    240259         21072   8998   30070
2017-12-01  NewsPaper2  173009   12180   185189         28910   5550   34460
2017-12-01  NewsPaper3  40511    4600    45111          5040    3330   8370
2017-12-01  NewsPaper4  37770    2980    40750          6520    1880   8400
2017-12-01  NewsPaper5  5176     900     6076           1790    5000   6790
2017-12-01  NewsPaper6  137650   8025    145675         25300  11000   36300
2017-12-01  Total       637547   38201   675748         91032  36558   127590

2018-01-01  NewsPaper1  231295   8391    239686         8790   21176   29966
2018-01-01  NewsPaper2  169937   12130   182067         7890   28850   36740
2018-01-01  NewsPaper3  40453    4570    45023          4750   5055    9800
2018-01-01  NewsPaper4  37766    2970    40736          2500   6540    9040
2018-01-01  NewsPaper5  5136     900     6036           5600   1795    7365
2018-01-01  NewsPaper6  137990   8010    146000         14500  25330   39830
2018-01-01  Total       633919   37786   671705         44980  91141   136121 

I am trying to find the n largest values in each column of this dataframe. I tried the following:

import pandas as pd

somelist = []
data = pd.read_csv('newspaper.csv')  # the file is a CSV, so read_csv rather than read_excel
data.index = pd.to_datetime(data['Date'], errors='coerce')
# Consider only the latest month in the dataframe
last_month = data.loc[data.index[-1]].set_index('Newspaper')
for city in last_month.iloc[:, 1:]:  # skip the leftover 'Date' column
    top_3 = last_month[city].nlargest(4)[1:]  # the highest value is the Total row, so skip it
    somelist.append(top_3)
print(somelist)

This produces a list of pandas Series, each named after its column, as shown below:

    [Newspaper
    Newspaper1    231295
    Newspaper2    169937
    Newspaper6    137990
    Name: City1, dtype: float64, Newspaper
    Newspaper2    12130.0
    Newspaper1     8391.0
    Newspaper6     8010.0
    Name: City2, dtype: float64, Newspaper
    Newspaper1    240259
    Newspaper2    185189
    Newspaper6    145675
    Name: Region1Total, dtype: float64, Newspaper
    Newspaper6    14500.0
    Newspaper1     8790.0
    Newspaper2     7890.0
    Name: City3, dtype: float64, Newspaper
    Newspaper2    28850.0
    Newspaper6    25330.0
    Newspaper1    21176.0
    Name: City4, dtype: float64, Newspaper
    Newspaper6    36300
    Newspaper2    34460
    Newspaper1    34460
    Name: Region2Total, dtype: float64, Newspaper]

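What I want is the top three newspapers by sales for each city and region, with the figures sorted from largest to smallest. I would also like the name of the city/region printed before its top 3 results.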

The expected output is a list or a Series, like this:

Newspaper     City1
Newspaper1    231295
Newspaper2    169937
Newspaper6    137990

Newspaper     City2
Newspaper2    12130.0
Newspaper1     8391.0
Newspaper6     8010.0

Newspaper     Region1Total
Newspaper1    240259
Newspaper2    185189
Newspaper6    145675

Newspaper     City3
Newspaper6    14500.0
Newspaper1     8790.0
Newspaper2     7890.0

Newspaper     City4
Newspaper2    28850.0
Newspaper6    25330.0
Newspaper1    21176.0

Newspaper     Region2Total
Newspaper6    36300
Newspaper2    34460
Newspaper1    34460

Also, if I want to skip the regions and consider only the cities, how would I do that? Any help would be appreciated. Thanks in advance.

Best answer

First, you need a dataframe that lists only the individual newspapers, without the Total row.

dff = df.loc[df['Newspaper']!='Total']
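
As a minimal sketch of this filtering step, the snippet below builds a small frame from a few rows of the question's data and also restricts it to the most recent month, which is what the question intended. The inline construction of `df` and the intermediate name `latest` are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# A few rows copied from the question's table; built inline for illustration.
df = pd.DataFrame({
    "Date": ["2017-12-01", "2017-12-01", "2018-01-01", "2018-01-01", "2018-01-01"],
    "Newspaper": ["NewsPaper1", "Total", "NewsPaper1", "NewsPaper2", "Total"],
    "City1": [231563, 637547, 231295, 169937, 633919],
})
df["Date"] = pd.to_datetime(df["Date"])

# Keep only the latest month, then drop the aggregate "Total" row.
latest = df[df["Date"] == df["Date"].max()]
dff = latest.loc[latest["Newspaper"] != "Total"]
print(dff)
```

Filtering on the `Newspaper` label avoids relying on the Total row always being the largest value, which the `nlargest(4)[1:]` trick in the question assumes.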

Then, for City1, you can do:

dff[['Newspaper', 'City1']].sort_values(['City1'], ascending=False).head(3)

Output:

     Newspaper  City1
0   NewsPaper1  231563
1   NewsPaper2  173009
5   NewsPaper6  137650
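
`DataFrame.nlargest` should give the same rows without an explicit sort. A short sketch, assuming `dff` holds the December 2017 newspaper rows from the question (rebuilt inline here so the snippet runs on its own; `top3` is just an illustrative name):

```python
import pandas as pd

# December 2017 rows for City1 from the question, without the "Total" row.
dff = pd.DataFrame({
    "Newspaper": ["NewsPaper1", "NewsPaper2", "NewsPaper3",
                  "NewsPaper4", "NewsPaper5", "NewsPaper6"],
    "City1": [231563, 173009, 40511, 37770, 5176, 137650],
})

# nlargest(3, "City1") returns the three rows with the largest City1 values,
# already ordered from largest to smallest.
top3 = dff.nlargest(3, "City1")[["Newspaper", "City1"]]
print(top3)
```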

Similarly, you can get the results for all the columns of interest.
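
One way to repeat this for every column is sketched below, under some assumptions: the frame is already restricted to a single month with the `Total` row removed, region columns are identified by names ending in `Total` (a convention inferred from the sample data), and `top_n_per_column` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# January 2018 rows from the question, without the "Total" row.
dff = pd.DataFrame({
    "Newspaper": ["NewsPaper1", "NewsPaper2", "NewsPaper3",
                  "NewsPaper4", "NewsPaper5", "NewsPaper6"],
    "City1": [231295, 169937, 40453, 37766, 5136, 137990],
    "City2": [8391, 12130, 4570, 2970, 900, 8010],
    "Region1Total": [239686, 182067, 45023, 40736, 6036, 146000],
})


def top_n_per_column(frame, n=3, cities_only=False):
    """Return {column name: Series of the n largest values, indexed by newspaper}."""
    values = frame.set_index("Newspaper").select_dtypes("number")
    if cities_only:
        # Assumption: region columns are exactly those whose names end in "Total".
        values = values.loc[:, ~values.columns.str.endswith("Total")]
    return {col: values[col].nlargest(n) for col in values.columns}


# Print the column name first, then its top 3 newspapers in descending order.
for name, top in top_n_per_column(dff, cities_only=True).items():
    print(name)
    print(top.to_string())
    print()
```

Leaving `cities_only` at its default of `False` includes the region columns as well, which covers both variants asked for in the question.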

This Q&A on finding the top 3 in each column of a dataframe using Pandas is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/50863251/
