python - Pandas Groupby Multiple Columns - Top N

I've got an interesting one! I tried to find a duplicate question, but had no luck...

My dataframe contains all US states and territories for the years 2013-2016, with several attributes.

>>> df.head(2)
     state  enrollees  utilizing  enrol_age65  util_age65  year
1  Alabama     637247     635431       473376      474334  2013
2   Alaska      30486      28514        21721       20457  2013

>>> df.tail(2)
     state               enrollees  utilizing  enrol_age65  util_age65  year
214  Puerto Rico          581861     579514       453181      450150  2016
215  U.S. Territories      24329      16979        22608       15921  2016

I want to group by year and state, and show the top 3 states for each year (by 'enrollees' or 'utilizing'; it doesn't matter which).

Desired output:

                                       enrollees  utilizing
year state                                                 
2013 California                          3933310    3823455
     New York                            3133980    3002948
     Florida                             2984799    2847574
...
2016 California                          4516216    4365896
     Florida                             4186823    3984756
     New York                            4009829    3874682

So far I've tried the following:

df.groupby(['year','state'])[['enrollees','utilizing']].sum().head(3)

This only returns the first 3 rows of the grouped result:

                 enrollees  utilizing
year state                           
2013 Alabama        637247     635431
     Alaska          30486      28514
     Arizona        707683     683273

I've also tried a lambda function:

df.groupby(['year','state'])[['enrollees','utilizing']]\
  .apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')

This returns the 3 largest rows overall, regardless of year:

                 enrollees  utilizing
year state                           
2016 California    4516216    4365896
2015 California    4324304    4191704
2014 California    4133532    4011208
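
A small sketch of why both attempts ignore the year, using made-up toy data (the state names and numbers below are hypothetical, not from the real frame). After .sum(), the result is an ordinary DataFrame with one row per (year, state) pair, so head(3) and nlargest(3, ...) act on the whole table rather than on each year:

```python
import pandas as pd

# Hypothetical toy data standing in for the real frame.
df = pd.DataFrame({
    "year": [2013] * 4 + [2014] * 4,
    "state": ["A", "B", "C", "D"] * 2,
    "enrollees": [10, 40, 30, 20, 15, 5, 35, 25],
    "utilizing": [9, 38, 29, 18, 14, 4, 33, 24],
})

# Aggregating collapses the GroupBy into a plain DataFrame
# indexed by (year, state).
agg = df.groupby(["year", "state"])[["enrollees", "utilizing"]].sum()
print(type(agg).__name__)

# head(3) takes the first three rows of the whole table.
first_three = agg.head(3)
print(first_three)

# nlargest(3, ...) ranks every row against every other row, across all years.
largest_overall = agg.nlargest(3, "enrollees")
print(largest_overall)
```

To get a per-year ranking, the aggregated frame has to be grouped again by year before taking the top rows.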

I think it may have something to do with the index of the GroupBy object, but I'm not sure... any guidance would be appreciated!

Best answer

Well, you can do it in a way that isn't very pretty.

First, use set() to get a list of the unique years:

years_list = sorted(set(df.year))  # sorted so the result comes out in year order

Create a dummy dataframe and a concatenation function that I've used in the past:

def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoid retyping the same concatenation line for every df.

    df_temp is the temporary df created in each loop iteration; df_full is the
    accumulated df that will contain all values, which must first be
    initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)

    return df_full

Create the dummy final df:

df_final = pd.DataFrame()

Now loop over each year and concatenate the results into the new DF:

for year in years_list:
    # query() filters the rows; @year refers to the external variable,
    # in this case the loop variable.
    # This gives a temporary DF containing only that year.
    df2 = df.query("year == @year")

    # Aggregate, sort by enrollees in descending order and keep the top 3.
    df_temp = df2.groupby(['year','state'])[['enrollees','utilizing']].sum().sort_values(by="enrollees", ascending=False).head(3)
    # Finally, call our function, which keeps concatenating the temporary DFs.
    df_final = concatenate_loop_dfs(df_temp, df_final)

Done.

print(df_final)
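
As a possible alternative to the loop, here is a minimal sketch that sorts the aggregated frame once and then takes the first 3 rows within each year. It is not part of the original answer. The toy data is hypothetical, and it assumes a pandas version in which sort_values accepts index level names (such as 'year') alongside column labels:

```python
import pandas as pd

# Hypothetical toy data standing in for the real frame.
df = pd.DataFrame({
    "year": [2013] * 4 + [2014] * 4,
    "state": ["A", "B", "C", "D"] * 2,
    "enrollees": [10, 40, 30, 20, 15, 5, 35, 25],
    "utilizing": [9, 38, 29, 18, 14, 4, 33, 24],
})

# One row per (year, state), as in the question.
agg = df.groupby(["year", "state"])[["enrollees", "utilizing"]].sum()

# Sort by year ascending and enrollees descending, then keep the first
# three rows of each year. GroupBy.head preserves the sorted row order.
top3 = (
    agg.sort_values(["year", "enrollees"], ascending=[True, False])
       .groupby(level="year")
       .head(3)
)
print(top3)
```

Changing the 3 in head(3) gives the top N per year; sorting on 'utilizing' instead ranks by that column.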

This question, python - Pandas Groupby Multiple Columns - Top N, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/54596360/
