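Using pandas, what is the best way to create the equivalent of a SQL GROUP BY statement with:
- a different aggregate function for each field (e.g. I need the average of field1 and field2, and the max of field3)
- slightly more complex calculations, such as sum(field1)/sum(field2), e.g. for weighted averages
Say I have a table with city-level data and I want to aggregate it by country and region. In SQL I would write: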
select Country, Region
, count(*) as '# of cities'
,sum(GDP) as GDP
,avg(Population) as 'avg # inhabitants per city'
,sum(male_population) / sum(Population) as '% of male population'
from CityTable
group by Country, Region
How can I do the same thing in pandas? Thanks!
Best answer
>>> df
   Country  Region  GDP  Population  male_population
0      USA      TX   10         100               50
1      USA      TX   11         120               60
2      USA      KY   11         200              120
3  Austria  Wienna    5          50               34
>>>
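The answer prints df without showing how it was built. A minimal sketch for reproducing it, assuming the values are exactly the ones printed above (the construction itself is not part of the original answer):

```python
import pandas as pd

# Sample rows copied from the printed frame; "Wienna" is kept as spelled
# in the original data so that later outputs still match.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Austria"],
    "Region": ["TX", "TX", "KY", "Wienna"],
    "GDP": [10, 11, 11, 5],
    "Population": [100, 120, 200, 50],
    "male_population": [50, 60, 120, 34],
})
print(df)
```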
>>> import numpy as np
>>> df2 = df.groupby(['Country','Region']).agg({'GDP': [np.size, np.sum], 'Population': [np.average, np.sum], 'male_population': np.sum})
>>> df2
                GDP      male_population Population
               size  sum             sum    average  sum
Country Region
Austria Wienna    1    5              34         50   50
USA     KY        1   11             120        200  200
        TX        2   21             110        110  220
>>>
>>> df2['% of male population'] = df2['male_population','sum'].divide(df2['Population','sum'])
>>> df2
                GDP      male_population Population      % of male population
               size  sum             sum    average  sum
Country Region
Austria Wienna    1    5              34         50   50                  0.68
USA     KY        1   11             120        200  200                  0.60
        TX        2   21             110        110  220                  0.50
>>>
>>> del df2['male_population', 'sum']
>>> del df2['Population', 'sum']
>>> df2.columns = ['# of cities', 'GDP', 'avg # inhabitants per city', '% of male population']
Result
>>> df2
                # of cities  GDP  avg # inhabitants per city  % of male population
Country Region
Austria Wienna            1    5                          50                  0.68
USA     KY                1   11                         200                  0.60
        TX                2   21                         110                  0.50
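On newer pandas versions (0.25 and later), named aggregation can produce flat column names directly and avoid the MultiIndex columns and the del steps. The following is a sketch of that alternative, not the method used in the answer above; the intermediate names n_cities, avg_population, male_sum, pop_sum and pct_male are made up for illustration, and the sample data is rebuilt from the printed frame.

```python
import pandas as pd

# Sample data rebuilt from the frame printed in the answer.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Austria"],
    "Region": ["TX", "TX", "KY", "Wienna"],
    "GDP": [10, 11, 11, 5],
    "Population": [100, 120, 200, 50],
    "male_population": [50, 60, 120, 34],
})

# Named aggregation: output_name=(input_column, aggregation_function).
out = df.groupby(["Country", "Region"]).agg(
    n_cities=("GDP", "size"),              # count(*)
    GDP=("GDP", "sum"),                    # sum(GDP)
    avg_population=("Population", "mean"), # avg(Population)
    male_sum=("male_population", "sum"),   # helper for the ratio
    pop_sum=("Population", "sum"),         # helper for the ratio
)

# sum(male_population) / sum(Population), computed per group from the helpers.
out["pct_male"] = out["male_sum"] / out["pop_sum"]

# Drop the helpers and switch to the labels used in the SQL query.
out = out.drop(columns=["male_sum", "pop_sum"]).rename(columns={
    "n_cities": "# of cities",
    "avg_population": "avg # inhabitants per city",
    "pct_male": "% of male population",
})
print(out)
```

The ratio is computed after the aggregation because it depends on two aggregated columns at once; a single per-column aggregation function cannot see both.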
Regarding "Python: how to convert a complex SQL aggregation statement to pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27260003/