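I am working through Hadley's "R for Data Science" and trying to reproduce its code in pandas.
I ran into this problem:
I have to create a new rank column based on the delay time of the
flights and keep only the rows with the minimum and maximum rank.
R code: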
library(nycflights13)
library(dplyr)
# remove nans
not_cancelled = flights %>%
filter( !is.na(dep_delay), !is.na(arr_delay))
# create new column of rank based on dep_time for each day.
df = not_cancelled %>%
group_by(year,month,day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r)) %>% # filter only first and last value
select(year,month,day,dep_delay,arr_delay,r)
dim(df)
head(df,10)
This gives:
m=month d =day dl = dep_delay ad = arr_delay r =r
year m d dl ad r
2013 1 1 2 11 831
2013 1 1 -3 -12 1
2013 1 2 43 36 928
2013 1 2 -5 -24 1
2013 1 3 33 22 900
2013 1 3 -10 -11 1
2013 1 4 26 23 908
2013 1 4 -1 -8 1
2013 1 4 -1 -9 1 # Behold! january 4 has 3 rows!!
2013 1 5 15 18 717
I am trying to replicate this in pandas:
import pandas as pd

df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(df.shape)
# print(df.iloc[:5,:5])
# drop cancelled flights (rows with missing delays)
not_cancelled = df.dropna(subset=['dep_delay','arr_delay'])
# rank dep_time within each day; dropped rows get NaN through index alignment
df['r'] = not_cancelled.groupby(['year','month','day'])['dep_time']\
    .rank(method='min',ascending=False)
g = df.groupby(['year','month','day'])['r']
g = g.agg(['min','max']).reset_index()
f = g.head()
print(f)
Python output:
(336776, 19)
year month day min max
0 2013 1 1 1.0 831.0
1 2013 1 2 1.0 928.0
2 2013 1 3 1.0 900.0
3 2013 1 4 1.0 908.0
4 2013 1 5 1.0 717.0
This is not quite right. How do I do this correctly?
Any help is appreciated. Salute to pandas!
Best answer
That is the correct output; you just need to reshape it.
Method 1: stack
g = df.groupby(['year','month','day'])['r']
g = g.agg(['min','max']).stack()
g=g.reset_index(level=[0,1,2])
Method 2: melt
g=df.groupby(['year','month','day'])['r'].agg(['min','max'])
g.reset_index().melt(['year','month','day'])
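To see what the two reshapes do, here is a minimal sketch on a made-up frame (the `toy` data and the names `wide`, `stacked` and `melted` are only for illustration, not part of the question's dataset). Both end up with one row per day and per min/max; only the layout and the row order differ.

```python
import pandas as pd

# Hypothetical toy data: two days, with ranks already computed in column "r".
toy = pd.DataFrame({
    "year":  [2013, 2013, 2013, 2013, 2013],
    "month": [1, 1, 1, 1, 1],
    "day":   [1, 1, 1, 2, 2],
    "r":     [1.0, 2.0, 3.0, 1.0, 2.0],
})

keys = ["year", "month", "day"]
wide = toy.groupby(keys)["r"].agg(["min", "max"])  # one row per day

# Method 1: stack moves the min/max column labels into the innermost index level,
# then the three date levels are turned back into ordinary columns.
stacked = wide.stack().reset_index(level=[0, 1, 2])

# Method 2: melt turns the min/max columns into (variable, value) pairs.
melted = wide.reset_index().melt(keys)

print(stacked)
print(melted)
```

Note that both reshapes return only the min and max rank values per day, not the original flight rows.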
Update
g = df.groupby(['year','month','day'])['r']
g_max = g.transform('max')
g_min = g.transform('min')
yourdf=df.loc[(df.r==g_max)|(df.r==g_min),['year','month','day','r']]
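The transform-based filter keeps the original rows, including ties, which is what `filter(r %in% range(r))` does in the R code. A small sketch on made-up data (the `toy` frame and its values are assumptions for illustration only), where two flights on day 2 share the latest `dep_time`:

```python
import pandas as pd

# Hypothetical toy data: on day 2, two flights share the latest dep_time.
toy = pd.DataFrame({
    "year":     [2013] * 6,
    "month":    [1] * 6,
    "day":      [1, 1, 1, 2, 2, 2],
    "dep_time": [500, 900, 2300, 600, 2350, 2350],
})

keys = ["year", "month", "day"]
# Rough pandas analogue of min_rank(desc(dep_time)): tied values share the lowest rank.
toy["r"] = toy.groupby(keys)["dep_time"].rank(method="min", ascending=False)

g = toy.groupby(keys)["r"]
# transform broadcasts each group's min/max back onto every row of that group,
# so the comparison lines up row by row with the original frame.
is_extreme = (toy["r"] == g.transform("min")) | (toy["r"] == g.transform("max"))
result = toy.loc[is_extreme, keys + ["dep_time", "r"]]
print(result)
```

Day 2 comes back with three rows because both tied flights have rank 1, much like January 4 in the R output.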
Source: the Stack Overflow question on python, filtering only a few group elements after a pandas groupby: https://stackoverflow.com/questions/55786478/