I want to summarise df by location name. The data looks like this:
location <- c("NY", "NC", "KA", "TX", "AZ", "NC", "SC", "ND", "SD", "MN","WA","MA","VT","CA","OR","NJ","OH","MI","IL","GA","FL")
tree_type <- c("pine", "birch", "maple", "palm")
df <- data.frame(location = sample(location, 20, replace = TRUE),
                 tree_type = sample(tree_type, 20, replace = TRUE),
                 density = runif(20, min = 24, max = 365),
                 income = runif(20, min = 37000, max = 62000))
This is what I want:
location mean(density) mean(income) birch maple palm pine
1 AZ 38.44009 52032.95 0 0 1 0
2 CA 136.85112 42243.35 0 1 0 0
3 GA 101.24081 53405.60 2 0 0 0
4 IL 172.02651 46368.42 1 1 0 0
5 MA 198.69868 51117.18 0 0 0 1
6 MI 153.93358 60425.87 1 0 0 0
7 MN 185.05276 46468.68 0 0 1 0
8 NC 181.42187 46007.93 1 0 2 0
9 NJ 302.66541 59316.94 0 0 2 0
10 OR 303.88283 48497.03 0 0 0 2
11 SC 84.05136 50348.41 0 1 0 1
12 SD 158.47423 57894.27 0 0 1 0
13 VT 126.32967 42853.04 0 0 1 0
This is how I did it:
require(dplyr)
require(reshape2)
df_quantvars <- df %>% group_by(location) %>% summarise(mean(density), mean(income))
df_catvarslong <- as.data.frame(table(df[1:2]))
df_catvarswide <- dcast(df_catvarslong, location ~ tree_type, value.var = "Freq")
final_df <- left_join(df_quantvars, df_catvarswide, by = "location")
Is there no way to do this within the dplyr group_by idiom? At the risk of sounding stupid, I tried this: df_quantvars <- df %>% group_by(location) %>% summarise(mean(density), mean(income), table(df[1:2]))
What am I missing?
Best answer
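This reply is a bit late, but I had already put some work into it. Doing it all in one pass is a little tricky. This seems to work:
First I use group_by(location, tree_type) to count all the trees, then I use group_by(location) to get the desired means. Then I use select(-c(density, income)) to drop the original density and income columns, which leaves duplicated rows but correct aggregated counts. Then I use distinct() to remove the duplicates, and finally spread() from the tidyr library to convert to wide format as you requested.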
library(dplyr)
library(tidyr)
df %>%
  arrange(location) %>%
  group_by(location, tree_type) %>%
  mutate(Count = n()) %>%
  group_by(location) %>%
  mutate(MeanDensity = mean(density),
         MeanIncome = mean(income)) %>%
  ungroup() %>%
  select(-c(density, income)) %>%
  distinct() %>%
  spread(key = tree_type, value = Count, fill = 0)
This gives me:
location MeanDensity MeanIncome birch maple palm pine
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 AZ 244.18094 57474.94 0 0 1 0
2 FL 51.90693 42425.36 0 0 0 1
3 GA 341.18643 49385.44 0 0 0 2
4 IL 258.11124 37101.36 0 1 0 0
5 KA 267.92430 59699.20 1 0 0 0
6 MA 87.48623 60632.98 1 0 0 0
7 MI 197.18310 58837.00 0 0 0 1
8 NC 362.48531 50857.42 0 0 1 0
9 ND 315.57415 51465.06 0 0 1 0
10 NJ 233.72886 55877.40 0 0 1 1
11 NY 283.41522 49275.58 0 1 0 1
12 OH 350.23362 40901.73 0 0 1 0
13 OR 267.68415 38954.04 0 2 0 0
14 SC 260.12169 52837.10 0 1 0 0
15 SD 76.29782 54986.76 0 1 0 0
16 VT 341.80646 44547.77 1 0 0 0
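For comparison, here is a minimal sketch of another way to get the same shape of result. It is not part of the pipeline above: it computes the means and the counts separately and joins them. It assumes a tidyr version that provides pivot_wider() (1.0.0 or later), which supersedes spread(). The small data frame and its values are made up for illustration, so that the result can be checked by hand.

```r
library(dplyr)
library(tidyr)

# Small fixed data set (hypothetical values) instead of the random sample,
# so the expected counts and means are easy to verify.
df <- data.frame(
  location  = c("NC", "NC", "NC", "SC", "SC"),
  tree_type = c("palm", "palm", "birch", "pine", "maple"),
  density   = c(100, 200, 300, 50, 150),
  income    = c(40000, 50000, 60000, 45000, 55000)
)

# Per-location means of the numeric columns.
means <- df %>%
  group_by(location) %>%
  summarise(MeanDensity = mean(density),
            MeanIncome  = mean(income))

# Per-location counts of each tree type, one column per type;
# combinations that never occur are filled with 0.
counts <- df %>%
  count(location, tree_type) %>%
  pivot_wider(names_from = tree_type, values_from = n,
              values_fill = list(n = 0L))

# One row per location: means first, then the count columns.
result <- left_join(means, counts, by = "location")
result
```

This still uses a join, like the approach in the question, but stays within dplyr and tidyr and avoids the mutate() / distinct() step. Which version reads better is largely a matter of taste.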
On this topic (r - count of categorical values grouped by another column with dplyr in R), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31956104/