Data
df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
major_rock = c("greywacke", "mudstone", "gravel", "greywacke", "gravel", "mudstone", "gravel", "mudstone", "mudstone",
"conglomerate", "gravel", "mudstone", "greywacke","conglomerate", "gravel", "gravel", "greywacke","gravel",
"greywacke", "gravel", "mudstone", "greywacke", "gravel", "gravel", "gravel", "conglomerate", "greywacke",
"coquina", "gravel", "gravel", "greywacke", "gravel", "mudstone","mudstone", "gravel"),
minor_rock = c("sandstone mudstone basalt chert limestone", "limestone", "sand silt clay", "sandstone mudstone basalt chert limestone",
"sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone", "mudstone sandstone coquina limestone",
"sandstone mudstone limestone", "sand loess silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
"sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
"sand silt clay", "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
"sand loess silt", "sand silt clay", "loess silt sand", "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
"limestone", "loess sand silt", NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
"mudstone sandstone coquina limestone", "mudstone tephra loess"),
area_ha = c(1066.68, 7.59, 3.41, 4434.76, 393.16, 361.69, 306.75, 124.93, 95.84, 9.3, 8.45, 4565.89, 2600.44, 2198.52,
2131.71, 2050.09, 1640.47, 657.09, 296.73, 178.12, 10403.53, 8389.2, 8304.08, 3853.36, 2476.36, 2451.25,
1640.47, 1023.02, 532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))
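What do I want?
I need to prepare df for another analysis that requires exactly one row per site. So in the final data.frame df_fin, each site will have the percentage of its area covered by each level of major_rock and minor_rock, and the column names (variables) will be the levels of major_rock and minor_rock.
I can do this for each variable (major_rock and minor_rock) separately and then combine the results, as shown below.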
What have I done?
For major_rock
library(tidyverse)
df_major_rock <- df %>%
dplyr::select(-minor_rock) %>%
dplyr::group_by(id, major_rock) %>%
dplyr::summarise(total_area = sum(area_ha)) %>%
dplyr::group_by(id) %>%
dplyr::mutate(percent_major = total_area/sum(total_area) * 100) %>%
dplyr::select(-total_area) %>%
tidyr::spread(major_rock, percent_major)
> df_major_rock
Source: local data frame [4 x 6]
Groups: id [4]
id conglomerate coquina gravel greywacke mudstone
* <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 site1 NA NA 0.3164205 98.97929 0.7042907
2 site2 0.1621656 NA 12.3517842 77.32960 10.1564462
3 site3 13.4720995 NA 30.7432536 27.80577 27.9788787
4 site4 6.1085791 2.549393 39.0992422 25.73366 26.5091274
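A quick sanity check for a table like this is that each site's percentages add up to 100. The following is a minimal sketch on made-up data: the `toy` data frame and its values are hypothetical, and `tidyr::pivot_wider` is used in place of `spread`, which tidyr now marks as superseded.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same shape as df (values are made up)
toy <- data.frame(
  id         = c("site1", "site1", "site2", "site2", "site2"),
  major_rock = c("greywacke", "gravel", "gravel", "mudstone", "gravel"),
  area_ha    = c(30, 10, 20, 50, 30)
)

toy_major <- toy %>%
  group_by(id, major_rock) %>%
  summarise(total_area = sum(area_ha), .groups = "drop") %>%
  group_by(id) %>%
  mutate(percent_major = total_area / sum(total_area) * 100) %>%
  ungroup() %>%
  select(-total_area) %>%
  pivot_wider(names_from = major_rock, values_from = percent_major)

# Each row should add up to 100 once the NA cells are ignored
row_totals <- rowSums(toy_major[, -1], na.rm = TRUE)
```

The same check can be run on df_major_rock itself with `rowSums(df_major_rock[, -1], na.rm = TRUE)`.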
The same for minor_rock
df_minor_rock <- df %>%
dplyr::select(-major_rock) %>%
dplyr::group_by(id, minor_rock) %>%
dplyr::summarise(total_area = sum(area_ha)) %>%
dplyr::group_by(id) %>%
dplyr::mutate(percent_minor = total_area/sum(total_area) * 100)%>%
dplyr::select(-total_area) %>%
tidyr::spread(minor_rock, percent_minor)
> df_minor_rock
Source: local data frame [4 x 15]
Groups: id [4]
id limestone `loess sand silt` `loess silt sand` `mudstone sandstone coquina limestone` `mudstone tephra loess` `sand loess silt`
* <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 site1 0.7042907 NA NA NA NA NA
2 site2 2.1784240 NA NA 1.6711771 NA 0.147344
3 site3 NA 1.091484 12.562550 NA NA 13.062701
4 site4 2.8607214 1.328100 6.171154 0.2719299 0.01213617 20.693984
# ... with 8 more variables: `sand silt clay` <dbl>, `sandstone conglomerate coquina tephra` <dbl>, `sandstone mudstone basalt chert
# limestone` <dbl>, `sandstone mudstone conglomerate` <dbl>, `sandstone mudstone conglomerate chert limestone basalt` <dbl>, `sandstone
# mudstone limestone` <dbl>, `sandstone siltstone mudstone limestone silt lignite` <dbl>, `<NA>` <dbl>
Then I join the two data.frames (df_major_rock and df_minor_rock), so the final data.frame df_fin has only 4 observations (one row per site), and its variables are the levels of major_rock and minor_rock.
df_fin <- df_major_rock %>%
dplyr::right_join(., df_minor_rock, by="id")
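Question
df_fin is exactly what I want. However, in this reproducible example I only show 2 variables (major_rock and minor_rock), and I had to create two separate data.frames to get the percentages for each variable's levels and then join them to obtain the final output df_fin. In my real data I have many variables besides major_rock and minor_rock for which I also want the percentage of each level per site. I think there should be a more direct or shorter way than my approach. Any suggestions would be greatly appreciated.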
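Best answer
You can shorten this by using data.table::dcast, which spreads your data into columns. You can then compute the percentages in one step with rowSums. There may be better ways to do this, but I wrapped this approach in a loop over each column: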
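df_fin <- data.frame(id = unique(df$id))
# Every column except id and area_ha is treated as a categorical variable
myColumns <- setdiff(colnames(df)[-1], "area_ha")
for (name in myColumns){
  # Build the formula id ~ <column> and sum area_ha within each id/level cell
  dcastFormula <- as.formula(paste0("id ~ ", name))
  tempdf <- as.data.frame(data.table::dcast(data.table::as.data.table(df), dcastFormula,
                                            fun.aggregate = sum, value.var = "area_ha"))
  # Convert each row to percentages of the site total
  tempdf[,-1] <- tempdf[,-1]/rowSums(tempdf[,-1],na.rm = TRUE)*100
  df_fin <- dplyr::left_join(df_fin, tempdf, by = "id")
}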
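To see what one pass of the loop does, here is a minimal sketch for a single column on made-up data. The `toy` data frame is hypothetical, and the input is converted with `data.table::as.data.table` first, on the assumption that recent data.table versions expect a data.table rather than a plain data.frame in `dcast`.

```r
library(data.table)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  id         = c("site1", "site1", "site2", "site2", "site2"),
  major_rock = c("greywacke", "gravel", "gravel", "mudstone", "gravel"),
  area_ha    = c(30, 10, 20, 50, 30)
)

# One row per id, one column per level, cells hold the summed area
wide <- as.data.frame(
  dcast(as.data.table(toy), id ~ major_rock,
        fun.aggregate = sum, value.var = "area_ha")
)

# Divide every row by its total to turn areas into percentages
wide[, -1] <- wide[, -1] / rowSums(wide[, -1], na.rm = TRUE) * 100
```

Note that with `fun.aggregate = sum`, levels that do not occur at a site come out as 0 rather than NA, which differs from the `spread` output in the question.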
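As always, there are probably several other ways to do this, but this is a simpler example than your starting point. It may also need to be modified depending on your other columns and/or how you want to aggregate them.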
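As one such alternative, the whole computation can also be sketched without a loop by first stacking all categorical columns into long form. This is not the approach from the answer above, only an illustration on made-up data: the `toy` data frame, the `"missing"` label for NA levels, and the `variable_level` column naming are all assumptions, and it relies on `pivot_longer`/`pivot_wider` from tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same shape as df (values are made up)
toy <- data.frame(
  id         = c("site1", "site1", "site2", "site2", "site2"),
  major_rock = c("greywacke", "gravel", "gravel", "mudstone", "gravel"),
  minor_rock = c("limestone", "sand silt clay", NA, "limestone", "sand silt clay"),
  area_ha    = c(30, 10, 20, 50, 30),
  stringsAsFactors = FALSE
)

pct_wide <- toy %>%
  # Stack every categorical column: one row per (id, variable, level)
  pivot_longer(cols = -c(id, area_ha), names_to = "variable", values_to = "level") %>%
  # Give NA levels an explicit label so they get a readable column name
  mutate(level = coalesce(level, "missing")) %>%
  group_by(id, variable, level) %>%
  summarise(total_area = sum(area_ha), .groups = "drop") %>%
  # Percentages are computed within each site and each variable
  group_by(id, variable) %>%
  mutate(percent = total_area / sum(total_area) * 100) %>%
  ungroup() %>%
  select(-total_area) %>%
  # Prefix each level with its variable so equal level names cannot clash
  pivot_wider(names_from = c(variable, level), values_from = percent, names_sep = "_")
```

Because every column other than id and area_ha is stacked, additional categorical variables would be picked up without further changes, provided they share a compatible type. Levels absent from a site are left as NA here; passing `values_fill = 0` to `pivot_wider` would give zeros instead.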
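Regarding "r - dplyr: calculate the percentage of levels for many columns in a data.frame and convert it to wide format", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43713911/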