r - dplyr: calculate percentages of levels for many columns in a data.frame and convert to wide format

Data

df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
                 major_rock = c("greywacke",    "mudstone", "gravel",   "greywacke",    "gravel",   "mudstone", "gravel", "mudstone", "mudstone",   
                                "conglomerate", "gravel", "mudstone",   "greywacke","conglomerate", "gravel",   "gravel",   "greywacke","gravel",   
                                "greywacke",    "gravel",   "mudstone", "greywacke",    "gravel", "gravel", "gravel",   "conglomerate", "greywacke",
                                "coquina",  "gravel",   "gravel",   "greywacke",    "gravel",   "mudstone","mudstone",  "gravel"),
                 minor_rock = c("sandstone mudstone basalt chert limestone",  "limestone",   "sand silt clay", "sandstone mudstone basalt chert limestone",
                                "sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone",  "mudstone sandstone coquina limestone",
                                "sandstone mudstone limestone",  "sand loess silt",  "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
                                "sand silt clay",  "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sand loess silt", "sand silt clay", "loess silt sand",  "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
                                "limestone", "loess sand silt",  NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
                                "mudstone sandstone coquina limestone", "mudstone tephra loess"),
                 area_ha = c(1066.68,   7.59,   3.41,   4434.76,    393.16, 361.69, 306.75, 124.93, 95.84,  9.3,    8.45,   4565.89,    2600.44,    2198.52,    
                             2131.71,   2050.09,    1640.47,    657.09, 296.73, 178.12, 10403.53,   8389.2,  8304.08,   3853.36,    2476.36,    2451.25,    
                             1640.47,   1023.02,    532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))

What do I want?

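I need to prepare df for another analysis that requires exactly one row per site. In the final data.frame, df_fin, each site should therefore hold the percentage (by area) of each level of major_rock and minor_rock, and the column names (variables) should be the levels of major_rock and minor_rock.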

I can do this for each variable (major_rock and minor_rock) separately and then combine the results, as shown below.

What have I done?

For major_rock

library(tidyverse)

df_major_rock <- df %>% 
  dplyr::select(-minor_rock) %>% 
  dplyr::group_by(id, major_rock) %>% 
  dplyr::summarise(total_area = sum(area_ha)) %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(percent_major = total_area/sum(total_area) * 100) %>% 
  dplyr::select(-total_area) %>% 
  tidyr::spread(major_rock, percent_major)

> df_major_rock
Source: local data frame [4 x 6]
Groups: id [4]

      id conglomerate  coquina     gravel greywacke   mudstone
* <fctr>        <dbl>    <dbl>      <dbl>     <dbl>      <dbl>
1  site1           NA       NA  0.3164205  98.97929  0.7042907
2  site2    0.1621656       NA 12.3517842  77.32960 10.1564462
3  site3   13.4720995       NA 30.7432536  27.80577 27.9788787
4  site4    6.1085791 2.549393 39.0992422  25.73366 26.5091274

The same for minor_rock

df_minor_rock <- df %>% 
  dplyr::select(-major_rock) %>% 
  dplyr::group_by(id, minor_rock) %>% 
  dplyr::summarise(total_area = sum(area_ha)) %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(percent_minor = total_area/sum(total_area) * 100) %>% 
  dplyr::select(-total_area) %>% 
  tidyr::spread(minor_rock, percent_minor)

> df_minor_rock
Source: local data frame [4 x 15]
Groups: id [4]

      id limestone `loess sand silt` `loess silt sand` `mudstone sandstone coquina limestone` `mudstone tephra loess` `sand loess silt`
* <fctr>     <dbl>             <dbl>             <dbl>                                  <dbl>                   <dbl>             <dbl>
1  site1 0.7042907                NA                NA                                     NA                      NA                NA
2  site2 2.1784240                NA                NA                              1.6711771                      NA          0.147344
3  site3        NA          1.091484         12.562550                                     NA                      NA         13.062701
4  site4 2.8607214          1.328100          6.171154                              0.2719299              0.01213617         20.693984
# ... with 8 more variables: `sand silt clay` <dbl>, `sandstone conglomerate coquina tephra` <dbl>, `sandstone mudstone basalt chert
#   limestone` <dbl>, `sandstone mudstone conglomerate` <dbl>, `sandstone mudstone conglomerate chert limestone basalt` <dbl>, `sandstone
#   mudstone limestone` <dbl>, `sandstone siltstone mudstone limestone silt lignite` <dbl>, `<NA>` <dbl>

Then I join the two data.frames (df_major_rock and df_minor_rock), so the final data.frame df_fin has only 4 observations (one row per site), and its variables are the levels of major_rock and minor_rock.

df_fin <- df_major_rock %>% 
  dplyr::right_join(., df_minor_rock, by="id")

Question

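df_fin is exactly what I want. However, this reproducible example shows only 2 variables (major_rock and minor_rock), and I had to create two separate data.frames to get the level percentages for each variable and then join them to obtain the final output df_fin. In my real data I have many more variables besides major_rock and minor_rock, and I want the per-site level percentages for those as well. I think there should be a more direct or shorter way than my approach. Any suggestions would be greatly appreciated.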

Best answer

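You can shorten this with data.table::dcast, which spreads your data into columns. You can then compute the percentages in one step with rowSums. There may be better ways to do this, but I wrapped the approach in a loop over each column: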

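df_fin <- data.frame(id = unique(df$id))
# Every column except the site id and the area is treated as a categorical variable
myColumns <- setdiff(colnames(df)[-1], "area_ha")

for (name in myColumns) {
  dcastFormula <- as.formula(paste0("id ~ ", name))
  # dcast's data.table method expects a data.table, so convert first;
  # the result has one column per level, holding the summed area per site
  tempdf <- data.table::dcast(data.table::as.data.table(df), dcastFormula,
                              fun.aggregate = sum, value.var = "area_ha")
  tempdf <- as.data.frame(tempdf)
  # Divide each row by its total to turn areas into percentages
  tempdf[, -1] <- tempdf[, -1] / rowSums(tempdf[, -1], na.rm = TRUE) * 100
  df_fin <- dplyr::left_join(df_fin, tempdf, by = "id")
}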

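As always, there are probably several other ways to do this, but this is a simpler example than your starting point. It may also need adjusting depending on your other columns and/or how you want to aggregate them.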
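
One of those other ways, sketched here as an alternative that is not part of the original answer, stays within the tidyverse: stack all categorical columns into long form, compute the percentages once, and widen the result. The sketch assumes tidyr >= 1.0.0 (for pivot_longer and pivot_wider) and dplyr >= 1.0.0 (for the .groups argument). The toy data set and the names toy, toy_fin, variable, level and percent are hypothetical and only mimic the shape of df.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same layout as df in the question
toy <- data.frame(
  id         = c("site1", "site1", "site2", "site2"),
  major_rock = c("gravel", "mudstone", "gravel", "gravel"),
  minor_rock = c("limestone", NA, "limestone", "sand silt clay"),
  area_ha    = c(30, 10, 5, 15)
)

toy_fin <- toy %>%
  # Stack every categorical column into (variable, level) pairs
  pivot_longer(cols = -c(id, area_ha),
               names_to = "variable", values_to = "level") %>%
  # Total area per site, variable and level
  group_by(id, variable, level) %>%
  summarise(total_area = sum(area_ha), .groups = "drop") %>%
  # Share of each level within its site and variable
  group_by(id, variable) %>%
  mutate(percent = total_area / sum(total_area) * 100) %>%
  ungroup() %>%
  select(-total_area) %>%
  # One row per site, one column per variable/level combination
  pivot_wider(names_from = c(variable, level), values_from = percent)
```

In this toy example, site1 ends up with 75 in major_rock_gravel and 25 in major_rock_mudstone. Prefixing each column with the variable name keeps columns distinct when two variables share a level name, at the cost of column names that differ from those in df_fin above. Combinations that do not occur are left as NA, as in the question's spread() output.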

For "r - dplyr: calculate percentages of levels for many columns in a data.frame and convert to wide format", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43713911/
