r - dplyr: calculate percentages of levels for many columns in a data.frame and convert to wide format

Tags: r dplyr

Data

df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
                 major_rock = c("greywacke",    "mudstone", "gravel",   "greywacke",    "gravel",   "mudstone", "gravel", "mudstone", "mudstone",   
                                "conglomerate", "gravel", "mudstone",   "greywacke","conglomerate", "gravel",   "gravel",   "greywacke","gravel",   
                                "greywacke",    "gravel",   "mudstone", "greywacke",    "gravel", "gravel", "gravel",   "conglomerate", "greywacke",
                                "coquina",  "gravel",   "gravel",   "greywacke",    "gravel",   "mudstone","mudstone",  "gravel"),
                 minor_rock = c("sandstone mudstone basalt chert limestone",  "limestone",   "sand silt clay", "sandstone mudstone basalt chert limestone",
                                "sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone",  "mudstone sandstone coquina limestone",
                                "sandstone mudstone limestone",  "sand loess silt",  "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
                                "sand silt clay",  "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sand loess silt", "sand silt clay", "loess silt sand",  "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
                                "limestone", "loess sand silt",  NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
                                "mudstone sandstone coquina limestone", "mudstone tephra loess"),
                 area_ha = c(1066.68,   7.59,   3.41,   4434.76,    393.16, 361.69, 306.75, 124.93, 95.84,  9.3,    8.45,   4565.89,    2600.44,    2198.52,    
                             2131.71,   2050.09,    1640.47,    657.09, 296.73, 178.12, 10403.53,   8389.2,  8304.08,   3853.36,    2476.36,    2451.25,    
                             1640.47,   1023.02,    532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))

What do I want?

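I need to prepare df for another analysis that requires exactly one row per site. So in the final data.frame df_fin, each site will have the percentage of area covered by each level of major_rock and minor_rock, and the column names (variables) will be the levels of major_rock and minor_rock.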
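To make the target concrete, here is a minimal sketch for a single site. The `toy` data frame and its values are invented for illustration and are not taken from the data above. The area is summed per level and then divided by the site total.

```r
# Toy data, invented for illustration only (not part of df)
toy <- data.frame(
  id = c("siteA", "siteA", "siteA"),
  major_rock = c("gravel", "mudstone", "gravel"),
  area_ha = c(30, 50, 20)
)

# Sum the area per level of major_rock
area_by_level <- tapply(toy$area_ha, toy$major_rock, sum)

# Express each level as a percentage of the site's total area
percent_by_level <- prop.table(area_by_level) * 100
percent_by_level
```

In the wide result, each name of `percent_by_level` would become a column and each site would contribute one row.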

I can do this for each variable (major_rock and minor_rock) separately and then combine the results, as shown below.

What have I done?

For major_rock

library(tidyverse)

df_major_rock <- df %>% 
  dplyr::select(-minor_rock) %>% 
  dplyr::group_by(id, major_rock) %>% 
  dplyr::summarise(total_area = sum(area_ha)) %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(percent_major = total_area/sum(total_area) * 100) %>% 
  dplyr::select(-total_area) %>% 
  tidyr::spread(major_rock, percent_major)

> df_major_rock
Source: local data frame [4 x 6]
Groups: id [4]

      id conglomerate  coquina     gravel greywacke   mudstone
* <fctr>        <dbl>    <dbl>      <dbl>     <dbl>      <dbl>
1  site1           NA       NA  0.3164205  98.97929  0.7042907
2  site2    0.1621656       NA 12.3517842  77.32960 10.1564462
3  site3   13.4720995       NA 30.7432536  27.80577 27.9788787
4  site4    6.1085791 2.549393 39.0992422  25.73366 26.5091274

The same for minor_rock

df_minor_rock <- df %>% 
  dplyr::select(-major_rock) %>% 
  dplyr::group_by(id, minor_rock) %>% 
  dplyr::summarise(total_area = sum(area_ha)) %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(percent_minor = total_area/sum(total_area) * 100)%>% 
  dplyr::select(-total_area) %>% 
  tidyr::spread(minor_rock, percent_minor)

> df_minor_rock
Source: local data frame [4 x 15]
Groups: id [4]

      id limestone `loess sand silt` `loess silt sand` `mudstone sandstone coquina limestone` `mudstone tephra loess` `sand loess silt`
* <fctr>     <dbl>             <dbl>             <dbl>                                  <dbl>                   <dbl>             <dbl>
1  site1 0.7042907                NA                NA                                     NA                      NA                NA
2  site2 2.1784240                NA                NA                              1.6711771                      NA          0.147344
3  site3        NA          1.091484         12.562550                                     NA                      NA         13.062701
4  site4 2.8607214          1.328100          6.171154                              0.2719299              0.01213617         20.693984
# ... with 8 more variables: `sand silt clay` <dbl>, `sandstone conglomerate coquina tephra` <dbl>, `sandstone mudstone basalt chert
#   limestone` <dbl>, `sandstone mudstone conglomerate` <dbl>, `sandstone mudstone conglomerate chert limestone basalt` <dbl>, `sandstone
#   mudstone limestone` <dbl>, `sandstone siltstone mudstone limestone silt lignite` <dbl>, `<NA>` <dbl>

Then I join the two data.frames (df_major_rock and df_minor_rock), so the final data.frame df_fin has only 4 observations (one row per site), and the variables are the levels of major_rock and minor_rock.

df_fin <- df_major_rock %>% 
  dplyr::right_join(., df_minor_rock, by="id")

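Question

df_fin is exactly what I want. However, in this reproducible example I only show 2 variables (major_rock and minor_rock), and I had to create two separate data.frames to get the level percentages for each variable and then join them to get the final output df_fin. In my real data I have many variables besides major_rock and minor_rock for which I also want the per-site level percentages. I think there should be a more direct or shorter way than mine. Any suggestions would be greatly appreciated.

Best answer

You can shorten this with data.table::dcast, which spreads your data into columns. You can then compute the percentages in one step with rowSums. There may be better ways to do this, but I wrapped the approach in a loop over each column: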

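library(dplyr)  # for left_join

df_fin <- data.frame(id = unique(df$id))
# every column except id and area_ha is a categorical variable to summarise
myColumns <- setdiff(colnames(df)[-1], "area_ha")

for (name in myColumns) {
  dcastFormula <- as.formula(paste0("id ~ ", name))
  # sum area_ha per id and level, one column per level;
  # df is converted to a data.table explicitly so that data.table::dcast applies
  tempdf <- as.data.frame(data.table::dcast(data.table::as.data.table(df), dcastFormula,
                                            value.var = "area_ha", fun.aggregate = sum))
  # convert each row to percentages of its row total
  tempdf[, -1] <- tempdf[, -1] / rowSums(tempdf[, -1], na.rm = TRUE) * 100
  df_fin <- left_join(df_fin, tempdf, by = "id")
}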

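As always, there are probably several other ways to do this, but this is a simpler example than your starting point. It may also need to be modified depending on your other columns and/or how you want to aggregate them.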
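One of those other ways, sketched here as an assumption rather than as part of the original answer, is to stack all categorical columns with tidyr::pivot_longer, compute the percentages once, and spread them back with tidyr::pivot_wider. This assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later. The `toy` data frame is an invented stand-in for df.

```r
library(dplyr)
library(tidyr)

# Small stand-in for df, invented for illustration only
toy <- data.frame(
  id = c("s1", "s1", "s1", "s2", "s2"),
  major_rock = c("gravel", "mudstone", "gravel", "gravel", "greywacke"),
  minor_rock = c("limestone", "sand silt", "limestone", "sand silt", "sand silt"),
  area_ha = c(10, 30, 60, 25, 75)
)

toy_wide <- toy %>%
  # stack every categorical column: one row per (id, variable, level) record
  pivot_longer(cols = -c(id, area_ha), names_to = "variable", values_to = "level") %>%
  # total area per site, variable and level
  group_by(id, variable, level) %>%
  summarise(total_area = sum(area_ha), .groups = "drop") %>%
  # percentage of each level within its site and variable
  group_by(id, variable) %>%
  mutate(percent = total_area / sum(total_area) * 100) %>%
  ungroup() %>%
  select(-total_area) %>%
  # prefix columns with the variable name so that identical level names
  # in different variables do not collide
  pivot_wider(names_from = c(variable, level), values_from = percent)

toy_wide
```

Levels that do not occur at a site come out as NA here, as in the spread-based output in the question. With the real data, rows where minor_rock is NA would end up in a column of their own, which may or may not be what the downstream analysis expects.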

Regarding "r - dplyr: calculate percentages of levels for many columns in a data.frame and convert to wide format", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43713911/
