r - Extract category strings found in columns into new columns, using only tidyverse, in R

Tags: r tidyverse

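I am trying to create new columns, using only the tidyverse libraries, based on the categories (strings) found in Comorbidity_one, Comorbidity_two, Comorbidity_three, and so on. I intend to use the new columns for logistic regression, so each new column, named after a string found in those columns, should be binary (0 and 1): 0 for absent, 1 for present. For example, Comorbidity_one contains "Asthma (managed with an inhaler)", which may or may not also appear in the later columns, so "Asthma (managed with an inhaler)" becomes a new column with 1 for patients who have the condition and 0 for patients who do not. Similarly, Comorbidity_two may contain Obesity, which becomes another new column with 1 for patients who have obesity. And so on.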

This is the kind of table I have:

test <- structure(
  list(
    ID = c("1",
           "2", "3",
           "4", "5",
           "6"),
    Chills = c("No", "Mild", "No", "Mild", "No", "No"),
    Cough = c("No", "Severe", "No", "Mild", "Mild", "No"),
    Diarrhoea = c("No", "Mild", "No", "No", "No", "No"),
    Fatigue = c("No", "Moderate", "Mild", "Mild", "Mild", "Mild"),
    Headcahe = c("No", "No", "No", "Mild", "No", "No"),
    `Loss of smell and taste` = c("No", "No", "No", "No", "No", "No"),
    `Muscle Ache` = c("No", "Moderate", "No", "Moderate", "Mild", "Mild"),
    `Nasal Congestion` = c("No", "No", "No", "No", "Mild", "No"),
    `Nausea and Vomiting` = c("No", "No",
                              "No", "No", "No", "No"),
    `Shortness of Breath` = c("No",
                              "Mild", "No", "No", "No", "Mild"),
    `Sore Throat` = c("No",
                      "No", "No", "No", "Mild", "No"),
    Sputum = c("No", "Mild",
               "No", "Mild", "Mild", "No"),
    Temperature = c("No", "No",
                    "No", "No", "No", "37.5-38"),
    Comorbidity_one = c(
      "Asthma (managed with an inhaler)",
      "None",
      "Obesity",
      "High Blood Pressure (hypertension)",
      "None",
      "None"
    ),
    Comorbidity_two = c("Diabetes Type 2", NA,
                        NA, "Obesity", NA, NA),
    Comorbidity_three = c(
      "Asthma (managed with an inhaler)",
      "None",
      "Obesity",
      "High Blood Pressure (hypertension)",
      "None",
      NA_character_
    ),
    Comorbidity_four = c(
      "Asthma (managed with an inhaler)",
      "None",
      "High Blood Pressure (hypertension)",
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_five = c(
      "Asthma (managed with an inhaler)",
      "None",
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_six = c(
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_seven = c(
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_eight = c(
      "High Blood Pressure (hypertension)",
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_nine = c(
      NA_character_,
      NA_character_,
      NA_character_,
      "High Blood Pressure (hypertension)",
      NA_character_,
      "High Blood Pressure (hypertension)"
    )
  ),
  row.names = c(NA,-6L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

Best answer

Here is one approach.

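First, use pivot_longer on the comorbidity columns so that each row holds a single comorbidity. Then remove the NA values and any comorbidities that are duplicated for the same patient.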
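As a minimal sketch of this first step, the snippet below runs the reshaping on a small made-up table. The `toy` tibble and its values are hypothetical and only stand in for the real data; the column names follow the `Comorbidity_*` pattern from the question.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data (not from the question): two patients, three slots
toy <- tibble(
  ID = c("1", "2"),
  Comorbidity_one = c("Asthma", "None"),
  Comorbidity_two = c("Obesity", NA),
  Comorbidity_three = c("Asthma", "None")
)

toy_long <- toy %>%
  # One row per patient per comorbidity slot
  pivot_longer(
    cols = starts_with("Comorbidity"),
    names_to = "Comorbidity_Count",
    values_to = "Comorbidity"
  ) %>%
  # Remove empty slots
  drop_na(Comorbidity) %>%
  # The slot name is no longer needed once the values are stacked
  select(-Comorbidity_Count) %>%
  # Keep each condition once per patient
  distinct()

toy_long
```

Patient 1 lists "Asthma" twice, so `distinct()` leaves one row for it. Without that step, the later widening would find more than one value per cell.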

Then you can use pivot_wider to create one column per comorbidity, set to 1 where it is present, and use values_fill to put 0 instead of NA where it is absent.

library(tidyverse)

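test %>%
  # One row per patient per comorbidity column
  pivot_longer(cols = starts_with("Comorbidity"), names_to = "Comorbidity_Count", values_to = "Comorbidity") %>%
  # Drop empty slots, then the slot name, then conditions repeated for the same patient
  drop_na(Comorbidity) %>%
  select(-Comorbidity_Count) %>%
  distinct() %>%
  # Indicator value that becomes the cell content after widening
  mutate(Condition = 1) %>%
  # One column per condition; absent conditions are filled with 0 instead of NA
  pivot_wider(id_cols = -c(Comorbidity, Condition), names_from = Comorbidity, values_from = Condition, values_fill = list(Condition = 0))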

Output

# A tibble: 6 x 19
  ID    Chills Cough  Diarrhoea Fatigue  Headcahe `Loss of smell a… `Muscle Ache` `Nasal Congesti… `Nausea and Vom… `Shortness of B… `Sore Throat` Sputum Temperature `Asthma (manage… `Diabetes Type … `High Blood Pre…  None Obesity
  <chr> <chr>  <chr>  <chr>     <chr>    <chr>    <chr>             <chr>         <chr>            <chr>            <chr>            <chr>         <chr>  <chr>                  <dbl>            <dbl>            <dbl> <dbl>   <dbl>
1 1     No     No     No        No       No       No                No            No               No               No               No            No     No                         1                1                1     0       0
2 2     Mild   Severe Mild      Moderate No       No                Moderate      No               No               Mild             No            Mild   No                         0                0                0     1       0
3 3     No     No     No        Mild     No       No                No            No               No               No               No            No     No                         0                0                1     0       1
4 4     Mild   Mild   No        Mild     Mild     No                Moderate      No               No               No               No            Mild   No                         0                0                1     0       1
5 5     No     Mild   No        Mild     No       No                Mild          Mild             No               No               Mild          Mild   No                         0                0                0     1       0
6 6     No     No     No        Mild     No       No                Mild          No               No               Mild             No            No     37.5-38                    0                0                1     1       0
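
Note that "None" becomes an indicator column of its own in the output above, and patient 6 has both None and a high blood pressure flag set to 1. If that column is not wanted as a predictor, one possible follow-up is to drop it after widening. This is an assumption about the intended use, not part of the original answer, and the `toy` data below is hypothetical.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data (not from the question)
toy <- tibble(
  ID = c("1", "2", "3"),
  Comorbidity_one = c("Asthma", "None", "None"),
  Comorbidity_two = c("Obesity", NA, "Obesity")
)

toy_wide <- toy %>%
  pivot_longer(
    cols = starts_with("Comorbidity"),
    names_to = "Comorbidity_Count",
    values_to = "Comorbidity"
  ) %>%
  drop_na(Comorbidity) %>%
  select(-Comorbidity_Count) %>%
  distinct() %>%
  mutate(Condition = 1) %>%
  pivot_wider(
    names_from = Comorbidity,
    values_from = Condition,
    values_fill = list(Condition = 0)
  ) %>%
  # Drop the "None" indicator only after widening, so patients whose
  # only entry is "None" still keep their row (with all zeros)
  select(-any_of("None"))

toy_wide
```

Dropping the column after `pivot_wider`, rather than filtering the "None" rows beforehand, keeps patient 2 in the result. Filtering first would remove that patient entirely, because no rows would be left to widen.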

This question and answer come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/61941331/
