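I am trying to create new columns, using only tidyverse libraries, from the categories (strings) found across Comorbidity_one, Comorbidity_two, Comorbidity_three, and so on. I intend to use the new columns for logistic regression, so each new column, named after a string found in those columns, should be binary: 0 for absent, 1 for present.
For example, Comorbidity_one contains "Asthma (managed with an inhaler)", which may or may not appear in the following columns. "Asthma (managed with an inhaler)" should therefore become a new column holding 1 for patients who have the condition and 0 for those who do not. Likewise, Comorbidity_two may contain "Obesity", which should become a new column holding 1 for patients with obesity, and so on.
This is the type of table I have: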
test <- structure(
  list(
    ID = c("1", "2", "3", "4", "5", "6"),
    Chills = c("No", "Mild", "No", "Mild", "No", "No"),
    Cough = c("No", "Severe", "No", "Mild", "Mild", "No"),
    Diarrhoea = c("No", "Mild", "No", "No", "No", "No"),
    Fatigue = c("No", "Moderate", "Mild", "Mild", "Mild", "Mild"),
    Headcahe = c("No", "No", "No", "Mild", "No", "No"),
    `Loss of smell and taste` = c("No", "No", "No", "No", "No", "No"),
    `Muscle Ache` = c("No", "Moderate", "No", "Moderate", "Mild", "Mild"),
    `Nasal Congestion` = c("No", "No", "No", "No", "Mild", "No"),
    `Nausea and Vomiting` = c("No", "No", "No", "No", "No", "No"),
    `Shortness of Breath` = c("No", "Mild", "No", "No", "No", "Mild"),
    `Sore Throat` = c("No", "No", "No", "No", "Mild", "No"),
    Sputum = c("No", "Mild", "No", "Mild", "Mild", "No"),
    Temperature = c("No", "No", "No", "No", "No", "37.5-38"),
    Comorbidity_one = c(
      "Asthma (managed with an inhaler)",
      "None",
      "Obesity",
      "High Blood Pressure (hypertension)",
      "None",
      "None"
    ),
    Comorbidity_two = c("Diabetes Type 2", NA, NA, "Obesity", NA, NA),
    Comorbidity_three = c(
      "Asthma (managed with an inhaler)",
      "None",
      "Obesity",
      "High Blood Pressure (hypertension)",
      "None",
      NA_character_
    ),
    Comorbidity_four = c(
      "Asthma (managed with an inhaler)",
      "None",
      "High Blood Pressure (hypertension)",
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_five = c(
      "Asthma (managed with an inhaler)",
      "None",
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_six = c(
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_seven = c(
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_eight = c(
      "High Blood Pressure (hypertension)",
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_,
      NA_character_
    ),
    Comorbidity_nine = c(
      NA_character_,
      NA_character_,
      NA_character_,
      "High Blood Pressure (hypertension)",
      NA_character_,
      "High Blood Pressure (hypertension)"
    )
  ),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame")
)
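Best Answer
Here is one approach.
First, use pivot_longer on your comorbidity columns so that each row holds a single comorbidity. Then drop the NA values and any comorbidity repeated for the same patient.
Next, use pivot_wider to create one column per comorbidity, set to 1 where the condition is present, and use values_fill so that absent conditions are filled with 0 instead of NA.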
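To make the two reshaping steps easier to follow, here is a minimal sketch on a much smaller table. The `toy` tibble, the shortened condition names, and the helper column names `slot` and `present` are hypothetical and exist only for illustration; the functions are the same tidyr and dplyr verbs described above.

```r
library(dplyr)
library(tidyr)

# Hypothetical three-patient table, used only for illustration
toy <- tibble(
  ID = c("1", "2", "3"),
  Comorbidity_one = c("Asthma", "None", "Obesity"),
  Comorbidity_two = c("Asthma", NA, "Diabetes")
)

# Step 1: one row per patient and comorbidity, without NA values or repeats
toy_long <- toy %>%
  pivot_longer(
    cols = starts_with("Comorbidity"),
    names_to = "slot",
    values_to = "Comorbidity"
  ) %>%
  drop_na(Comorbidity) %>%
  select(-slot) %>%
  distinct()

# Step 2: one 0/1 indicator column per comorbidity
toy_wide <- toy_long %>%
  mutate(present = 1) %>%
  pivot_wider(
    names_from = Comorbidity,
    values_from = present,
    values_fill = list(present = 0)
  )

print(toy_long)
print(toy_wide)
```

Patient 1 lists "Asthma" twice, so `distinct()` is what keeps `pivot_wider` from seeing two values for the same cell.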
library(tidyverse)

test %>%
  # One row per patient and comorbidity slot
  pivot_longer(cols = starts_with("Comorbidity"), names_to = "Comorbidity_Count", values_to = "Comorbidity") %>%
  # Drop empty slots and comorbidities repeated for the same patient
  drop_na(Comorbidity) %>%
  select(-Comorbidity_Count) %>%
  distinct() %>%
  mutate(Condition = 1) %>%
  # id_cols is left at its default: every column not used in names_from or values_from
  pivot_wider(names_from = Comorbidity, values_from = Condition, values_fill = list(Condition = 0))
Output
# A tibble: 6 x 19
ID Chills Cough Diarrhoea Fatigue Headcahe `Loss of smell a… `Muscle Ache` `Nasal Congesti… `Nausea and Vom… `Shortness of B… `Sore Throat` Sputum Temperature `Asthma (manage… `Diabetes Type … `High Blood Pre… None Obesity
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 No No No No No No No No No No No No No 1 1 1 0 0
2 2 Mild Severe Mild Moderate No No Moderate No No Mild No Mild No 0 0 0 1 0
3 3 No No No Mild No No No No No No No No No 0 0 1 0 1
4 4 Mild Mild No Mild Mild No Moderate No No No No Mild No 0 0 1 0 1
5 5 No Mild No Mild No No Mild Mild No No Mild Mild No 0 0 0 1 0
6 6 No No No Mild No No Mild No No Mild No No 37.5-38 0 0 1 1 0
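Since the question mentions logistic regression, the indicator columns may need a little tidying before they go into a model formula. The following is only a sketch under assumptions: `wide` is a hypothetical stand-in for the reshaped table (with just three of its columns), and dropping `None` is a modelling choice that the original answer does not make. Base R's `make.names` is used here to turn names containing spaces and parentheses into syntactic ones.

```r
library(dplyr)

# Hypothetical stand-in for the reshaped table above (only three of its columns)
wide <- tibble(
  ID = c("1", "2", "3"),
  `Asthma (managed with an inhaler)` = c(1, 0, 0),
  None = c(0, 1, 0),
  Obesity = c(0, 0, 1)
)

model_ready <- wide %>%
  # "None" records an answer rather than a condition, so it is dropped here (assumption)
  select(-None) %>%
  # Make the indicator names syntactic so they can be typed in a formula without backticks
  rename_with(make.names, .cols = -ID)

print(names(model_ready))
```

Keeping the original names also works, as long as they are wrapped in backticks wherever they appear in a formula.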
Regarding "r - Extracting category strings in columns into columns using only tidyverse in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61941331/