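The data I receive is not well formatted (and I cannot change it upstream). One column needs to be reordered and split into more than 10 other columns based on specific keywords.
Here is an example of the data I receive. Each person has chosen 3 different foods. Their choice for each food category (food1, food2, food3) immediately follows the category keyword in the text: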
list1 <- c(' food1 pasta food2 apple food3 carrot ')
list2 <- c(' food2 banana food3 cucumber food1 brown rice ')
list3 <- c(' food3 bell pepper food2 plum food1 bread ')
foodListDF <- as.data.frame(matrix(c(1,2,3, list1, list2, list3), nrow = 3), stringsAsFactors = FALSE)
colnames(foodListDF) <- c('Person', 'Choices')
foodListDF
Person Choices
1 1 food1 pasta food2 apple food3 carrot
2 2 food2 banana food3 cucumber food1 brown rice
3 3 food3 bell pepper food2 plum food1 bread
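The above is the format in which I receive the data. My end goal is to split the Choices column into 3 separate columns labeled food1, food2, and food3, with each value placed in the correct column: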
Person food1 food2 food3
1 1 pasta apple carrot
2 2 brown rice banana cucumber
3 3 bread plum bell pepper
I know I can split the choices like this:
library(stringr)
as.data.frame(str_split_fixed(foodListDF$Choices, c(' food1 | food2 | food3 '), 4))[,2:4]
V2 V3 V4
1 pasta apple carrot
2 banana cucumber brown rice
3 bell pepper plum bread
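But this obviously does not put them into the proper groups/order, which is essential.
I'm really just struggling to figure out how to pull the right food from the right category for each person. Any ideas?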
Best answer
You can extract the food numbers and the food items separately (t1 and t2), join them together, unnest the data, and reshape it to wide format.
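library(dplyr)
library(tidyr)
# t1: one list-column per person holding the keywords in order of appearance
foodListDF %>%
  mutate(food = stringr::str_extract_all(Choices, 'food\\d+')) %>%
  select(-Choices) -> t1
# t2: one list-column per person holding the food items in the same order
foodListDF %>%
  separate_rows(Choices, sep = 'food\\d+') %>%
  mutate(Choices = trimws(Choices)) %>%
  filter(Choices != '') %>%  # drop the empty piece before the first keyword
  group_by(Person) %>%
  summarise(col = list(Choices)) -> t2
# pair keywords with items, then spread the keywords into columns
inner_join(t1, t2, by = 'Person') %>%
  unnest(c(food, col)) %>%
  pivot_wider(names_from = food, values_from = col)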
# Person food1 food2 food3
# <chr> <chr> <chr> <chr>
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper
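As a possible alternative, the keyword/value pairs can also be captured in a single regex pass and then placed into columns by keyword name. This is only a sketch, not part of the original answer: the pattern, the helper names (pairs, keys, wide, result), and rebuilding the sample data with data.frame() are my own assumptions. It relies on stringr::str_match_all() and a lookahead, which stringr's regex engine supports.

```r
library(stringr)

# Sample data rebuilt here so the sketch is self-contained
foodListDF <- data.frame(
  Person  = c('1', '2', '3'),
  Choices = c(' food1 pasta food2 apple food3 carrot ',
              ' food2 banana food3 cucumber food1 brown rice ',
              ' food3 bell pepper food2 plum food1 bread '),
  stringsAsFactors = FALSE
)

# For each row, capture (keyword, value) pairs: the value is everything up to
# the next keyword or the end of the string, without surrounding whitespace
pairs <- str_match_all(foodListDF$Choices,
                       '(food\\d+)\\s+(.*?)\\s*(?=food\\d+|$)')

# All keywords seen, ordered by their numeric suffix (so food10 follows food9)
keys <- unique(unlist(lapply(pairs, function(m) m[, 2])))
keys <- keys[order(as.integer(str_extract(keys, '\\d+')))]

# Look each keyword up by name per row; a missing keyword becomes NA
wide <- do.call(rbind, lapply(pairs, function(m) {
  unname(setNames(m[, 3], m[, 2])[keys])
}))
colnames(wide) <- keys

result <- data.frame(Person = foodListDF$Person, wide,
                     stringsAsFactors = FALSE)
result
```

For the sample data this is intended to produce the same Person/food1/food2/food3 table as above. Because values are looked up by keyword name, the order in which the keywords appear in each string should not matter.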
This question about r - splitting a string into multiple columns (in a specific order) comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/65418260/