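I'm new to text mining and am trying to extract a leading run of characters from a column and put the remainder in a new column. I tried str_extract but can't get the regular expression syntax right. Can anyone point me in the right direction? Thanks!
Reproducible data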
df1 <- data.frame("name" = c("D1. Hi my name", "A3.3. Hello this is"), "Amount" = c(1, 4))
                 name Amount
1      D1. Hi my name      1
2 A3.3. Hello this is      4
Expected output
                 name Amount      New Name Extracted
1      D1. Hi my name      1    Hi my name       D1.
2 A3.3. Hello this is      4 Hello this is     A3.3.
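Best answer
We can use extract from tidyr. The pattern captures the leading run of non-whitespace characters (\\S+) as the first group, matches the single space that follows it, and captures the rest of the string (.*) as the second group.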
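library(tidyverse)
# df1 is the data frame from the question
df1 %>%
  # group 1: leading non-space run; group 2: everything after the first space
  extract(name, into = c("Extracted", "NewName"), "^(\\S+) (.*)",
          remove = FALSE) %>%
  # reorder the columns to match the expected output
  select(names(df1), NewName, Extracted)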
#                  name Amount       NewName Extracted
# 1      D1. Hi my name      1    Hi my name       D1.
# 2 A3.3. Hello this is      4 Hello this is     A3.3.
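To see what each capture group in "^(\\S+) (.*)" picks up, the pattern can be run on its own, outside the pipeline. The following is a small base R sketch using regexec and regmatches; the vector x is a hypothetical copy of the name column from the question, and the column labels are chosen here only for illustration.

```r
# Hypothetical copy of the name column from the question
x <- c("D1. Hi my name", "A3.3. Hello this is")

# regexec() records the full match plus each parenthesised group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(x, regexec("^(\\S+) (.*)", x))

# One row per input string: full match, group 1, group 2
groups <- do.call(rbind, m)
colnames(groups) <- c("match", "Extracted", "NewName")
groups
```

The second column holds the prefix and the third holds the remaining text, which is the same split that extract() writes into the Extracted and NewName columns.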
Alternatively, in base R, we can use sub to replace the first whitespace character with a comma and then parse the result with read.csv
# df1 is the data frame from the question; sub() replaces only the first match
cbind(df1, read.csv(text = sub("\\s", ",", df1$name),
                    header = FALSE, col.names = c("Extracted", "NewName")))
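Since the question started from str_extract, the same split can probably also be done with stringr directly. This is a minimal sketch, assuming the prefix is always the first run of non-whitespace characters; the data frame df is a hypothetical copy of the question's data, and str_remove needs a reasonably recent stringr.

```r
library(stringr)

# Hypothetical copy of the data frame from the question
df <- data.frame(name = c("D1. Hi my name", "A3.3. Hello this is"),
                 Amount = c(1, 4))

# Leading run of non-whitespace characters, e.g. "D1." or "A3.3."
df$Extracted <- str_extract(df$name, "^\\S+")

# Drop that prefix together with the whitespace that follows it
df$NewName <- str_remove(df$name, "^\\S+\\s+")

df[, c("name", "Amount", "NewName", "Extracted")]
```

Unlike the read.csv approach, this does not introduce a comma delimiter, so names that already contain commas should not be split further.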
For regular expressions with multiple groups, see the similar question on Stack Overflow: https://stackoverflow.com/questions/57798129/