I have the following character vector:
"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
I want to split it into sentences using the following pattern (i.e. period, space, uppercase letter):
"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"
So a period after an abbreviation should not start a new sentence. I want to do this in R with a regular expression.
Can someone help me?
A solution using strsplit:
string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
Result:
[1] "This is a very long character vector."
[2] "Why is it so long?"
[3] "I think lng. is short for long."
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"
[6] "That would be nice?"
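This matches any punctuation character followed by a whitespace character and an uppercase letter. The lookbehind (?<=[[:punct:]]) keeps the punctuation in the string before the matched delimiter, and the lookahead (?=[A-Z]) keeps the uppercase letter in the string after the matched delimiter. Only the whitespace between them is consumed as the split point.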
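To see why the lookarounds matter, here is a small sketch comparing a pattern that consumes the punctuation and the capital letter with the lookaround version. The sample text `x` and the variable names are made up for illustration and are not from the question.

```r
# Hypothetical sample text, not taken from the question
x <- "One sentence. Two e.g. here. Three?"

# Consuming pattern: the period, the space and the capital letter are all
# part of the match, so strsplit() drops them from the pieces
consumed <- strsplit(x, "\\. [A-Z]")[[1]]

# Lookaround pattern: only the whitespace is matched and removed
kept <- strsplit(x, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

With the consuming pattern the pieces lose their final period and the first letter of the next sentence; with the lookarounds each piece keeps both.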
EDIT:
I just saw that you did not split after the question mark in your desired output. If you only want to split after a "." you can use this:
unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))
which gives
[1] "This is a very long character vector."
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
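If you need to switch between the two behaviours, the pattern can be built from a set of sentence-ending marks. This is only a sketch: `split_sentences` and its `marks` argument are hypothetical names, not an existing function. It also avoids one side effect of `[[:punct:]]`, which matches commas too, so "Yes, I agree" would be split after the comma.

```r
# Hypothetical helper: split after any character listed in `marks` when it is
# followed by whitespace and an uppercase letter.
# Assumption: `marks` contains only characters that are literal inside a
# regex character class (so no "]", "^", "-" or backslash).
split_sentences <- function(text, marks = ".?!") {
  pattern <- paste0("(?<=[", marks, "])\\s(?=[A-Z])")
  unlist(strsplit(text, pattern, perl = TRUE))
}

# Made-up example text: the comma no longer triggers a split
print(split_sentences("Yes, I agree. Is that ok? Fine by me."))

# Split after periods only, as in the EDIT above
print(split_sentences("First one. Second one? Still second. Third.", marks = "."))
```

Like the patterns above, this still splits after an abbreviation that happens to be followed by a capitalised word.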