r - Split a character vector into sentences

Tags: r regex


I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I would like to split it into sentences using the following pattern (i.e. period - space - uppercase letter):

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

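So a period after an abbreviation should not start a new sentence. I would like to do this in R with a regular expression.

Can someone help me?

Best answer

A solution using strsplit: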

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?" 

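This matches any punctuation character followed by a whitespace character and an uppercase letter. The lookbehind (?<=[[:punct:]]) keeps the punctuation in the piece before the matched delimiter, and the lookahead (?=[A-Z]) keeps the matched uppercase letter in the piece after the matched delimiter. Only the whitespace itself is consumed by the split.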
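To make the role of the lookarounds concrete, the sketch below contrasts the zero-width pattern with a version that consumes the punctuation and the capital letter. The sample text `x` and the variable names are made up for illustration and are not part of the original answer.

```r
# Illustrative input (an assumption, not taken from the question)
x <- "First one. Second one? Third one."

# Lookarounds are zero-width: only the whitespace is treated as the delimiter,
# so the punctuation and the capital letter stay in the pieces
with_lookaround <- unlist(
  strsplit(x, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)
)

# Without lookarounds the punctuation and the capital letter are part of the
# delimiter, so strsplit drops them from the pieces
without_lookaround <- unlist(
  strsplit(x, "[[:punct:]]\\s[A-Z]", perl = TRUE)
)

print(with_lookaround)
print(without_lookaround)
```

In the second result the sentence-ending punctuation and the first letter of each following sentence are missing, which is why the answer uses lookarounds.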

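Edit: I just noticed that your desired output does not split after a question mark. If you only want to split after "." you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives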

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"  
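One caveat: the pattern only tells an abbreviation from a sentence end by the case of the next letter, so an abbreviation followed by a capitalized word would still be split. The sketch below shows this and one possible workaround, a negative lookbehind for a known abbreviation. The example text and the choice of "Dr." are assumptions for illustration; a real abbreviation list would need to be supplied by the user.

```r
# Illustrative input (an assumption, not taken from the question)
txt <- "I met Dr. Smith today. He was late."

# The period-only pattern splits after "Dr." because "Smith" is capitalized
plain <- unlist(strsplit(txt, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

# Possible workaround: refuse to split when the text just before the
# whitespace is the known abbreviation "Dr." (a fixed-length lookbehind)
guarded <- unlist(
  strsplit(txt, "(?<=\\.)(?<!Dr\\.)\\s(?=[A-Z])", perl = TRUE)
)

print(plain)
print(guarded)
```

Further abbreviations could be handled by chaining more negative lookbehinds of the same form, each for one fixed abbreviation.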

Regarding "r - Split a character vector into sentences", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46884556/
