regex - Splitting a paragraph into sentences in R

Tags: regex r strsplit

I am using the strsplit function to do this.

I found a number of regular expressions for this purpose, for example:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

First, when I simply use it in R, I get an error:

sl <- unlist(strsplit(txt1,"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"))

Error: '\w' is an unrecognized escape in character string starting ""(?<!\w"

Second, when I try it in a regex tester, it does not solve my problem either. My paragraph is:

As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE. The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.

I want 2 sentences:

As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.

But the regex above splits it into 3 sentences:

As of Feb.
9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.

Best answer

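I don't understand what you are trying to do with the two negative lookbehinds ( (?<!\w\.\w.)(?<![A-Z][a-z]\.) ). You really only need a positive lookbehind that checks for a period or a question mark, (?<=\\.|\\?) (maybe add the exclamation mark as well?), followed by a whitespace character \\s, and then a positive lookahead for an uppercase letter: (?=[A-Z]) .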
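To see what each piece of that pattern contributes, here is a small sketch on a made-up string (the `demo` variable and its text are my own, not taken from the question). With the lookbehind alone, the space after "Feb." counts as a sentence boundary; once the lookahead is added, a split only happens where an uppercase letter follows.

```r
# Hypothetical sample text with an abbreviation followed by a digit.
demo <- "As of Feb. 9 it rained. Was it cold? Yes."

# Lookbehind only: split at any whitespace preceded by "." or "?".
no_lookahead <- strsplit(demo, "(?<=\\.|\\?)\\s", perl = TRUE)[[1]]

# Lookbehind plus lookahead: also require an uppercase letter after the space.
with_lookahead <- strsplit(demo, "(?<=\\.|\\?)\\s(?=[A-Z])", perl = TRUE)[[1]]

print(no_lookahead)    # 4 pieces, "As of Feb." is cut off on its own
print(with_lookahead)  # 3 pieces, "Feb. 9" stays together
```

Note that the lookahead is only a heuristic: an abbreviation followed by a capitalized word (for example "Mr. Smith") would still be split.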

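And yes, in R every backslash in the pattern has to be written as two backslashes ( \\ ) inside the string literal, and if you use lookahead or lookbehind in strsplit you need to specify perl = TRUE .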
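A quick sketch of both points follows. It assumes R 4.0.0 or later for the raw-string syntax r"(...)", which is not part of the original answer and is shown only as an alternative to doubling the backslashes. The string `x` is made up for illustration.

```r
x <- "One. Two? Three."

# In a normal string literal every regex backslash is doubled,
# so the regex engine ends up seeing \. \? and \s
pat <- "(?<=\\.|\\?)\\s(?=[A-Z])"

# A raw string (assumes R >= 4.0.0) holds the same pattern without doubling.
pat_raw <- r"((?<=\.|\?)\s(?=[A-Z]))"
same_pattern <- identical(pat, pat_raw)

# The default regex engine does not support lookaround, so this call
# is expected to raise an "invalid regular expression" error.
default_engine <- tryCatch(
  suppressWarnings(strsplit(x, pat)),
  error = function(e) "error"
)

# With perl = TRUE the PCRE engine handles lookbehind and lookahead.
perl_engine <- strsplit(x, pat, perl = TRUE)[[1]]

print(same_pattern)
print(default_engine)
print(perl_engine)
```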

All in all, what you really need is

 strsplit(txt1, "(?<=\\.|\\?)\\s(?=[A-Z])", perl = TRUE)

which gives you

[[1]]
[1] "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed with BSE."
[2] "The government has paid $6.1 million in compensation, and is budgeting $16 million for 1990."   
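
If the paragraph is stored with hard line breaks, as it is printed in the question, and you also want to split on exclamation marks, something along these lines may help. `split_sentences` is a hypothetical helper name, and collapsing whitespace before splitting is my assumption about how to arrive at single-line sentences. The character class [.?!] is just a more compact way to write the alternation in the lookbehind.

```r
# Hypothetical helper: collapse whitespace, then split on sentence boundaries.
split_sentences <- function(txt) {
  # Turn newlines and repeated spaces into single spaces and trim the ends.
  flat <- trimws(gsub("\\s+", " ", txt))
  # Split at a space preceded by ".", "?" or "!" and followed by an uppercase letter.
  strsplit(flat, "(?<=[.?!])\\s(?=[A-Z])", perl = TRUE)[[1]]
}

# The paragraph from the question, including its line breaks.
txt1 <- "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE. The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990."

sentences <- split_sentences(txt1)
print(sentences)
```

For anything beyond simple text like this, a rule of this kind will eventually hit abbreviations it cannot tell apart from sentence ends, so treat it as a starting point rather than a complete sentence tokenizer.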

This question and answer on splitting a paragraph into sentences in R come from a similar question on Stack Overflow: https://stackoverflow.com/questions/35304900/
