regex - Splitting a paragraph into sentences in R

I am using the strsplit function to do this.

I found the following regular expression, one of many suggested for this purpose:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

First, when I simply use it in R, I get an error:

sl <- unlist(strsplit(txt1,"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"))

Error: '\w' is an unrecognized escape in character string starting ""(?<!\w"

Second, when I test it in a regex tester, it does not solve my problem. My paragraph is:

As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE. The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.

I want 2 sentences:

As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.

But the regex above splits it into 3 sentences:

As of Feb.
9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.

Best answer

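I don't understand what you are trying to do with the two negative lookbehinds ( (?<!\w\.\w.)(?<![A-Z][a-z]\.) ). You really only need the positive lookbehind, which checks for a period or a question mark before the split point, (?<=\\.|\\?) (maybe add an exclamation mark as well?), then a whitespace character, \\s , and then a positive lookahead for an uppercase letter: (?=[A-Z]) .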
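To make the role of the lookahead concrete, here is a small sketch that compares the pattern with and without (?=[A-Z]). The string x is a shortened, made-up variant of the paragraph in the question, not the original data.

```r
# Shortened stand-in for the paragraph in the question (illustrative only)
x <- "As of Feb. 9, cattle were diagnosed with BSE. The government paid."

# Lookbehind only: splits on any whitespace that follows "." or "?",
# so the abbreviation "Feb." is treated as a sentence end
without_lookahead <- strsplit(x, "(?<=\\.|\\?)\\s", perl = TRUE)[[1]]

# Lookbehind plus lookahead: the whitespace must also be followed by an
# uppercase letter, so "Feb. 9" is left alone
with_lookahead <- strsplit(x, "(?<=\\.|\\?)\\s(?=[A-Z])", perl = TRUE)[[1]]

length(without_lookahead)  # 3
length(with_lookahead)     # 2
```

Only the whitespace itself is consumed by the split. The lookbehind and lookahead just test the neighbouring characters, so the punctuation and the capital letter stay in the resulting pieces.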

And yes, in R every backslash in the pattern has to be written as two backslashes ( \\ ), and if you use lookahead or lookbehind in strsplit you need to specify perl = TRUE .
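The escaping rule can be checked directly. The sketch below assumes R 4.0.0 or later for the raw-string syntax r"{...}"; the variable names and the test string are made up for illustration.

```r
# In R source, "\\" is a single backslash in the resulting string
pattern <- "(?<=\\.|\\?)\\s(?=[A-Z])"
cat(pattern, "\n")  # shows the regex exactly as the engine receives it

# Raw strings (R >= 4.0.0) avoid the doubling altogether
raw_pattern <- r"{(?<=\.|\?)\s(?=[A-Z])}"
same <- identical(pattern, raw_pattern)

# Lookarounds are Perl-style syntax, hence perl = TRUE
parts <- strsplit("One. Two? Three", pattern, perl = TRUE)[[1]]
parts
```

Printing the pattern with cat is a quick way to see what was actually passed to the regex engine when an escape does not behave as expected.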

All in all, what you really need is

 strsplit(txt1, "(?<=\\.|\\?)\\s(?=[A-Z])", perl = TRUE)

which gives you

[[1]]
[1] "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed with BSE."
[2] "The government has paid $6.1 million in compensation, and is budgeting $16 million for 1990."
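As a possible extension, the exclamation mark suggested above can be folded into a character class. The following is only a sketch: split_sentences is a hypothetical helper name, \\s+ (one or more whitespace characters) is an added assumption rather than part of the answer, and txt1 is rebuilt here from the paragraph in the question.

```r
# Paragraph from the question, joined into a single string
txt1 <- paste(
  "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food",
  "said that 9,998 cattle have been destroyed after being diagnosed",
  "with BSE. The government has paid $6.1 million in compensation, and is",
  "budgeting $16 million for 1990."
)

# Hypothetical helper: split after ".", "?" or "!" when the following
# whitespace run is itself followed by an uppercase letter
split_sentences <- function(text) {
  strsplit(text, "(?<=[.?!])\\s+(?=[A-Z])", perl = TRUE)[[1]]
}

sentences <- split_sentences(txt1)
length(sentences)  # 2

# Known limitation: an abbreviation followed by a capitalized word
# is still treated as a sentence boundary
split_sentences("Dr. Smith arrived. He left!")
```

The heuristic works for this paragraph because "Feb." is followed by a digit. Text with titles such as "Dr." or "Mr." would need extra handling, for example a list of known abbreviations.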

Regarding regex - splitting a paragraph into sentences in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35304900/
