I am using the strsplit function to do this.
I found that many people use this regex for the purpose:
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
First, when I simply use it in R, I get an error:
sl <- unlist(strsplit(txt1,"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"))
Error: '\w' is an unrecognized escape in character string starting ""(?<!\w"
And when I try it in a regex tester, it still does not solve my problem. My paragraph is:
As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE. The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.
I want 2 sentences:
As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.
But the regex above splits it into 3 sentences:
As of Feb.
9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.
Best answer
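I don't understand what you are trying to do with the two negative lookbehinds, (?<!\w\.\w.)(?<![A-Z][a-z]\.). All you really need is a positive lookbehind that matches a period or a question mark, (?<=\\.|\\?) (maybe add an exclamation mark as well?), then a whitespace character, \\s, and then a positive lookahead for an uppercase letter: (?=[A-Z]).
And yes, in R you need to write every regex backslash as two backslashes (\\) inside a string literal, and if you use lookaheads or lookbehinds in strsplit, you need to specify perl = TRUE.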
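To make the escaping point concrete, here is a minimal sketch. It assumes R 4.0.0 or later for the raw-string form r"(...)"; the variable names and the short test string "One. Two" are made up for illustration and are not part of the question.

```r
# The regex as a regex tester would show it is: (?<=\.|\?)\s(?=[A-Z])
# In an ordinary R string literal, each backslash has to be doubled.
pattern_escaped <- "(?<=\\.|\\?)\\s(?=[A-Z])"

# Assumption: R >= 4.0.0, where a raw string lets you keep single backslashes.
pattern_raw <- r"((?<=\.|\?)\s(?=[A-Z]))"

# Both literals should describe the same pattern.
same_pattern <- identical(pattern_escaped, pattern_raw)
print(same_pattern)

# Lookarounds are a Perl-style feature, so perl = TRUE is needed here.
pieces <- strsplit("One. Two", pattern_escaped, perl = TRUE)[[1]]
print(pieces)

# Without perl = TRUE the default engine typically rejects the lookbehind;
# tryCatch captures whatever message this R build reports instead of stopping.
default_result <- tryCatch(
  strsplit("One. Two", pattern_escaped),
  error = function(e) conditionMessage(e)
)
print(default_result)
```

The raw-string form is only a convenience; on older R versions the doubled-backslash literal is the one to use.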
All in all, what you really need is
strsplit(txt1, "(?<=\\.|\\?)\\s(?=[A-Z])", perl = TRUE)
which gives you
[[1]]
[1] "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed with BSE."
[2] "The government has paid $6.1 million in compensation, and is budgeting $16 million for 1990."
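The result above can be reproduced with a short, self-contained sketch. Here txt1 is rebuilt from the paragraph in the question as a single space-joined string (an assumption about how it was stored), and split_sentences is a hypothetical helper, not part of base R. The helper uses a character class [.?!] to include the exclamation mark suggested in the answer and \\s+ to tolerate repeated whitespace.

```r
# Rebuild the paragraph from the question as one string (lines joined by spaces).
txt1 <- paste(
  "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food",
  "said that 9,998 cattle have been destroyed after being diagnosed",
  "with BSE. The government has paid $6.1 million in compensation, and is",
  "budgeting $16 million for 1990."
)

# Hypothetical helper: split on whitespace that follows ., ? or !
# and is followed by an uppercase letter.
split_sentences <- function(text) {
  unlist(strsplit(text, "(?<=[.?!])\\s+(?=[A-Z])", perl = TRUE))
}

sentences <- split_sentences(txt1)
print(sentences)

# Limitation: an abbreviation followed by a capitalized word is still
# treated as a sentence boundary, so "Mr." ends up on its own.
limitation <- split_sentences("Mr. Smith arrived. He was late! Was he?")
print(limitation)
```

"Feb. 9" stays intact only because the next character is a digit rather than an uppercase letter. Text with titles such as "Mr." or "Dr." would need extra handling beyond this pattern.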
Source: the Stack Overflow question on splitting a paragraph into sentences with a regex in R: https://stackoverflow.com/questions/35304900/