Regex in R to select sentences that end with a newline

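My understanding is that R uses either extended regular expressions or Perl-like regular expressions. I have searched SO and the web for a solution to this regex problem and come up empty:

In R, I have a vector of text files. Each element consists of several paragraphs. I want to extract a few sentences from each element and build a new vector from that subset of the text. The sentences I want to extract follow a predictable pattern.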

text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
          "AND \n \n notes: text text/text.\n \n text text \n text",
          "AND \n \n house: text text/text.\n \n text text \n text")

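I want to extract all the text from "house notes", "house", or "notes" up to the first "\n" that follows. Words such as "house notes", "house", or "notes" may appear elsewhere in the document, but I am only interested in their first occurrence.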

> output
"house notes: text text/text.\n",
"notes: text text/text.\n ",
"house: text text/text.\n "

I can get it to work in PHP with \w++ notes:\w++\w*+[^_]\w[^:\\]*+\\\w< but not in R.

Best answer

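Note two things. First, you tested your pattern against a string containing a literal \n (a backslash followed by n), not an actual newline character. Second, your pattern is written in the PCRE flavor (\w++ contains a possessive quantifier), so you have to pass perl=TRUE to the base R regex functions to use a pattern like that.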
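To make both points concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration and are not from the question. It shows that "\n" in an R string is a single newline character while "\\n" is two characters, and that a pattern with possessive quantifiers can be run by requesting the PCRE engine with perl = TRUE.

```r
# Hypothetical sample strings: one holds a real newline, the other a
# literal backslash followed by the letter n.
with_newline <- "notes: a b.\n rest"
with_literal <- "notes: a b.\\n rest"

# "\n" is one character (newline); "\\n" is two (backslash + n).
n_newline <- nchar("\n")
n_literal <- nchar("\\n")

# Only the first string contains an actual newline character.
has_newline <- grepl("\n", c(with_newline, with_literal), fixed = TRUE)

# Possessive quantifiers (\w++ and *+) are PCRE syntax, so perl = TRUE
# is passed to the base R function.
m <- regmatches(with_newline,
                regexpr("\\w++:[^\n]*+\n", with_newline, perl = TRUE))
print(m)
```

Here m holds the text from the word before the colon through the first newline.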

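Since you only want the text from a specific string up to a newline, the best pattern is a group of alternatives, followed by a negated character class (matching any character except \n), followed by a newline: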

> text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
+           "AND \n \n notes: text text/text.\n \n text text \n text",
+           "AND \n \n house: text text/text.\n \n text text \n text")
> 
> pat = "(house( notes)?|notes):[^\n]*\n"
> regmatches(text, gregexpr(pat, text))
[[1]]
[1] "house notes: text text/text.\n"

[[2]]
[1] "notes: text text/text.\n"

[[3]]
[1] "house: text text/text.\n"

Details:

(house( notes)?|notes) - a group matching house, house notes, or notes
: - a colon
[^\n]* - a negated character class matching zero or more characters other than a newline
\n - a newline.
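The question asks only for the first occurrence in each element, whereas gregexpr returns every match. A possible variant, sketched here with made-up input in which a keyword appears twice, swaps in regexpr, which reports only the first match per element. This is an illustration of the difference, not part of the original answer.

```r
# Hypothetical input: the first element contains two keyword lines.
text <- c("AND \n \n house notes: text text/text.\n \n notes: later text.\n text",
          "AND \n \n notes: text text/text.\n \n text text \n text")

pat <- "(house( notes)?|notes):[^\n]*\n"

# regexpr() finds only the first match in each element; regmatches()
# then returns a plain character vector of those matches.
first_only <- regmatches(text, regexpr(pat, text))

# gregexpr() finds every match and yields a list with one vector per element.
all_matches <- regmatches(text, gregexpr(pat, text))

print(first_only)
print(lengths(all_matches))
```

Be aware that with regexpr, regmatches silently drops elements that have no match, so the result can be shorter than the input vector.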

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/40455648/
