regex - R regular expression: extracting speaker in a script

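I want to use R to extract the speakers from a script formatted as in the following example:

"Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he's a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship's entertainment."

In this example, I want to extract: ("Second Lord", "First Lord", "Second Lord", "BERTRAM", "Second Lord"). The rule is straightforward: a speaker is the group of words between the end of a sentence and the following colon.

How can I write this in R?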

Best answer

Maybe something like this:

library(stringr)
body <- "Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he's a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship's entertainment."
# match ":", "." or "?", a space, then letters and spaces up to the next ":"
# ([A-Za-z ] rather than [A-z ], which would also match [ \ ] ^ _ and `)
p <- str_extract_all(body, "[:.?] [A-Za-z ]*:")

# and get rid of extra signs
p <- str_replace_all(p[[1]], "[?:.]", "")
# strip white spaces
p <- str_trim(p)
p
# [1] "Second Lord" "First Lord"  "Second Lord" "BERTRAM"     "Second Lord"

# unique players
unique(p)
# [1] "Second Lord" "First Lord"  "BERTRAM"

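Explanation of the regex (not perfect):

str_extract_all(body, "[:.?] [A-Za-z ]*:") starts each match at a ":", "." or "?" (the class [:.?]) followed by a single space. It then matches any run of letters and spaces up to the next ":". The letter class is best written [A-Za-z ]; the shorter [A-z ] also matches the punctuation characters that sit between "Z" and "a" in ASCII. The pattern is not perfect: a speaker name containing a digit, hyphen or apostrophe is missed, and any sentence that consists only of letters and spaces and ends in a colon is reported as a speaker.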
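The two clean-up steps (removing punctuation, then trimming) can be avoided by putting the name in a capture group. The following is a minimal sketch, not part of the original answer: it assumes stringr's str_match_all, and the sample text is abridged from the question, so the variable names `m` and `speakers` are illustrative only.

```r
library(stringr)

# Abridged sample in the same "Speaker: line" format as the question.
body <- "Scene 6: Second Lord: Nay, good my lord. First Lord: Hold me no more in your respect. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord."

# str_match_all() returns one matrix per input string:
# column 1 holds the full match, column 2 the first capture group.
m <- str_match_all(body, "[:.?] ([A-Za-z ]+):")

# Keep only the captured speaker names.
speakers <- m[[1]][, 2]
speakers
```

Because the group excludes the leading punctuation, the space and the trailing colon, `speakers` needs no further str_replace_all or str_trim call for text in this format.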

Getting the positions

You can use str_locate_all with the same regex:

str_locate_all(body, "[:.?] [A-Za-z ]*:")
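As a rough illustration of what those positions look like and how they map back to the text, here is a small sketch. It assumes stringr's str_locate_all and str_sub; the abridged sample and the names `pattern`, `loc` and `names_from_pos` are hypothetical and not from the original answer.

```r
library(stringr)

# Abridged sample in the same format as the question.
body <- "Scene 6: Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him?"
pattern <- "[:.?] [A-Za-z ]*:"

# One matrix per input string, with columns "start" and "end"
# giving the character positions of each match.
loc <- str_locate_all(body, pattern)[[1]]

# Each match begins with a punctuation mark plus a space (2 characters)
# and ends with a colon (1 character), so shift the bounds inward.
names_from_pos <- str_sub(body, loc[, "start"] + 2, loc[, "end"] - 1)
names_from_pos
```

The start and end columns refer to the whole match, including the leading punctuation and the trailing colon, which is why the offsets are adjusted before calling str_sub.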

Regarding regex - R regular expression: extracting speaker in a script, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11358100/
