I'm using the excellent tidytext
package to tokenize sentences across several paragraphs. For example, I want to take the following paragraph:
"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
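and tokenize it into these two sentences:
- "I am perfectly convinced by it that Mr. Darcy has no defect."
- "He owns it himself without disguise."
However, when I use tidytext's
default sentence tokenizer, I get three sentences.
Code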
library(dplyr)     # provides tibble(); data_frame() is deprecated
library(tidytext)  # provides unnest_tokens()
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result
# A tibble: 3 x 1
  Sentence
  <chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
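What is a simple way to tokenize sentences with tidytext
without common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
Best answer
You can use a regular expression as the splitting condition, though there is no guarantee that it covers every common honorific. The pattern below splits on any period that is not preceded by a standalone two-letter word ending in "r" (Mr, Dr, Sr, Jr). The period is the delimiter, so it is dropped from the output: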
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
  Sentence
  <chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
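To see what the negative lookbehind does in isolation, here is a minimal sketch that applies the same pattern with stringi directly. Using stringi here is my assumption for illustration (the answer itself only calls unnest_tokens()), and the variable names are made up:

```r
library(stringi)

text <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

# Split on a period unless it directly follows a standalone
# two-character word ending in "r" (Mr, Dr, Sr, Jr, but also "or").
pattern <- "(?<!\\b\\p{L}r)\\."

pieces <- stri_split_regex(text, pattern)[[1]]
pieces <- stri_trim_both(pieces)   # remove the space left after each split
pieces <- pieces[pieces != ""]     # drop the empty piece after the final period

print(pieces)
```

Note that the pattern protects any two-letter word ending in "r", so a sentence that ends in "or." would not be split either.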
You can of course always create your own list of common titles and build the regular expression from that list:
titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)
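As a rough check of the title-list pattern on titles of different lengths, the sketch below wraps the same regular expression in a small helper and runs it with stringi. The helper name split_on_periods and the sample sentence are hypothetical, chosen only for illustration. They are not part of tidytext or of the original answer:

```r
library(stringi)

# Hypothetical helper (not a tidytext function): split text on periods
# that do not directly follow one of the listed titles.
split_on_periods <- function(text,
                             titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")) {
  pattern <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
  pieces <- stri_trim_both(stri_split_regex(text, pattern)[[1]])
  pieces[pieces != ""]  # drop the empty piece after the final period
}

sentences <- split_on_periods("Mrs. Bennet was delighted. Dr. Grant said nothing.")
print(sentences)
```

The match is case-sensitive and only covers the listed titles, so other abbreviations such as "e.g." would still be split at each period unless they are handled separately.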
The original question, "r - Tokenizing sentences with unnest_tokens(), ignoring abbreviations", is on Stack Overflow: https://stackoverflow.com/questions/47211643/