r - Tokenizing sentences with unnest_tokens(), ignoring abbreviations

I am using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

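and tokenize it into these two sentences:

"I am perfectly convinced by it that Mr. Darcy has no defect."
"He owns it himself without disguise."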

However, when I use tidytext's default sentence tokenizer, I get three sentences.

Code

library(tibble)
library(tidytext)

# tibble() replaces the deprecated data_frame()
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
                              Sentence
                                <chr>
1 i am perfectly convinced by it that mr.
2                    darcy has no defect.
3    he owns it himself without disguise.

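What is a simple way to tokenize sentences with tidytext without common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?

Best answer

You can use a regex as the splitting condition, but there is no guarantee that it will cover all common honorifics: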

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")

Result:

# A tibble: 2 x 1
                                                     Sentence
                                                        <chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2                         he owns it himself without disguise
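To make the behavior of that pattern easier to inspect, here is a minimal sketch that applies the same regex with stringi directly, outside of tidytext. Using stringi here is an assumption made for illustration, and the variable names are hypothetical. The negative lookbehind refuses to split on a period that follows a two-letter word ending in "r", so "Mr." stays inside its sentence. By the same logic, "Mrs." and "Ms." are not protected, and a sentence ending in a short word such as "or." would not be split.

```r
library(stringi)

text <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

# Split on every period that is NOT preceded by a two-letter word ending in "r"
pieces <- stri_split_regex(text, "(?<!\\b\\p{L}r)\\.", omit_empty = TRUE)[[1]]

# Remove the whitespace left over at the start of each piece
sentences <- stri_trim_both(pieces)
print(sentences)
```

Unlike unnest_tokens(), which lowercases its output by default (to_lower = TRUE), this sketch keeps the original capitalization.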

You can of course always create your own list of common titles and build a regex from that list:

titles =  c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)
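The list-based pattern can also be tried on its own before handing it to unnest_tokens(). The following is a rough sketch that assumes stringi is available; split_sentences() is a hypothetical helper, not part of tidytext or stringi. It uses a non-capturing group inside the lookbehind, which should behave the same as the capturing group above for splitting purposes.

```r
library(stringi)

# Hypothetical helper: split on periods unless one of the listed titles precedes them.
# Titles are pasted into the regex as-is, so they must not contain regex metacharacters.
split_sentences <- function(text, titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")) {
  pattern <- paste0("(?<!\\b(?:", paste(titles, collapse = "|"), "))\\.")
  pieces <- stri_split_regex(text, pattern, omit_empty = TRUE)[[1]]
  trimmed <- stri_trim_both(pieces)
  trimmed[trimmed != ""]  # drop pieces that were only whitespace
}

result <- split_sentences("Mrs. Bennet spoke to Dr. Jones. He left at once.")
print(result)
```

With this input the periods after "Mrs" and "Dr" should be left alone, giving two sentences. Other abbreviations (for example "Prof" or "St") would have to be added to the titles vector by hand.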

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/47211643/
