r - Why is the ngrams() function giving distinct bigrams?

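I am writing an R script that uses library(ngram).

Suppose I have the string

"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

and want to find its bigrams.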

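The ngram library gives me the following bigrams:

"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador"
"dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"

Since the sentence contains "dog food" twice, I want this bigram twice. But I get it only once!

Is there an option in the ngram library, or in any other library, that returns all the bigrams of my sentence in R?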

Best answer

The development version of ngram has a get.phrasetable method:

devtools::install_github("wrathematics/ngram")
library(ngram)

text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

ng <- ngram(text)
head(get.phrasetable(ng))
#            ngrams freq       prop
# 1    good qualiti    2 0.07692308
# 2        dog food    2 0.07692308
# 3 appreci product    1 0.03846154
# 4    process meat    1 0.03846154
# 5    food product    1 0.03846154
# 6     food bought    1 0.03846154
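
If the goal is one entry per occurrence rather than one row per distinct bigram, the phrase table can be expanded by its freq column. The following is a minimal sketch, not part of the ngram package: expand_phrasetable is a hypothetical helper, and the small data frame is a hand-typed copy of the first rows shown above, standing in for get.phrasetable(ng) so the example runs without the package installed.

```r
# Hypothetical helper (not provided by ngram): turn a phrase table with
# one row per distinct n-gram into a vector with one entry per occurrence.
expand_phrasetable <- function(pt) {
  # trimws() guards against any padding around the n-gram strings.
  rep(trimws(pt$ngrams), times = pt$freq)
}

# Assumption: a hand-typed stand-in for get.phrasetable(ng),
# using the columns `ngrams` and `freq` from the output above.
pt <- data.frame(
  ngrams = c("good qualiti", "dog food", "appreci product"),
  freq   = c(2L, 2L, 1L),
  stringsAsFactors = FALSE
)

all_bigrams <- expand_phrasetable(pt)
all_bigrams
```

Note that the expanded vector follows the row order of the table, not the order in which the bigrams appear in the sentence.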

In addition, you can use the print() method and specify output = "full". That is:

print(ng, output = "full")

# NOTE: more output not shown...
better labrador | 1 
finicki {1} | 

dog food | 2 
product {1} | bought {1} 
# NOTE: more output not shown...
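
For comparison, every bigram can also be listed in sentence order, repeats included, with base R alone. This is a sketch of an alternative approach rather than what the ngram package does internally; it assumes tokens are separated by whitespace and reuses the text string from the answer. The variable names are illustrative.

```r
# Base-R sketch: list every bigram in sentence order, keeping repeats.
text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

# Split on runs of whitespace to get the individual tokens.
words <- strsplit(trimws(text), "\\s+")[[1]]

# Pair each word with its successor; n words yield n - 1 bigrams.
bigrams <- paste(head(words, -1), tail(words, -1))

length(bigrams)             # total number of bigrams, repeats included
sum(bigrams == "dog food")  # number of occurrences of "dog food"

# Frequency table of distinct bigrams, most frequent first.
freq <- sort(table(bigrams), decreasing = TRUE)
head(freq)
```

Here bigrams keeps "dog food" once for each place it occurs, while freq collapses the repeats into counts, much like the phrase table.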

Regarding "r - Why is the ngrams() function giving distinct bigrams?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32850155/
