r - "Bag of characters" n-grams in R

Tags: r machine-learning nlp tokenize n-gram

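I would like to create a term-document matrix of character n-grams. For example, take the following sentence:

“In this paper, we focus on a different but simple text representation.”

The character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, and so on.

I have used the R/Weka package to build "bag of words" n-grams, but I am having trouble adapting tokenizers such as the one below to work on characters: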

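library(tm)     # TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

# Word-level bigram tokenizer
BigramTokenizer <- function(x) {
    NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# `corpus` is an existing tm corpus
tdm_bigram <- TermDocumentMatrix(corpus,
                                 control = list(tokenize = BigramTokenizer,
                                                wordLengths = c(2, Inf)))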

Any ideas on how to create character n-grams using R/Weka or another package?

Best answer

I found quanteda very useful for this:

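# Note: tokenize() with ngrams/conc comes from an older quanteda release;
# newer releases use tokens() and tokens_ngrams() instead.
library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
# Replace whitespace with "_" so word boundaries stay visible in the n-grams,
# then split into characters and join every 4 consecutive characters
toks <- tokenize(gsub("\\s", "_", txts), "character", ngrams=4L, conc="")
char_dfm <- dfm(toks)
# Transpose the document-feature matrix to get terms in rows, documents in columns
tdm <- as.TermDocumentMatrix(t(char_dfm), weighting=weightTf)
as.matrix(tdm)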
#       Docs
# Terms  text1 text2
#   In_t     1     1
#   n_th     1     1
#   _thi     1     2
#   this     1     2
#   his_     1     1
#   is_p     1     0
#   s_pa     1     0
#   _pap     1     0
#   pape     1     0
#   aper     1     0
#   per.     1     0
#   is_l     0     1
#   s_li     0     1
#   _lin     0     1
#   line     0     1
#   ines     0     1
#   nes_     0     1
#   es_t     0     1
#   s_th     0     1
#   his.     0     1
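
If installing quanteda is not an option, the same counts can be reproduced in base R. The following is only a minimal sketch and not part of the original answer: `char_ngrams` is a hypothetical helper name, and the result is a plain integer matrix rather than a tm `TermDocumentMatrix` object.

```r
# Hypothetical helper: character n-grams of a single string,
# with whitespace replaced by "_" as in the answer above.
char_ngrams <- function(text, n = 4L) {
  s <- gsub("\\s", "_", text)
  len <- nchar(s)
  if (len < n) return(character(0))   # string too short for any n-gram
  starts <- seq_len(len - n + 1L)     # start position of every n-gram
  substring(s, starts, starts + n - 1L)
}

txts <- c("In this paper.", "In this lines this.")
grams <- lapply(txts, char_ngrams, n = 4L)

# All distinct n-grams, in order of first appearance
terms <- unique(unlist(grams))

# Count each term per document: rows are terms, columns are documents
tdm <- vapply(
  grams,
  function(g) tabulate(match(g, terms), nbins = length(terms)),
  integer(length(terms))
)
dimnames(tdm) <- list(Terms = terms, Docs = paste0("text", seq_along(txts)))

tdm["_thi", ]   # text1 = 1, text2 = 2, matching the table above
```

The matrix is dense, so this is only reasonable for small corpora; for larger ones a sparse representation such as the one tm or quanteda provides is the better choice.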

A similar question about "Bag of characters" n-grams in R can be found on Stack Overflow: https://stackoverflow.com/questions/34581065/
