r - Extract character-level n-grams from text in R

Tags: r, nlp, character, n-gram

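I have a data frame containing text, and I want to extract the character-level bigrams (n = 2), e.g. "st", "ac", "ck", for each text in R.

I also want to count how often each character-level bigram occurs in the text.

Data: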

df$text

[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"

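Best answer

I'm not quite sure what your expected output is here. I would have thought that the bigrams of "stack" would be "st", "ta", "ac" and "ck", since that captures every consecutive pair.

For example, if you wanted to know how many times the bigram "th" occurs in the word "brothers" and you split it into the bigrams "br", "ot", "he" and "rs", you would get the answer 0, which is wrong.

You can build a function that gets all of the bigrams, like this: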
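To make the difference concrete, here is a minimal sketch comparing the two ways of splitting "brothers". The variable names (`overlapping`, `non_overlapping`) are only illustrative and not part of the original answer.

```r
word <- "brothers"
chars <- strsplit(word, "")[[1]]

# Overlapping bigrams: every consecutive pair of characters.
overlapping <- paste0(chars[-length(chars)], chars[-1])

# Non-overlapping bigrams: chop the word into chunks of two characters.
starts <- seq(1, nchar(word), by = 2)
non_overlapping <- substring(word, starts, starts + 1)

print(overlapping)                    # "br" "ro" "ot" "th" "he" "er" "rs"
print(non_overlapping)                # "br" "ot" "he" "rs"
print(sum(overlapping == "th"))       # 1
print(sum(non_overlapping == "th"))   # 0
```

Only the overlapping split finds the "th" that sits across a chunk boundary.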

# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes 
# "st", "ta", "ac", and "ck"

pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}

# This function splits a single word into a character vector and gets its bigrams

word_bigrams <- function(words){
  unlist(lapply(strsplit(words, ""), pair_chars))
}

# This function splits a string or vector of strings into words and gets their bigrams

string_bigrams <- function(strings){
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}

So now we can test this on your example:

df <- data.frame(text = c("hy my name is", "stackover flow is great", 
                          "how are you"), stringsAsFactors = FALSE)

string_bigrams(df$text)
#>  [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"
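
Since the question title asks about n-grams in general, the same idea can be extended to any n. The following is a sketch under my own assumptions: the function name `char_ngrams_base` and its `n` argument are hypothetical, and words shorter than n are simply dropped.

```r
# Hypothetical generalisation: character n-grams within each word.
char_ngrams_base <- function(strings, n = 2) {
  # Split every string on spaces and pool the words.
  words <- unlist(strsplit(strings, " ", fixed = TRUE))
  # Words shorter than n cannot contain an n-gram.
  words <- words[nchar(words) >= n]
  unlist(lapply(words, function(w) {
    # Slide a window of width n over the word.
    starts <- seq_len(nchar(w) - n + 1)
    substring(w, starts, starts + n - 1)
  }))
}

print(char_ngrams_base("stack flow", n = 3))
#> [1] "sta" "tac" "ack" "flo" "low"
```

With `n = 2` this should give the same bigrams as `string_bigrams()`.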

If you want to count the occurrences, you can use table:

table(string_bigrams(df$text))

#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo 
#>  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1  1  1  1  2  2  1  1  1  1 
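
The table above pools all the texts together. If the counts are wanted separately for each text, as the question seems to ask, one possible approach is to apply the same logic row by row. This is only a sketch; `text_bigrams` and `per_text_counts` are names I made up for illustration.

```r
df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)

# Bigrams of a single text: split into words, then pair consecutive characters.
text_bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  unlist(lapply(strsplit(words, ""), function(ch) {
    if (length(ch) < 2) return(character(0))  # one-letter words have no bigram
    paste0(ch[-length(ch)], ch[-1])
  }))
}

# One frequency table per text, named by the text it came from.
per_text_counts <- lapply(df$text, function(x) table(text_bigrams(x)))
names(per_text_counts) <- df$text

print(per_text_counts[["how are you"]])
```

Each list element is an ordinary table, so something like `per_text_counts[[2]][["is"]]` looks up one bigram in one text.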

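However, if you are going to do a lot of text mining, you should look at dedicated R packages such as stringi, stringr, tm and quanteda, which help with the basic tasks.

For example, the quanteda package can take over much of the work of the base R functions I wrote above, as follows. Note that in this output pairs are also formed across word and text boundaries (e.g. "ym", "ss", "th"), unlike with the functions above: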

library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#>  [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck" 
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"
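
To see where pairs such as "ym" or "ss" in this output come from, the following base R sketch produces the same kind of result by deleting the spaces, pooling the characters of all texts into one vector and pairing neighbours. It is my own illustration of the behaviour visible in the output, not a description of how quanteda works internally.

```r
texts <- c("hy my name is", "stackover flow is great", "how are you")

# Drop spaces, then pool the characters of all texts into one long vector.
chars <- unlist(strsplit(gsub(" ", "", texts, fixed = TRUE), ""))

# Pair every character with its successor, ignoring word and text boundaries.
cross_boundary <- paste0(chars[-length(chars)], chars[-1])

print(length(cross_boundary))  # 38
print(head(cross_boundary, 4)) # "hy" "ym" "my" "yn"
```

If bigrams should stay inside words, the word-by-word functions earlier in this answer are the safer choice.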

Created on 2020-06-13 by the reprex package (v0.3.0)

Regarding r - Extract character-level n-grams from text in R, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62359235/
