r - How to do fuzzy pattern matching with quanteda and kwic?

Tags: r text-mining quanteda

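I have texts written by doctors, and I want to be able to highlight specific words in their context (the 5 words before and the 5 words after the word I search for in their text). Say I want to search for the word "suicidal". I would then use the kwic function from the quanteda package: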

kwic(dataset, pattern = "suicidal", window = 5)

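So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow up to three differing characters, with no restriction on where in the word they occur.

Is it possible to do this with quanteda's kwic function?

Example: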

dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office", 
                                  "On his first appointment, the patient was suicidaa when he showed up in my office",
                                  "On his first appointment, the patient was suiciaaa when he showed up in my office",
                                  "On his first appointment, the patient was suicaaal when he showed up in my office",
                                  "On his first appointment, the patient was suiaaaal when he showed up in my office",
                                  "On his first appointment, the patient was saacidal when he showed up in my office",
                                  "On his first appointment, the patient was suaaadal when he showed up in my office",
                                  "On his first appointment, the patient was icidal when he showed up in my office",
                                  "On his first appointment, the patient was uicida when he showed up in my office"))

dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)

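This only returns the first sentence, the one that is spelled correctly.

Best answer

Great question. We don't have approximate matching as a "valuetype", but that's an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches with base::agrep() and then matching on those. That would look like this: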

library("quanteda")
## Package version: 1.5.2

dataset <- data.frame(
  "patient" = 1:9, "text" = c(
    "On his first appointment, the patient was suicidal when he showed up in my office",
    "On his first appointment, the patient was suicidaa when he showed up in my office",
    "On his first appointment, the patient was suiciaaa when he showed up in my office",
    "On his first appointment, the patient was suicaaal when he showed up in my office",
    "On his first appointment, the patient was suiaaaal when he showed up in my office",
    "On his first appointment, the patient was saacidal when he showed up in my office",
    "On his first appointment, the patient was suaaadal when he showed up in my office",
    "On his first appointment, the patient was icidal when he showed up in my office",
    "On his first appointment, the patient was uicida when he showed up in my office"
  ),
  stringsAsFactors = FALSE
)
corp <- corpus(dataset)

# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  types()

Use agrep() to generate the closest fuzzy matches. Here I ran it a few times, increasing max.distance a little each time from its default of 0.1.

# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal"   "uicida"
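The question asks for up to three differing characters, so it may help to see how max.distance relates to that. As I read the agrep() documentation, a value below 1 is treated as a fraction of the pattern length (times the maximal transformation cost) and rounded up, so 0.3 on the 8-character pattern should correspond to 3 edits, and an integer can be passed directly instead. The sketch below is base R only; the hard-coded vocab is a stand-in I typed from the example sentences, not the output of types().

```r
# Stand-in vocabulary, copied by hand from the example sentences
# (assumption: equivalent to types(tokens(corp)) in the answer).
vocab <- c("On", "his", "first", "appointment", "the", "patient", "was",
           "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
           "saacidal", "suaaadal", "icidal", "uicida",
           "when", "he", "showed", "up", "in", "my", "office")

pattern <- "suicidal"

# Fractional bound: 0.3 * 8 characters = 2.4, rounded up to 3 edits
frac_bound <- ceiling(0.3 * nchar(pattern))

# Same search expressed as a fraction and as an integer edit budget
by_fraction <- agrep(pattern, vocab, max.distance = 0.3,
                     ignore.case = TRUE, value = TRUE)
by_integer <- agrep(pattern, vocab, max.distance = 3,
                    ignore.case = TRUE, value = TRUE)

# Check that both calls select the same vocabulary entries
identical(by_fraction, by_integer)
```

Passing the integer makes the "three characters" requirement explicit and keeps the edit budget from changing silently if the search word gets longer or shorter.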

Then use these as the pattern argument to kwic():

# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##                                                        
##  [text1, 9] the patient was | suicidal | when he showed
##  [text2, 9] the patient was | suicidaa | when he showed
##  [text3, 9] the patient was | suiciaaa | when he showed
##  [text4, 9] the patient was | suicaaal | when he showed
##  [text5, 9] the patient was | suiaaaal | when he showed
##  [text6, 9] the patient was | saacidal | when he showed
##  [text7, 9] the patient was | suaaadal | when he showed
##  [text8, 9] the patient was |  icidal  | when he showed
##  [text9, 9] the patient was |  uicida  | when he showed

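There are other possibilities based on similar solutions, such as the fuzzyjoin or stringdist packages, but this is a simple solution, using only the base package, that should work pretty well.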
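If a strict whole-word edit distance is wanted, one further option is utils::adist(), which ships with R. This is a minimal sketch of my own, not part of the original answer: it computes the Levenshtein distance between the target and every vocabulary entry and keeps those within 3 edits. The vocab vector is again a hand-typed stand-in for types(tokens(corp)). Unlike agrep(), which looks for an approximate match anywhere inside each string, adist() compares whole strings by default.

```r
# Stand-in vocabulary, copied by hand from the example sentences
# (assumption: equivalent to types(tokens(corp)) in the answer).
vocab <- c("On", "his", "first", "appointment", "the", "patient", "was",
           "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
           "saacidal", "suaaadal", "icidal", "uicida",
           "when", "he", "showed", "up", "in", "my", "office")

# adist() returns a 1 x length(vocab) matrix of Levenshtein distances;
# drop() turns it into a plain vector.
d <- drop(utils::adist("suicidal", vocab, ignore.case = TRUE))

# Keep only words within three insertions, deletions or substitutions
near_matches <- vocab[d <= 3]
near_matches
```

The resulting near_matches vector can be passed to kwic() as the pattern argument in the same way as the agrep() result.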

For "r - How to do fuzzy pattern matching with quanteda and kwic?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59722865/
