Replacing quanteda tokens through regex

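I would like to explicitly replace specific tokens in an object of class tokens from the quanteda package. I have not managed to replicate the standard approach that works with stringr.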

The goal is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").

Please see the minimal example below:

suppressPackageStartupMessages(library(quanteda))
library(stringr)

text = "It was a beautiful day down to the coastof California."

# I would solve this with stringr as follows: 
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."

# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )

# I want to replace "coastof" with the two tokens "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "It"         "was"        "a"          "beautiful"  "day"       
#>  [6] "down"       "to"         "the"        "\\1 \\2"    "California"
#> [11] "."

Is there a workaround?

Created on 2021-03-16 by the reprex package (v1.0.0)

Best answer

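You can combine quanteda and stringr: build a list of the tokens that need splitting together with their split forms, then use tokens_replace() to perform the replacement. The advantage is that you can curate the list before applying it, so you can check that the pattern has not caught tokens you would not want to split.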

suppressPackageStartupMessages(library("quanteda"))

toks <- tokens("It was a beautiful day down to the coastof California.")

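# tokens that match the pattern, i.e. the ones to be split
keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
# their split forms; the nested call avoids relying on %>% being attached
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")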

keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"

tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
##  [1] "It"         "was"        "a"          "beautiful"  "day"       
##  [6] "down"       "to"         "the"        "coast"      "of"        
## [11] "California" "."

Created on 2021-03-16 by the reprex package (v1.0.0)
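
As a side note on curating the list: the pattern "(^.*?)(of)" matches any token that contains "of" anywhere, so it also picks up tokens such as "of" or "office". Below is a minimal sketch of a stricter variant, assuming you only want tokens that end in "of" and have something in front of it. The helper split_suffix() is a hypothetical name introduced here, not a quanteda function, and it uses base R's grep(), sub() and strsplit() in place of stringr. The quanteda part is guarded by requireNamespace() and assumes types() and tokens_replace() behave as in the answer above.

```r
# Sketch only: split_suffix() is a hypothetical helper, not part of quanteda.
split_suffix <- function(types, suffix = "of") {
  # Anchor at both ends: at least one character, then the suffix, then end.
  pattern <- paste0("^(.+)(", suffix, ")$")
  # Candidate tokens to split (deduplicated).
  keys <- unique(grep(pattern, types, value = TRUE))
  # Put a space between stem and suffix, then split on that space.
  vals <- strsplit(sub(pattern, "\\1 \\2", keys), " ", fixed = TRUE)
  list(keys = keys, vals = vals)
}

# Base R demonstration on a plain character vector.
map <- split_suffix(c("coastof", "of", "office", "California", "proof"))
map$keys  # "coastof" "proof"
map$vals  # list of c("coast", "of") and c("pro", "of")

# Assumed usage with quanteda, run only if the package is installed.
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- quanteda::tokens("It was a beautiful day down to the coastof California.")
  map <- split_suffix(quanteda::types(toks))
  # Inspect and prune map$keys / map$vals here before replacing.
  print(quanteda::tokens_replace(toks, map$keys, map$vals, valuetype = "fixed"))
}
```

Note that "proof" is still matched in the demonstration vector, which is exactly the kind of entry you would drop from keys and vals before calling tokens_replace().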

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/66651356/
