r - Fast way to do string matching and replacement from another data frame in R

Tags: r stringi

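I have two data frames that look like the ones below (although the first is over 90 million rows long and the second is a little over 14 million rows). In addition, the second data frame is in random order.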

df1 <- data.frame(
  datalist = c("wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
               "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
               "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
               "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
               "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
               "the first known use of this word was in 1539"),
  words = c("anarchist_schools_of_thought  individualism  collectivism", "social_anarchism  individualist_anarchism",
            "anarchy  -ism", "privative  privative_alpha", "infinitive", ""),

  stringsAsFactors=FALSE)

df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism", "collectivism", "1965-66_nhl_season_by_team", "social_anarchism", "individualist_anarchism",
                "anarchy", "-ism", "privative", "privative_alpha", "1310_the_ticket", "infinitive"),
  token = c("Anarchist_schools_of_thought", "Individualism", "Collectivism", "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism", "Anarchy",
            "-ism", "Privative", "Alpha_privative", "KTCK_(AM)", "Infinitive"),
  stringsAsFactors = FALSE)

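I was able to extract all the words that follow the prefix "wiki/" into another column (words). Each of these words needs to be replaced with the value of the token column whose vocabword in the second data frame matches it. For example, I would take the word "anarchist_schools_of_thought" that follows wiki/ in the first row of the first data frame, find "anarchist_schools_of_thought" under vocabword in the second data frame, and replace it with the corresponding token, "Anarchist_schools_of_thought".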

So it should end up looking like this:

1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein
6 the first known use of this word was in 1539

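I realize that many of these just capitalize the first letter of the word, but some of them are significantly different. I could write a for loop, but I think that would take far too long, and I would prefer to do this the data.table way or possibly with stringi or stringr. Normally I would just do a merge, but having several words to replace in a single row complicates things.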

Thanks in advance for any help.

Best answer

You can do this with str_replace_all from stringr:

library(stringr)

# names are the strings to find (vocabword), values are their replacements (token)
str_replace_all(df1$datalist, setNames(df2$token, df2$vocabword))

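Basically, str_replace_all lets you supply a named vector in which the original strings are the names and the replacements are the elements of the vector. You have already done all the hard work by building a "dictionary" of strings and replacements; str_replace_all simply takes it and performs the replacements automatically.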
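To make the named-vector form concrete, here is a minimal sketch on toy input. The strings and the name dict are made up for illustration. The names are treated as regular-expression patterns and are applied one after another, so an earlier, shorter pattern can change text that a later pattern was meant to match.

```r
library(stringr)

# Hypothetical toy dictionary: names are the patterns, values are the replacements.
dict <- c("privative" = "Privative", "privative_alpha" = "Alpha_privative")

# Each pattern is applied in turn to the whole string.
short_link <- str_replace_all("wiki/privative", dict)
long_link  <- str_replace_all("wiki/privative_alpha", dict)

print(short_link)  # "wiki/Privative"
print(long_link)   # "wiki/Privative_alpha": "privative" was replaced first
```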

Result:

[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"              
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"                               
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Privative_alpha an- i.e"                                              
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"                                       
[6] "the first known use of this word was in 1539"
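Note that row 4 of this output differs from what the question asked for. The standalone word "anarchy" was capitalized even though it has no wiki/ prefix, and wiki/privative_alpha became wiki/Privative_alpha rather than wiki/Alpha_privative. This happens because every pattern is matched anywhere in the string and the patterns are applied in sequence. One possible way around it, sketched below in base R, is to extract only the words that follow "wiki/", look them all up with a single match() call, and write the results back. The sketch assumes a link is "wiki/" followed by a run of non-space characters; the helper name replace_wiki_links and the toy data are hypothetical, and nothing here has been benchmarked on data of the size described in the question.

```r
# Hypothetical helper: replace the word after each "wiki/" with its token.
replace_wiki_links <- function(text, vocabword, token) {
  # Locate every run of non-space characters that directly follows "wiki/".
  m <- gregexpr("(?<=wiki/)\\S+", text, perl = TRUE)
  found <- regmatches(text, m)

  # Nothing to do if no row contains a link.
  keys <- unlist(found)
  if (length(keys) == 0L) return(text)

  # Look up all extracted words at once; unknown words are left unchanged.
  idx <- match(keys, vocabword)
  new <- keys
  new[!is.na(idx)] <- token[idx[!is.na(idx)]]

  # Put the replacements back in their original positions.
  regmatches(text, m) <- relist(new, found)
  text
}

# Toy data (a shortened version of the rows in the question).
text <- c("anarchy from anarchos from the wiki/privative prefix wiki/privative_alpha an- i.e",
          "the first known use of this word was in 1539")
vocab <- c("anarchy", "privative", "privative_alpha")
tok   <- c("Anarchy", "Privative", "Alpha_privative")

result <- replace_wiki_links(text, vocab, tok)
print(result)
```

With the data from the question, the call would be replace_wiki_links(df1$datalist, df2$vocabword, df2$token). Because the lookup is an exact match on whole words, the order of df2 does not matter and patterns cannot interfere with each other.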

Regarding "r - Fast way to do string matching and replacement from another data frame in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50258443/
