r - How to identify repeated words, and the position and number of their repetitions in a sentence

Tags: r duplicates

I have a dataset of sentences that contain consecutive word repetitions:

Data:

df <- data.frame(
  Turn = c("oh is that that steak i got the other night",       # that that
           "no no no i 'm dave and you 're alan",               # no no no
           "yeah i mean the the film was quite long though",    # the the
           "it had steve martin in it it 's a comedy"))         # it it

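Goal:

What I would like to obtain is three more columns added to this data frame:

  • df$rep_Word : a column naming the repeated word
  • df$rep_Pos : a column giving the first position at which the word is repeated in the sentence
  • df$rep_Numb : a column giving the number of times the word is repeated

So the expected data frame looks like this:

Expected result: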

df
                                            Turn rep_Word rep_Pos rep_Numb
1    oh is that that steak i got the other night     that       4        1
2            no no no i 'm dave and you 're alan       no       2        2
3 yeah i mean the the film was quite long though      the       5        1
4       it had steve martin in it it 's a comedy       it       7        1

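Solution attempted so far:

My hunch is that the information on the repeated words, their positions, and the number of repetitions can be obtained with strsplit and the function duplicated, for example like this: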

df_split <- apply(df, 2, function(x) strsplit(x, "\\s"))

df_split
$Turn
$Turn[[1]]
 [1] "oh"    "is"    "that"  "that"  "steak" "i"     "got"   "the"   "other" "night"
$Turn[[2]]
 [1] "no"   "no"   "no"   "i"    "'m"   "dave" "and"  "you"  "'re"  "alan"
$Turn[[3]]
 [1] "yeah"   "i"      "mean"   "the"    "the"    "film"   "was"    "quite"  "long"   "though"
$Turn[[4]]
 [1] "it"     "had"    "steve"  "martin" "in"     "it"     "it"     "'s"     "a"      "comedy"

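For example, for the first sentence in df, duplicated shows which word is repeated (that is, the word for which duplicated evaluates to TRUE), and the number and position of the repetitions can be read off from that as well: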

duplicated(df_split$Turn[[1]])
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
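
One caveat worth checking, shown here as a small sketch rather than a definitive solution: duplicated flags any word that occurred anywhere earlier in the sentence, not only immediate repeats. The fourth sentence from the data illustrates this; the names dup_any and dup_consecutive are illustrative only.

```r
# Fourth sentence from the question, split on single spaces
words <- strsplit("it had steve martin in it it 's a comedy", " ")[[1]]

# duplicated() flags every word already seen anywhere earlier in the vector
dup_any <- duplicated(words)
which(dup_any)          # 6 7: position 6 is flagged although "in it" is not a repeat

# Comparing each word with its immediate predecessor keeps only consecutive repeats
dup_consecutive <- c(FALSE, words[-1] == words[-length(words)])
which(dup_consecutive)  # 7
```

So for consecutive repetitions, duplicated on the raw words can over-report, and some notion of "runs" of identical neighbours is needed.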

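The problem is that I don't know how to work with duplicated in such a way as to obtain the desired additional columns in df. Any help with this would be much appreciated.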

Best answer

Here is another way to solve the problem.

df <- data.frame(
  Turn = c("oh is that that steak i got the other night",  # that that
           "no no no i 'm dave and you 're alan",               # no no no
           "yeah i mean the the film was quite long though",    # the the
           "it had steve martin in it it 's a comedy",          # it it
           "it had steve martin in in it it 's a comedy",
           "yeah i mean the film was quite long though", 
           "hi hi then other words and hi hi again",
           "no no no i 'm dave yes yes and you 're alan no no no no"))  # no no no and no no no no

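library(data.table)
cols <- c("rep_Word", "rep_Pos", "rep_Numb")
# Process one row per group; each new column is a list column
setDT(df)[, (cols) := {
  words <- strsplit(as.character(Turn), " ")[[1]]
  idx <- rleid(words)                        # run id for consecutive identical words
  check <- duplicated(idx)                   # TRUE for every repeat within a run
  chg <- check - shift(check, fill = FALSE)  # 1 where a stretch of repeats starts, -1 just after it ends
  starts <- which(chg == 1)                  # position of the first repeat in each stretch
  # a stretch that reaches the end of the sentence has no -1, so close it manually
  aend <- if(sum(chg) == 0L) which(chg == -1) else c(which(chg == -1), length(chg) + 1L)
  freq <- aend - starts                      # number of repeats in each stretch
  wrd <- words[starts]
  no_dup_default <- .(.(NA_character_), .(NA_integer_), .(NA_integer_))
  if(length(wrd)) .(.(wrd), .(starts), .(freq)) else no_dup_default
}, seq.int(nrow(df))]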
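
To make the intermediate vectors easier to follow, the same steps can be traced on a single sentence. This is only an illustrative sketch that assumes the data.table package is installed; the variable ends is a name introduced here for the trace and does not appear in the answer's code.

```r
library(data.table)

# Second sentence from the question
words <- strsplit("no no no i 'm dave and you 're alan", " ")[[1]]

idx   <- rleid(words)                        # 1 1 1 2 3 4 5 6 7 8
check <- duplicated(idx)                     # TRUE at positions 2 and 3 only
chg   <- check - shift(check, fill = FALSE)  # 1 at position 2, -1 at position 4, 0 elsewhere

starts <- which(chg == 1)                    # 2: where the repetition begins
ends   <- which(chg == -1)                   # 4: first position after the repetition
ends - starts                                # 2: number of repeats
words[starts]                                # "no": the repeated word
```

Here the stretch of repeats closes before the sentence ends, so sum(chg) is 0 and no manual end position is needed.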


df
#                                                       Turn   rep_Word  rep_Pos rep_Numb
# 1:             oh is that that steak i got the other night       that        4        1
# 2:                     no no no i 'm dave and you 're alan         no        2        2
# 3:          yeah i mean the the film was quite long though        the        5        1
# 4:                it had steve martin in it it 's a comedy         it        7        1
# 5:             it had steve martin in in it it 's a comedy      in,it      6,8      1,1
# 6:              yeah i mean the film was quite long though         NA       NA       NA
# 7:                  hi hi then other words and hi hi again      hi,hi      2,8      1,1
# 8: no no no i 'm dave yes yes and you 're alan no no no no  no,yes,no  2, 8,14    2,1,3

# or, with one row per repeated word:
df[, lapply(.SD, unlist), seq.int(nrow(df))][, -1]
#                                                        Turn rep_Word rep_Pos rep_Numb
#  1:             oh is that that steak i got the other night     that       4        1
#  2:                     no no no i 'm dave and you 're alan       no       2        2
#  3:          yeah i mean the the film was quite long though      the       5        1
#  4:                it had steve martin in it it 's a comedy       it       7        1
#  5:             it had steve martin in in it it 's a comedy       in       6        1
#  6:             it had steve martin in in it it 's a comedy       it       8        1
#  7:              yeah i mean the film was quite long though     <NA>      NA       NA
#  8:                  hi hi then other words and hi hi again       hi       2        1
#  9:                  hi hi then other words and hi hi again       hi       8        1
# 10: no no no i 'm dave yes yes and you 're alan no no no no       no       2        2
# 11: no no no i 'm dave yes yes and you 're alan no no no no      yes       8        1
# 12: no no no i 'm dave yes yes and you 're alan no no no no       no      14        3
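
For comparison, a similar result can probably be obtained in base R with rle, which encodes runs of identical neighbours directly. This is a minimal sketch, not part of the answer above; find_repeats is a hypothetical helper name, and it assumes words are separated by single spaces as in the example data.

```r
# Hypothetical helper: returns one row per stretch of consecutive repeats
find_repeats <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  runs  <- rle(words)                                     # runs of identical neighbouring words
  run_start <- cumsum(runs$lengths) - runs$lengths + 1L   # position of the first word of each run
  is_rep <- runs$lengths > 1L                             # runs that contain at least one repeat
  if (!any(is_rep)) {
    # No consecutive repeats: return a single NA row
    return(data.frame(rep_Word = NA_character_, rep_Pos = NA_integer_,
                      rep_Numb = NA_integer_, stringsAsFactors = FALSE))
  }
  data.frame(
    rep_Word = runs$values[is_rep],
    rep_Pos  = run_start[is_rep] + 1L,      # position of the first repeat (second occurrence)
    rep_Numb = runs$lengths[is_rep] - 1L,   # repeats beyond the first occurrence
    stringsAsFactors = FALSE
  )
}

find_repeats("no no no i 'm dave yes yes and you 're alan no no no no")
#   rep_Word rep_Pos rep_Numb
# 1       no       2        2
# 2      yes       8        1
# 3       no      14        3
```

Applying the helper to each element of df$Turn (for example with lapply) would give one small data frame per sentence, which could then be bound together or stored as list columns.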

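Regarding "r - How to identify repeated words, and the position and number of their repetitions in a sentence", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60463993/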
