r - How to compare multiple string lengths between columns using R data.table?

I have the following R data.table dt, which consists of a few numeric columns and two string columns.

library(data.table)

dt = data.table(
      numericvals = rep(25, 8),
      numeric = rep(42, 8),
      first = c("beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property"),
      second = rep(c("abcde"), 8)
  )

print(dt)
   numericvals numeric                   first second
1:          25      42        beneficiary, duke abcde
2:          25      42                  compose abcde
3:          25      42             herd primary abcde
4:          25      42                    stall abcde
5:          25      42                     deep abcde
6:          25      42 regular summary classify abcde
7:          25      42                   timber abcde
8:          25      42                 property abcde

The column first contains one or more strings. If there is more than one, they are separated by a space or a comma.
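As a small illustration of that separator rule, the sketch below splits a few of the values from first into individual words using base R only. The regular expression "[, ]+" is an assumption of this sketch (it treats any run of commas and spaces as one separator), and the variable names words and word_lengths are made up for the example.

```r
# A few values taken from the `first` column in the question.
first <- c("beneficiary, duke", "compose", "regular summary classify")

# Split on one or more commas/spaces, so ", " counts as a single separator
# and no empty token appears between "beneficiary" and "duke".
words <- strsplit(first, "[, ]+")

# nchar() is vectorised, so each list element becomes a vector of word lengths.
word_lengths <- lapply(words, nchar)

print(word_lengths)
```

Each element of word_lengths corresponds to one row of first, so a row with three words yields a vector of three lengths.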

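My goal is to create a column that records the lengths of the strings in first that are longer or shorter (as measured by nchar()) than the string in second. If they are the same length, that case should be ignored.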

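If the column contained only one string per row, this analysis would be easy for me. I would create a new column called longer and record the length of the string in first if it is longer, i.e.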

dt[, longer:=ifelse(nchar(first) > nchar(second), nchar(first), 0)]

Similarly for shorter:

dt[, shorter:=ifelse(nchar(first) < nchar(second), nchar(first), 0)]
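To see why this single-string approach breaks down on rows with several words, here is a minimal base R sketch. It applies nchar() to the whole string, which also counts the comma and space separators. The names whole and longer_naive are made up for this example.

```r
# Three values from the question: two multi-word rows and one single word.
first  <- c("beneficiary, duke", "herd primary", "stall")
second <- rep("abcde", 3)

# nchar() on the unsplit string counts separators too:
# "beneficiary, duke" is 11 + 2 separators + 4 characters long.
whole <- nchar(first)

# The single-string logic from the question, applied to the unsplit strings.
longer_naive <- ifelse(whole > nchar(second), whole, 0)

print(whole)
print(longer_naive)
```

The first row reports 17 instead of the desired 11, which is why the strings need to be split into words before their lengths are compared.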

I don't know how to handle multiple strings in first, especially when there are 3 strings.

The analysis should look like this:

   numericvals numeric                   first second  longer  shorter
1:          25      42        beneficiary, duke abcde  11       4
2:          25      42                  compose abcde  7        0
3:          25      42             herd primary abcde  7        4
4:          25      42                    stall abcde  0        0
5:          25      42                     deep abcde  0        4
6:          25      42 regular summary classify abcde  7, 7, 8  0
7:          25      42                   timber abcde  6        0
8:          25      42                 property abcde  8        0

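Storing comma-separated values in the data.table for rows with multiple longer/shorter entries could be cumbersome. The following format would be easier to work with, so my desired final result is: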

   numericvals numeric                   first second  longer  shorter
1:          25      42        beneficiary, duke abcde  11      4
2:          25      42                  compose abcde  7       0
3:          25      42             herd primary abcde  7       4
4:          25      42                    stall abcde  0       0
5:          25      42                     deep abcde  0       4
6:          25      42 regular summary classify abcde  7       0
6:          25      42 regular summary classify abcde  7       0
6:          25      42 regular summary classify abcde  8       0
7:          25      42                   timber abcde  6       0
8:          25      42                 property abcde  8       0

How can I compare multiple strings in a data.table, creating new rows for multiple entries?

(I'm using R data.table, but I'm happy to use a data.frame as well.)

Edit: Based on the comments below, I realized that the second table is wrong. Or at least, the values should only be counted once.

Best answer

Using base functions, but wrapped inside data.table.

For the first output in the OP:

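dt[, do.call(rbind, mapply(function(x, snd) {
        # word lengths, dropping the empty tokens produced by ", "
        lens <- nchar(x[x!=""])
        # lengths strictly greater than the reference length; 0 if none
        longer <- lens[lens > snd]
        if (length(longer) == 0L) longer <- 0L
        # lengths strictly smaller than the reference length; 0 if none
        shorter <- lens[lens < snd]
        if (length(shorter) == 0L) shorter <- 0L

        # wrap each vector in a list so one row can hold several values
        list(list(longer), list(shorter))
    }, strsplit(first, ",| "), nchar(second), SIMPLIFY=FALSE)), by=names(dt)]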

For the second output in the OP:

dt[, do.call(rbind, mapply(function(x, snd) {
    lens <- nchar(x[x!=""])
    longer <- lens[lens > snd]
    if (length(longer) == 0L) longer <- 0L
    shorter <- lens[lens < snd]
    if (length(shorter) == 0L) shorter <- 0L

    #pad to equal length
    if (length(longer) > length(shorter)) {
        shorter <- c(shorter, rep(0L, length(longer) - length(shorter)))
    } 
    if (length(longer) < length(shorter)) {
        longer <- c(longer, rep(0L, length(shorter) - length(longer)))
    }

    #second kind of output
    data.frame(longer, shorter)
}, strsplit(first, ",| "), nchar(second), SIMPLIFY=FALSE)), by=names(dt)]

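Explanation: first split each string into words with strsplit(first, ",| "), then apply the OP's requirement by checking whether each word's length is greater or smaller than the length of the string in the reference column. Finally, row-bind the results into a data.frame and return it.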
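The steps in the explanation can be traced on a single row with base R alone. The helper below, compare_lengths, is a hypothetical name introduced for this sketch and is not part of the answer. It is meant to mirror the split, filter, compare, and pad steps of the second code block, under the assumption that first and second are each a single string.

```r
# Hypothetical helper: the per-row logic of the answer, for one pair of strings.
compare_lengths <- function(first, second) {
  # Split on comma or space; ", " leaves an empty token, which is dropped.
  words <- strsplit(first, ",| ")[[1]]
  lens  <- nchar(words[words != ""])
  ref   <- nchar(second)

  # Keep only lengths strictly longer / strictly shorter than the reference.
  longer  <- lens[lens > ref]
  shorter <- lens[lens < ref]

  # Pad both vectors with zeros to a common length of at least one,
  # so that they can sit side by side in a data.frame.
  n <- max(length(longer), length(shorter), 1L)
  longer  <- c(longer,  rep(0L, n - length(longer)))
  shorter <- c(shorter, rep(0L, n - length(shorter)))

  data.frame(longer, shorter)
}

r1 <- compare_lengths("beneficiary, duke", "abcde")
r4 <- compare_lengths("stall", "abcde")
r6 <- compare_lengths("regular summary classify", "abcde")

print(r1)
print(r4)
print(r6)
```

A word whose length equals the reference (as in "stall") lands in neither vector, so that row is reported as a single row of zeros, while a row with three longer words expands to three rows.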

A similar question to "r - How to compare multiple string lengths between columns using R data.table?" can be found on Stack Overflow: https://stackoverflow.com/questions/49887124/
