sql - How to find synchronized IDs based on timestamp and value

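I am trying to find synchronized data entries, meaning entries that share a specific value ("ref") over a certain number of timestamps.

Dummy data: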

library(data.table)

dft <- data.table(
  id = rep(1:5, each=5),
  time = rep(1:5, 5),
  ref = c(10,11,11,11,11,
          10,11,11,11,21,
          20,31,31,31,31,
          20,41,41,41,41,
          20,51,51,51,51)
)

setorder(dft, time)
dft[, time := as.POSIXct(time, origin = "2018-10-14")]
dft

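In this example, IDs 1 and 2 are synchronized over 4 timestamps, in rows 1, 2, 6, 7, 11, 12, 16 and 17, because they share the same ref value (rows marked with !). Note: they share the same ref value within one timestamp and may share a different ref value in the next timestamp.

How can I solve this? I would also like to define the number of timestamps over which the values must be identical. If I require at least 5 synchronized timestamps, the example should return no IDs. With 4 or fewer, IDs 1 and 2 should be reported as synchronized data entries.

I have to run this on several million rows, so I would prefer a data.table or dplyr solution, or any other high-performance solution (SQL would be fine too).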

    id                time ref
 1:  1 2018-10-14 02:00:01  10    !
 2:  2 2018-10-14 02:00:01  10    !
 3:  3 2018-10-14 02:00:01  20
 4:  4 2018-10-14 02:00:01  20
 5:  5 2018-10-14 02:00:01  20
 6:  1 2018-10-14 02:00:02  11    !
 7:  2 2018-10-14 02:00:02  11    !
 8:  3 2018-10-14 02:00:02  31
 9:  4 2018-10-14 02:00:02  41
10:  5 2018-10-14 02:00:02  51
11:  1 2018-10-14 02:00:03  11    !
12:  2 2018-10-14 02:00:03  11    !
13:  3 2018-10-14 02:00:03  31
14:  4 2018-10-14 02:00:03  41
15:  5 2018-10-14 02:00:03  51
16:  1 2018-10-14 02:00:04  11    !
17:  2 2018-10-14 02:00:04  11    !
18:  3 2018-10-14 02:00:04  31
19:  4 2018-10-14 02:00:04  41
20:  5 2018-10-14 02:00:04  51
21:  1 2018-10-14 02:00:05  11
22:  2 2018-10-14 02:00:05  21
23:  3 2018-10-14 02:00:05  31
24:  4 2018-10-14 02:00:05  41
25:  5 2018-10-14 02:00:05  51

Benchmark of the two examples by @DavidArenburg:

library(microbenchmark)

mc = microbenchmark(times = 100,
  res1 = dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref), nomatch = 0L, allow.cartesian=TRUE][, .N, by = .(id, id2)],
  res2= dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian=TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
)

mc

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
 res1 156.8389 158.8122 165.1828 159.6931 165.9156 292.7987   100  a 
 res2 311.1658 324.5684 350.3006 331.4310 343.6755 815.8397   100   b

Best answer

A possible data.table solution

dft[dft, .(id, id2 = x.id), # get the desired columns
         on = .(id > id, time, ref), # the join condition
         nomatch = 0L, # remove unmatched records (NAs)
         allow.cartesian = TRUE # In case of a big join, allow Cartesian join 
     ][, .N, by = .(id, id2)] # Count obs. per ids combinations

#    id id2 N
# 1:  1   2 4
# 2:  3   4 1
# 3:  3   5 1
# 4:  4   5 1
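
The question also asks for a configurable minimum number of synchronized timestamps, which the join above does not apply by itself. A minimal sketch, assuming the same dummy data (timestamps left as plain integers here, which does not affect the join); the function name synced_pairs and the parameter min_sync are hypothetical and not part of the original answer:

```r
library(data.table)

# Same dummy data as in the question (time kept as integers for brevity)
dft <- data.table(
  id   = rep(1:5, each = 5),
  time = rep(1:5, 5),
  ref  = c(10, 11, 11, 11, 11,
           10, 11, 11, 11, 21,
           20, 31, 31, 31, 31,
           20, 41, 41, 41, 41,
           20, 51, 51, 51, 51)
)

# Hypothetical helper: count shared (time, ref) rows per id pair,
# then keep only pairs with at least `min_sync` shared timestamps
synced_pairs <- function(dt, min_sync) {
  counts <- dt[dt, .(id, id2 = x.id),
               on = .(id > id, time, ref),
               nomatch = 0L,
               allow.cartesian = TRUE][, .N, by = .(id, id2)]
  counts[N >= min_sync]
}

res4 <- synced_pairs(dft, 4)  # only the pair (1, 2), with N = 4
res5 <- synced_pairs(dft, 5)  # no rows: no pair shares 5 timestamps
```

Filtering after the count keeps the join itself unchanged, so the threshold can be varied without redoing the expensive step if the counts are stored.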

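Explanation

We self-join on time and ref while specifying id > id, so a row is never joined to the same id and each pair of ids is produced only once. We extract the ids from both sides of the join (id and x.id) and drop all non-matching rows (nomatch = 0L). Finally, we count the matches per combination (.N is a special symbol in data.table that holds the number of observations in each group).

An older (and more involved) solution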

dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref)
    ][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
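
The older version joins on time and ref only, so every row also matches itself and each real pair shows up in both orders; that is why it filters on V1 != V2 and halves the count. A small sketch on a hypothetical three-row table (toy is not part of the original post) makes the intermediate result visible:

```r
library(data.table)

# Hypothetical toy data: ids 1 and 2 share ref 10 at time 1, id 3 does not
toy <- data.table(id   = c(1L, 2L, 3L),
                  time = c(1L, 1L, 1L),
                  ref  = c(10, 10, 20))

# Equi-join on time and ref; pmin/pmax put each pair into a fixed order
raw <- toy[toy, .(V1 = pmin(id, i.id), V2 = pmax(id, i.id)),
           on = .(time, ref), allow.cartesian = TRUE]
# raw has 5 rows: (1,1), (1,2), (1,2), (2,2), (3,3)

# Drop self-matches, then halve the count because (1,2) appears twice
pairs <- raw[V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
# pairs has one row: id1 = 1, id2 = 2, synced = 1
```

The non-equi join in the newer solution avoids both the self-matches and the duplicates up front, which fits with it being the faster of the two in the benchmark above.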

Regarding sql - How to find synchronized IDs based on timestamp and value, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51771490/
