r - Efficiently counting numbers falling within each range of numbers

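I'm looking for a faster solution to the following problem. I'll illustrate it with a small example and then provide code to simulate large data, since that is the point of this question. My actual problem size is a list of length 1 million.

Say I have two lists, as shown below: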

x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))

Properties of x and y:

Each element of the list x always sums to 100.
Each element of y is always sorted and always lies between 1 and 100.
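The two properties can be checked directly on the small example. This is only an illustrative sketch in base R; the names x_ok, y_sorted and y_in_range are hypothetical and do not appear in the question.

```r
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

# Property 1: every element of x sums to 100
x_ok <- all(vapply(x, sum, numeric(1)) == 100)

# Property 2: every element of y is sorted and lies between 1 and 100
y_sorted <- !any(vapply(y, is.unsorted, logical(1)))
y_in_range <- all(unlist(y) >= 1 & unlist(y) <= 100)

stopifnot(x_ok, y_sorted, y_in_range)
```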

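Problem:

Now, what I want is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100 (the upper bounds 82 and 100 are the cumulative sums of x[[1]]). That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the only number between 83 and 100 is c(90). Similarly, for x[[2]] and y[[2]], the result is c(0, 2, 1). That is, the answer should be: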

[[1]]
[1] 4 1

[[2]]
[1] 0 2 1

Let me know if this is still unclear.
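For a single pair, the specification above can be read as counting how many values of y fall at or below each cumulative sum of x, and then differencing those counts. A minimal base R sketch of that reading follows. It assumes y is sorted, as stated in the properties, and count_in_ranges is a hypothetical helper name that is not part of the question. It is meant to pin down the expected output, not to be fast on 1M entries.

```r
# Hypothetical helper: counts of yi within the ranges implied by cumsum(xi).
count_in_ranges <- function(xi, yi) {
    upper <- cumsum(xi)                     # upper bounds, e.g. 82 and 100
    at_or_below <- findInterval(upper, yi)  # number of yi values <= each bound (yi must be sorted)
    diff(c(0L, at_or_below))                # counts per range
}

x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))
res <- Map(count_in_ranges, x, y)
```

On the example lists, res[[1]] and res[[2]] should match the two vectors shown above.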

Simulated data with 1 million entries

set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)

x <- lapply(seq_len(n), function(ix) {
    probs <- sample(100:1000, len[ix])
    probs <- probs/sum(probs)

    oo <- round(N * probs)
    if (sum(oo) != 100) {
        oo[1] <- oo[1] + (100 - sum(oo))
    }
    oo
})

require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)),by=grp]$V1

What I've done so far:

Using mapply (slow):

I first thought of using rank with ties.method="first" together with mapply (the obvious choice with 2 lists), and tried this:

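tt1 <- mapply(y, x, FUN=function(a, b) {
    # rank the y values together with the cumulative bounds of x; keep only the bounds' ranks
    tt <- rank(c(a, cumsum(b)), ties.method="first")[-(1:length(a))]
    # a bound's rank, minus the bounds ranked before it, gives the count of y values in its range
    c(tt[1]-1, diff(tt)-1)
})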

While this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds up. It takes 241 seconds!
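To see why the rank trick gives the right counts, it may help to trace it on the first pair from the small example. The sketch below only unrolls the function body above; the intermediate names r and counts are added for illustration.

```r
a <- c(1, 2, 3, 55, 90)   # y[[1]]
b <- c(82, 18)            # x[[1]], so the bounds are cumsum(b) = 82, 100

# Rank the y values and the bounds together. With ties.method = "first",
# a y value equal to a bound is ranked before it, so it counts as <= bound.
r <- rank(c(a, cumsum(b)), ties.method = "first")

# Keep only the ranks of the bounds
tt <- r[-seq_along(a)]

# Bound k has rank (number of y values <= bound k) + k,
# so subtracting the earlier bounds leaves the per-range counts.
counts <- c(tt[1] - 1, diff(tt) - 1)
```

Here the bound 82 is ranked 5th and the bound 100 is ranked 7th, which yields the counts 4 and 1.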

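Therefore, I decided to try to avoid rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution, shown below:

Using data.table (faster):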

xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <-rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]

# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1

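This takes 26 seconds, so it is about 9 times faster. I'm wondering if it's possible to get even more speedup, as I'll have to compute this recursively on 5-10 such lists of 1 million elements. Thank you.

Best answer

Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it about 2x faster than the OP's data.table solution.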

# compile the data.table's, set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))

# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))

# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval'))  # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]

# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table of that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))

# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]

# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
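The core idea of the answer, a rolling join with roll = -Inf followed by a zero-fill, can be seen on the small example from the question. The sketch below is a simplified illustration and not the answer's exact code: it uses on= instead of the setattr key hacks, and the names cuts, vals, upper, matched, counts and filled are assumptions made for this example.

```r
library(data.table)

x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1, 2, 3, 55, 90), c(37, 38, 95))

# One row per range: its group and its cumulative upper bound
cuts <- data.table(grp = rep(seq_along(x), lengths(x)),
                   upper = unlist(lapply(x, cumsum)))
cuts[, cumval := upper]   # copy to join on, so upper survives the join

# One row per y value
vals <- data.table(grp = rep(seq_along(y), lengths(y)), val = unlist(y))

# roll = -Inf attaches each y value to the next bound at or above it, within its group
matched <- cuts[vals, on = .(grp, cumval = val), roll = -Inf]

# Count y values per range, then join back to cuts to restore ranges with no y values
counts <- matched[, .N, by = .(grp, upper)]
filled <- counts[cuts, on = .(grp, upper)]
filled[is.na(N), N := 0L]

res <- split(filled$N, filled$grp)
```

On this input, res should hold the counts 4, 1 for the first group and 0, 2, 1 for the second. The zero comes from the fill step, since no y value falls at or below 35.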

This question and answer on efficiently counting numbers falling within each range of numbers in R come from Stack Overflow: https://stackoverflow.com/questions/17748387/
