r - How to vectorize or otherwise speed up loop logic in R?

Long-time lurker, first-time asker.

I'm trying to calculate the "items in common between two sets of items" for a dataset of 20M+ items. Sample data is shown below.

#serially numbered items
parents <- rep(1:10000)

#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))

#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
  if (numchild[x]>0){
    f1 <- sample(1:length(parents), size=numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  }
  else {
    parent_child <- c(parent_child, list(x+1))    #if numchild=0, make up something
  }
}

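Here's what I want to do: say parent #1 has 5 children (1, 2, 3, 4, 5) and parent #2 has 3 children (4, 10, 22).

I want to compute length(intersect(...)) for every (parent_i, parent_j) combination. In the case above it would be 1 item in common: 4.

I'm doing this for 10M+ parents, which on average have 15-20 children each, with a range of (0, 100). So that's a 10M x 10M item-item matrix.

I have a foreach loop that I'm testing on a smaller subset. It works, but it doesn't scale to the full dataset (on a 64-core machine with 256GB RAM). With the loop below I'm already computing only half of the user-user matrix, since (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.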

#small subset
a <- parent_child[1:1000]

outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[i+1:length(a)]

  foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
  }  
}

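I've been experimenting with variations on this (using Reduce, storing the parent-children in a data frame, etc.), but without much luck.

Is there a way to make this scale?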

Best answer

I reversed the split, so that we have a child-parent relationship

len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len), 
                      unlist(parent_child, use.names=FALSE))
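As a small illustration, here is the same inversion applied to the two parents from the question's example. This is only a sketch: `toy` and `toy_child_parent` are made-up names, and the data is the 5-child / 3-child case described in the question rather than the full dataset.

```r
# Toy input from the question: parent 1 has children 1..5, parent 2 has 4, 10, 22
toy <- list(c(1, 2, 3, 4, 5), c(4, 10, 22))

len <- sapply(toy, length)
toy_child_parent <- split(rep(seq_along(toy), len),
                          unlist(toy, use.names=FALSE))

# Each element is named by a child id and holds the parents that have that child
toy_child_parent[["4"]]    # child 4 belongs to parents 1 and 2
```

Only child 4 maps to more than one parent, which is why it is the only child that contributes to a parent-parent intersection.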

Then, across all children, construct a string for each pair of parents that share a child, like this

keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
    x <- combn(sort(x), 2)
    paste(x[1,], x[2,], sep=".")
})

and tally

table(unlist(int, use.names=FALSE))

or a little more quickly

xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
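To make the pairing and tallying steps concrete, here is a minimal sketch on a hypothetical child-to-parents mapping (the `toy_*` names and the data are invented for illustration, not taken from the question):

```r
# Hypothetical child -> parents mapping: children "a" and "b" are shared by
# parents 1 and 2; child "c" is shared by parents 1, 2 and 3
toy_child_parent <- list(a=c(1L, 2L), b=c(1L, 2L), c=c(1L, 2L, 3L))

toy_int <- lapply(toy_child_parent, function(x) {
    x <- combn(sort(x), 2)              # all pairs of parents, one pair per column
    paste(x[1,], x[2,], sep=".")
})

# Tally how many children each pair of parents has in common
xx <- unlist(toy_int, use.names=FALSE)
nms <- unique(xx)
toy_counts <- setNames(tabulate(match(xx, nms), length(nms)), nms)
toy_counts[["1.2"]]    # parents 1 and 2 share the children a, b and c
```

Each name in the result is a "parent_i.parent_j" pair and each value is the size of that pair's intersection.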

Wrapped up as a function:

f1 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

with (this is for all 10000 parent-child elements)

> system.time(ans1 <- f1(parent_child))
   user  system elapsed 
 14.625   0.012  14.668 
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806 
       1        1        1        1        1        1

I'm not sure that this will really scale to the size of problem you're talking about, though: it's polynomial in the number of parents per child.
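A quick way to see that growth: a child shared by k parents contributes choose(k, 2) parent pairs. The values of `k` below are arbitrary illustrations, not measurements from the dataset.

```r
# Number of parent pairs generated for a child that is shared by k parents
k <- c(2, 10, 100, 1000)
pairs_tbl <- data.frame(parents_per_child=k, pairs=choose(k, 2))
pairs_tbl
```

So a handful of very popular children can dominate both the running time and the memory used by the intermediate vector of pairs.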

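One possibility for a speed-up is to "memoize" the combination calculation, using the length of the argument as the "key" and storing the combinations as the "value". This reduces the number of times combn is called to the number of unique lengths among the elements of child_parent.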

combn1 <- local({
    memo <- new.env(parent=emptyenv())
    function(x) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
    }
})
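The reason memoizing by length works is that the index matrix returned by combn(k, 2) depends only on k, so it can be reused to index any vector of that length. A minimal sketch with made-up data (`toy_child_parent`, `idx` and `x` are illustrative names):

```r
# Hypothetical child -> parents mapping in which several lengths repeat
toy_child_parent <- list(c(1L, 2L), c(3L, 4L), c(1L, 2L, 3L), c(2L, 5L))
lens <- sapply(toy_child_parent, length)
n_calls_plain <- length(lens)            # one combn call per element
n_calls_memo  <- length(unique(lens))    # one combn call per distinct length

# The index matrix for length 3 can be applied to any vector of length 3
idx <- t(combn(3, 2))
x <- c(10L, 20L, 30L)
pairs <- paste(x[idx[,1]], x[idx[,2]], sep=".")
pairs
```

Here four elements need only two calls to combn, one for length 2 and one for length 3.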

f2 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len), 
                          unlist(parent_child, use.names=FALSE))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn1)

    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

which helps a bit

>     system.time(ans2 <- f2(parent_child))
   user  system elapsed 
  5.337   0.000   5.347 
>     identical(ans1, ans2)
[1] TRUE

The slow part is now paste

>     Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
                 self.time self.pct total.time total.pct
"paste"               3.92    73.41       3.92     73.41
"match"               0.74    13.86       0.74     13.86
"unique.default"      0.40     7.49       0.40      7.49
"as.character"        0.08     1.50       0.08      1.50
"unlist"              0.08     1.50       0.08      1.50
"combn"               0.06     1.12       0.06      1.12
"lapply"              0.02     0.37       4.00     74.91
"any"                 0.02     0.37       0.02      0.37
"setNames"            0.02     0.37       0.02      0.37

$by.total
...

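We can avoid this by encoding each pair of parents that share a child id as a single integer; because of the way floating-point numbers are represented in R, this will be exact until about 2^21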

encode <- function(x, y, n)
    (x - 1) * (n + 1) + y
decode <- function(z, n)
    list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
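As a quick sanity check, the encoding can be round-tripped on a few pairs. This is a sketch that repeats the two definitions above so it runs on its own; the pairs and `n` are example values chosen to match the 10000-parent sample data.

```r
encode <- function(x, y, n)
    (x - 1) * (n + 1) + y
decode <- function(z, n)
    list(x=ceiling(z / (n + 1)), y = z %% (n + 1))

n <- 10000                                  # number of parents in the sample data
z <- encode(c(1, 542, 9999), c(2, 1611, 10000), n)
pair <- decode(z, n)
pair$x    # recovers the first parent of each pair
pair$y    # recovers the second parent of each pair
```

Each pair (x, y) with 1 <= y <= n maps to a distinct number, so unique() and match() can work on numeric values instead of pasted strings.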

and adjust our combn1 and f2 functions as

combn2 <- local({
    memo <- new.env(parent=emptyenv())
    function(x, encode_n) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
    }
})

f3 <- function(parent_child) {
    encode_n <- length(parent_child)
    len <- sapply(parent_child, length)
    child_parent <-
        unname(split(rep(seq_along(parent_child), len), 
                     unlist(parent_child, use.names=FALSE)))

    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn2, encode_n)

    id <- unlist(int, use.names=FALSE)
    uid <- unique(id)
    n <- tabulate(match(id, uid), length(uid))
    do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}

leading to

> system.time(f3(parent_child))
   user  system elapsed 
  2.140   0.000   2.146

which compares very favorably with jlhoward's revised answer (note that the timing above is for all 10,000 parent-child relations)

> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
   user  system elapsed 
  2.465   0.000   2.468
> system.time(f3(parent_child[1:99]))
   user  system elapsed 
  0.016   0.000   0.014

and scales in a much more reasonable way.

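For what it's worth, the data generation routine is in the second circle of Patrick Burns' R Inferno: it uses the "copy-and-append" algorithm rather than pre-allocating space and filling it in. Avoid this by writing the body of the for loop as a function and using lapply. Avoid the need for a complicated conditional inside the for loop by fixing the problem beforehand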

numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))

or by sampling from a distribution (rpois, rbinom) that generates non-negative integer values. Data generation is then

n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)

The original question, "How to vectorize or otherwise speed up loop logic in R?", is on Stack Overflow: https://stackoverflow.com/questions/20307020/
