R: splitting a data frame and running tests in parallel

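I have two matrices and want to compute some statistics on them, comparing each row of dataframe1 against each row of dataframe2. These are large data frames (300,000 rows and 40,000 rows), so there are a lot of comparisons to make.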

I wrote some functions to apply the statistics. I would like to know whether dataframe1 can be split into blocks and those blocks run in parallel on multiple cores.

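library(lawstat)   # levene.test()
library(reshape2)  # melt()

# df1: 100 x 100 matrix of group labels (0, 1 or 2)
df1 = matrix(ncol = 100, nrow = 100)
for (i in 1:100) {
  df1[, i] = floor(runif(100, min = 0, max = 3))
}

# df2: 1000 x 100 matrix of measurements
df2 = matrix(ncol = 100, nrow = 1000)
for (i in 1:100) {
  df2[, i] = runif(1000, min = 0, max = 1000)
}

# Levene's test for every (row of df1, row of df2) pair,
# returned as long-format p-values
testFunc <- function(df1, df2) {
  x <- apply(df1, 1, function(x) apply(df2, 1, function(y) levene.test(y, x)$p.value))
  x <- melt(x)
  return(x)
}

system.time(res <- testFunc(df1, df2))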

Some of the statistics (Levene's test, for example) take quite a long time to compute, so anything that speeds up the calculation would be great.

Best answer

There is room to optimize your function itself, but here is an example of an improvement using the parallel package:

library(parallel)

# I have a quad-core processor, so I am using 3 cores here.
cl <- parallel::makeCluster(3)
testFunc2 <- function(df1, df2) {
  # Each worker handles a share of df1's rows; df2 is passed along as an extra argument.
  x <- parallel::parApply(cl = cl, X = df1, MARGIN = 1,
                          FUN = function(x, df2) apply(df2, 1, function(y) lawstat::levene.test(y, x)$p.value),
                          df2 = df2)
  x <- melt(x)
  return(x)
}
system.time(res <- testFunc2(df1, df2))
parallel::stopCluster(cl)  # release the workers when finished

On my machine, with a cluster size of 3, this cuts the run time by at least half.
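The question asks specifically about splitting dataframe1 into blocks. parApply does that splitting internally, but it can also be made explicit. The following is a minimal sketch, not the answer author's code: the helper name block_pvalues, the toy matrix sizes and the cluster size of 2 are assumptions chosen for illustration, and lawstat must be installed on the machine running the workers.

```r
library(parallel)

# Toy inputs shaped like the question's data, kept small so the sketch runs quickly.
set.seed(1)
df1 <- matrix(floor(runif(20 * 100, min = 0, max = 3)), nrow = 20)  # group labels
df2 <- matrix(runif(30 * 100, min = 0, max = 1000), nrow = 30)      # measurements

# Hypothetical helper: Levene p-values for one block of df1 rows against every row of df2.
# Returns a matrix with one row per df2 row and one column per df1 row in the block.
block_pvalues <- function(rows, df1, df2) {
  sapply(rows, function(i) {
    apply(df2, 1, function(y) lawstat::levene.test(y, df1[i, ])$p.value)
  })
}

cl <- makeCluster(2)  # assumed worker count; adjust to the machine

# Split the row indices of df1 into one contiguous block per worker.
blocks <- splitIndices(nrow(df1), length(cl))

# Run each block on its own worker, then glue the per-block matrices back together.
res_blocks <- parLapply(cl, blocks, block_pvalues, df1 = df1, df2 = df2)
stopCluster(cl)

pvals <- do.call(cbind, res_blocks)  # rows follow df2, columns follow df1
```

Passing only row indices keeps the block definition cheap, although df1 and df2 are still sent to every worker. With inputs as large as those in the question, it may be worth sending each worker only its own slice of df1.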

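Edit: I felt bad about dissing your code, so below is a stripped-down levene.test function. On most home or work machines it improves performance more than running in parallel does.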

lev_lite <- function(y, group){
  # Sort by group so that rep(<per-group value>, n) lines up with y
  reorder <- order(group)
  group <- as.factor(group[reorder])
  y <- y[reorder]

  N <- length(y)       # total number of observations (100 here)
  k <- nlevels(group)  # number of groups (3 here)

  n <- tapply(y, group, FUN = length)       # group sizes
  yi_bar <- tapply(y, group, FUN = median)  # group medians
  zij <- abs(y - rep(yi_bar, n))            # absolute deviations from the group median
  zidot <- tapply(zij, group, FUN = mean)   # mean deviation per group
  zdotdot <- mean(zij)                      # overall mean deviation
  # Test statistic; see the Wikipedia article on Levene's test
  W <- ((N - k)/(k - 1)) * (
    sum(n*(zidot - zdotdot)^2)/
      sum((zij - rep(zidot, n))^2))

  # p-value returned
  1 - pf(W, k - 1, N - k)
}
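
For reference, the statistic computed by lev_lite can be written out as follows. This is a restatement of the code above using the same names (zij, zidot, zdotdot, n, N, k), with the group median as the center, so it is the median-based (Brown-Forsythe) form of Levene's test:

```latex
Z_{ij} = \left| y_{ij} - \tilde{y}_{i} \right|,
\qquad
W = \frac{N - k}{k - 1} \cdot
    \frac{\sum_{i=1}^{k} n_i \left( \bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot} \right)^2}
         {\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( Z_{ij} - \bar{Z}_{i\cdot} \right)^2},
\qquad
p = 1 - F_{k-1,\,N-k}(W)
```

Here $\tilde{y}_i$ is the median of group $i$, $\bar{Z}_{i\cdot}$ is the mean of the $Z_{ij}$ within group $i$, $\bar{Z}_{\cdot\cdot}$ is the mean of all $Z_{ij}$, and $F_{k-1,\,N-k}$ is the cumulative distribution function of the F distribution with $k-1$ and $N-k$ degrees of freedom.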

testFunc2 <- function(df1, df2){
  x <- apply(df1, 1, function(x) apply(df2, 1, lev_lite, group = x))
  x <- melt(x)
  return(x)
}

> system.time(res <- testFunc(df1[1:50, ],df2[1:50,] ))
user  system elapsed 
5.53    0.00    5.56 
> system.time(res2 <- testFunc2(df1[1:50, ],df2[1:50, ] ))
user  system elapsed 
1.13    0.00    1.14 
> max(res2 - res)
[1] 2.220446e-15

That is roughly a 5x speedup without any parallelization.
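
The two ideas can also be combined by running the stripped-down function on a cluster. The sketch below is an assumption about how that might look rather than something the answer tested: it repeats a compact lev_lite so that it runs on its own, uses toy matrix sizes and a cluster size of 2, and exports the function to the workers with clusterExport.

```r
library(parallel)

# Compact copy of the stripped-down Levene test (median-centered), so the sketch is self-contained.
lev_lite <- function(y, group) {
  ord <- order(group)
  y <- y[ord]
  group <- as.factor(group[ord])
  N <- length(y)
  k <- nlevels(group)
  n <- tapply(y, group, length)
  med <- tapply(y, group, median)
  zij <- abs(y - rep(med, n))
  zidot <- tapply(zij, group, mean)
  W <- ((N - k) / (k - 1)) *
    sum(n * (zidot - mean(zij))^2) / sum((zij - rep(zidot, n))^2)
  1 - pf(W, k - 1, N - k)
}

# Toy inputs (assumed sizes, much smaller than the question's data).
set.seed(1)
df1 <- matrix(floor(runif(20 * 100, min = 0, max = 3)), nrow = 20)  # group labels
df2 <- matrix(runif(30 * 100, min = 0, max = 1000), nrow = 30)      # measurements

cl <- makeCluster(2)  # assumed worker count
# Workers start with empty sessions, so the function has to be sent to them first.
clusterExport(cl, "lev_lite", envir = environment())

# Rows of df1 are distributed over the workers; each row is tested against all of df2.
pvals <- parApply(cl, df1, 1,
                  function(x, df2) apply(df2, 1, lev_lite, group = x),
                  df2 = df2)
stopCluster(cl)
```

Whether this beats the serial lev_lite version depends on the problem size, since every call ships df2 to the workers. On the small example above the communication overhead may well dominate, so it is worth timing both with system.time on realistic data.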

Source: Stack Overflow, https://stackoverflow.com/questions/47763768/
