performance - 提高性能功能

标签 performance r

我学习提高代码性能的方法。为了研究,我编写了返回基本描述性统计数据的函数,如 psych::describe。我已经尝试了不同版本的循环,目前这是我所能做的。

代码:

x <- matrix(rnorm(10*100), nrow=100) # sample data for tests

descStats <- function(x, na.rm = TRUE, trim = NULL, skew = FALSE, byrow = FALSE, digits =     getOption("digits")) {
 if (!is.matrix(x)) x <- as.matrix(x)
 if(byrow) x <- t(x)
 stats <- c("n", "mean", "se", "sd", "median", "min", "max", "range") # descriptive statistics
 if (skew) {
   library(moments)
   stats <- c(stats, "skewness", "kurtosis")
 }
 if (!is.null(trim)) {
   stats <- append(stats, "trimmed", which(stats == "mean"))
   trimmed <- function(x) base::mean(x, trim=trim)
 }
 n <- function(x) length(x)
 range <- function(x) max(x) - min(x)
 mean <- function(x) .Internal(mean(x)) # redefined mean function
 sd <- function(x) sqrt(sum((x - mean(x))^2)/(length(x)-1)) # redefined sd function
 se <- function(x) sqrt(sd(x)/length(x))
 median <- function(x) { # redefined median function
   n <- length(x)
   half <- (n + 1L)%/%2L
   if (n%%2L == 1L) 
     result <- .Internal(sort(x, partial = half))[half]
   else {
     result <- mean(.Internal(sort(x, partial = half + 0L:1L))[half + 0L:1L])
   }
 }
 describe <- function(x, na.rm=FALSE) {
   if (na.rm) x <- x[!is.na(x)]
   result <- vapply(stats, function(fun) eval(call(fun, x)), FUN.VALUE=numeric(1))
   return(result)
 }
 out <- t(vapply(seq_len(ncol(x)), function(i) describe(x[,i], na.rm=na.rm), FUN.VALUE=numeric(length(stats))))
 out <- round(out, digits=digits)
 return(out)
}

print(descStats(x))
##        n       mean    trimmed        se        sd     median       min      max    range
## [1,] 100  0.2524298  0.2763559 0.1024722 1.0500560  0.2842625 -2.905826 3.362598 6.268424
## [2,] 100 -0.1201740 -0.0627668 0.1027268 1.0552801 -0.0614541 -3.071836 2.247063 5.318899
## [3,] 100  0.2074781  0.1946393 0.1006384 1.0128089  0.1928790 -2.312749 2.564297 4.877047
## [4,] 100  0.1088077  0.1127540 0.0935370 0.8749172  0.0864728 -2.757226 2.883687 5.640913
## [5,] 100 -0.2163515 -0.2147170 0.1064167 1.1324524 -0.2836884 -3.431254 2.950466 6.381720
## [6,] 100 -0.0324696 -0.0229878 0.0968330 0.9376630  0.0919468 -2.474992 1.860961 4.335953
## [7,] 100 -0.1497724 -0.1665687 0.1047835 1.0979579 -0.1753578 -2.908781 2.885645 5.794425
## [8,] 100 -0.0197306  0.0101194 0.1030385 1.0616927  0.0615438 -2.711356 2.506423 5.217779
## [9,] 100 -0.0346922 -0.0290022 0.1018726 1.0378033  0.0231049 -2.467852 2.528595 4.996447
## [10,] 100  0.1251403  0.1222156 0.1012441 1.0250359  0.1606492 -2.566209 2.854519 5.420728

在每种情况下,我都将耗时与 microbenchmaark 进行了比较。例如:

library(microbenchmark)
bench <- microbenchmark(descStats(x), descStats2(x), times=1000)
print(bench)
boxplot(bench, outline=FALSE)

谁能提供更高效或更紧凑的代码版本?

更新:

这个函数的最终版本你可以看here .

最佳答案

这是一个在测试对象 x 上快 60% 的版本,它总是计算偏度和峰度(没有 moments 包)。

descStats2 <- function(x, na.rm = TRUE, trim = 0.1, skew = TRUE,
  byrow = FALSE, digits = getOption("digits")) {

  stopifnot(is.matrix(x))
  f <- function(x, na.rm) {
    if(na.rm) x <- x[!is.na(x)]
    n <- length(x)
    s <- numeric(11)
    s[1] <- n
    s[2] <- sum(x)/n
    dev <- x-s[2]
    dev_s2 <- sum(dev^2L)
    s[3] <- mean.default(x, trim=trim)
    s[5] <- sqrt(dev_s2/(n-1))
    s[4] <- sqrt(s[5]/n)
    s[6] <- median.default(x)
    s[7] <- min(x)
    s[8] <- max(x)
    s[9] <- s[8]-s[7]
    s[10] <- (sum(dev^3L)/n) / (dev_s2/n)^1.5
    s[11] <- n * sum(dev^4L) / (dev_s2^2L)
    s
  }
  if(byrow) x <- t(x)
  out <- matrix(0.0, ncol(x), 11, dimnames=list(NULL, c("n", "mean", "trimmed",
    "se", "sd", "median", "min", "max", "range", "skewness", "kurtosis")))
  for(i in 1:ncol(x))
    out[i,] <- f(x[,i], na.rm=na.rm)
  round(out, digits=digits)
}

我的函数速度更快的原因是因为我避免了很多不必要的函数调用,并且我存储并重新使用了我已经计算出的值(例如 devdev_s2s[5]等)。

library(compiler)
ds2c <- cmpfun(descStats2)

library(rbenchmark)
benchmark(descStats(x, skew=TRUE), descStats2(x), ds2c(x),
  order="relative", replications=1000)[,1:5]
#                        test replications elapsed relative user.self
# 3                   ds2c(x)         1000   3.969    1.000     3.920
# 2             descStats2(x)         1000   5.115    1.289     4.984
# 1 descStats(x, skew = TRUE)         1000   8.234    2.075     7.837
identical(descStats(x, skew=TRUE), descStats2(x))
# [1] TRUE

关于performance - 提高性能功能,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/14209427/

相关文章:

reactjs - 什么时候在 NextJS 应用程序中使用可加载组件?

r - R Shiny 中的环境

r - 在 Windows 上安装 rjags 包

r - strptime 不完全工作

r - dist 函数的第 N 个最小对

performance - 为什么在遍历数组时 `while` 的性能据说比 `for` 慢?

performance - Scala 的理解性能

r - 用于矩阵或数据框的 writeClipboard?

c# - 到 ToList() 还是不到 ToList()?

javascript - 在客户端搜索、排序和分页大型数据集