Rule of thumb for the memory size of datasets in R

Is there a rule of thumb for knowing when R will have trouble handling a given dataset in RAM (given a PC configuration)?

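For example, I have heard that a rule of thumb is to allow 8 bytes for each cell. Then, if I have 1,000,000 observations of 1,000 columns, that would be close to 8 GB (1,000,000 × 1,000 × 8 bytes); hence, on most home computers, we would probably have to store the data on the hard drive and access it in chunks.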
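A minimal sketch of that arithmetic, assuming every cell is a double-precision number. The helper name `estimate_numeric_bytes` is made up for illustration, and the estimate ignores object headers and any temporary copies R makes while working on the data:

```r
# Hypothetical helper: estimated bytes for a table of doubles,
# assuming 8 bytes per cell and no per-object overhead.
estimate_numeric_bytes <- function(n_rows, n_cols, bytes_per_cell = 8) {
  n_rows * n_cols * bytes_per_cell
}

bytes <- estimate_numeric_bytes(1e6, 1e3)
bytes / 1e9     # size in GB (decimal units)
bytes / 1024^3  # size in GiB (binary units)
```

With these inputs the estimate is 8e9 bytes, i.e. 8 GB in decimal units and somewhat less in GiB.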

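Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to perform some basic operations such as data tidying, data visualisation and some analysis (regression).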

PS: it would be nice to explain how the rule of thumb works, so that it is not just a black box.

Best answer

The memory footprint of some vectors at different sizes, in bytes.

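# Vector lengths to test; the names become the column headers of the result
n <- c(1, 1e3, 1e6)
names(n) <- n
# A single string of 100 spaces, reused for the "identical strings" cases
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")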

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers                                 = integer(n),
        Floats                                   = numeric(n),
        Logicals                                 = logical(n),
        "Empty strings"                          = character(n),
        "Identical strings, nchar=100"           = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100"            = strings_of_one_hundred_chars,
        "Factor of empty strings"                = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100"  = factor(strings_of_one_hundred_chars),
        Raw                                      = raw(n),
        "Empty list"                             = vector("list", n)
      ),
      object.size
    )
  }
)

Some of the values differ between 64-bit and 32-bit R.

## Under 64-bit R
##                                          1   1000     1e+06
## Integers                                48   4040   4000040
## Floats                                  48   8040   8000040
## Logicals                                48   4040   4000040
## Empty strings                           96   8088   8000088
## Identical strings, nchar=100           216   8208   8000208
## Distinct strings, nchar=100            216 176040 176000040
## Factor of empty strings                464   4456   4000456
## Factor of identical strings, nchar=100 584   4576   4000576
## Factor of distinct strings, nchar=100  584 180400 180000400
## Raw                                     48   1040   1000040
## Empty list                              48   8040   8000040

## Under 32-bit R
##                                          1   1000     1e+06
## Integers                                32   4024   4000024
## Floats                                  32   8024   8000024
## Logicals                                32   4024   4000024
## Empty strings                           64   4056   4000056
## Identical strings, nchar=100           184   4176   4000176
## Distinct strings, nchar=100            184 156024 156000024
## Factor of empty strings                272   4264   4000264
## Factor of identical strings, nchar=100 392   4384   4000384
## Factor of distinct strings, nchar=100  392 160224 160000224
## Raw                                     32   1024   1000024
## Empty list                              32   4024   4000024
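To separate the fixed per-object overhead from the per-element cost in the tables above, one can look at how `object.size()` grows between two lengths. This is only a sketch: the helper name `bytes_per_element` is hypothetical, and the two lengths are simply the ones used in the tables.

```r
# Hypothetical helper: marginal bytes per element, estimated from the
# growth in object.size() between a small and a large vector.
bytes_per_element <- function(make_vector, n_small = 1e3, n_large = 1e6) {
  size_small <- as.numeric(object.size(make_vector(n_small)))
  size_large <- as.numeric(object.size(make_vector(n_large)))
  (size_large - size_small) / (n_large - n_small)
}

# Each constructor takes the vector length as its first argument.
per_element <- sapply(
  list(Integers = integer, Floats = numeric, Logicals = logical, Raw = raw),
  bytes_per_element
)
per_element
```

This should give 4 bytes per element for integers and logicals, 8 for floats and 1 for raw. The "8 bytes per cell" rule of thumb therefore corresponds to numeric (double) data, and the remaining few dozen bytes in each table entry are a fixed header per vector.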

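Note that in 64-bit R, factors have a smaller memory footprint than character vectors when the same string is repeated many times (about 4 bytes per element instead of 8), but not when all the strings are unique. In 32-bit R the two are about the same size in the repeated case, as the second table shows.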
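The factor-versus-character comparison can be checked directly on a smaller example. This is an illustrative sketch that assumes 64-bit R; the sample values, the seed and the vector length are arbitrary choices, not part of the original benchmark.

```r
set.seed(1)  # arbitrary seed so the sampled strings are reproducible
n <- 1e5

# Few distinct values, each repeated many times
repeated <- sample(c("alpha", "beta", "gamma"), n, replace = TRUE)
# Every value distinct
unique_strings <- sprintf("id_%06d", seq_len(n))

# Size in bytes of each representation
sizes <- sapply(
  list(
    "Character, repeated" = repeated,
    "Factor, repeated"    = factor(repeated),
    "Character, unique"   = unique_strings,
    "Factor, unique"      = factor(unique_strings)
  ),
  function(x) as.numeric(object.size(x))
)
sizes
```

In the repeated case the factor stores one 4-byte integer code per element plus three levels, whereas the character vector stores one pointer per element. In the unique case the factor has to keep every string as a level in addition to the integer codes, so it comes out larger.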

Regarding the rule of thumb for the memory size of datasets in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21754319/
