r - Row-binding data.frames in R without creating copies

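I have a large number of data.frames that need to be bound pairwise by column, then bound by row, and then fed into a predictive model. Since no values are modified, I would like the final data.frame to point to the original data.frames in my list.

For example: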

library(pryr)

#individual dataframes
df1 <- data.frame(a=1:1e6+0, b=1:1e6+1)
df2 <- data.frame(a=1:1e6+2, b=1:1e6+3)
df3 <- data.frame(a=1:1e6+4, b=1:1e6+5)

#each occupy 16MB
object_size(df1)  # 16 MB
object_size(df2)  # 16 MB
object_size(df3)  # 16 MB
object_size(df1, df2, df3)  # 48 MB

#will be in a named list
dfs <- list(df1=df1, df2=df2, df3=df3)

#putting into list doesn't create a copy
object_size(df1, df2, df3, dfs)  #48MB

The final data.frame will have this layout (each unique pair of data.frames is bound by column, then the pairs are bound by row):

df1, df2
df1, df3
df2, df3

This is how I currently implement it:

#generate unique df combinations
df_names <- names(dfs)
pairs <- combn(df_names, 2, simplify=FALSE)

#bind dfs by columns
combo_dfs <- lapply(pairs, function(x) cbind(dfs[[x[1]]], dfs[[x[2]]]))

#no copies created yet
object_size(dfs, combo_dfs)  # 48MB

#bind dfs by rows
combo_df <- do.call(rbind, combo_dfs)

#now data gets copied
object_size(combo_df)  # 96 MB
object_size(dfs, combo_df)  # 144 MB

How can I avoid copying my data but still get the same final result?

Best answer

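Storing the values the way you want would require R to apply some form of compression to the data frame, so that repeated blocks of rows are stored only once. I do not believe data frames support this.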
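A rough way to see why the row-bound result cannot share memory with its inputs: each column of a data frame is a single vector, so every column of the result has to be a newly allocated vector holding all of its rows. The back-of-the-envelope sketch below reproduces the sizes reported in the question. It assumes 8 bytes per double and ignores the small per-object overhead; the variable names are illustrative only.

```r
# Sizes taken from the question: 3 data frames, 2 numeric columns, 1e6 rows each
bytes_per_double <- 8
n_rows <- 1e6
n_cols <- 2
n_dfs  <- 3

# Memory held by the original data frames
input_bytes <- n_dfs * n_cols * n_rows * bytes_per_double

# The result has 2 * n_cols columns (two frames side by side) and
# one block of n_rows rows per unique pair of data frames
n_pairs      <- choose(n_dfs, 2)
result_bytes <- (2 * n_cols) * (n_pairs * n_rows) * bytes_per_double

input_bytes / 1e6                   # MB held by the inputs
result_bytes / 1e6                  # MB needed by the row-bound result
(input_bytes + result_bytes) / 1e6  # MB when both are kept alive
```

The three figures correspond to the 48 MB, 96 MB, and 144 MB shown by `object_size()` in the question.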

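If your motivation for storing the data this way is that it is hard to fit into memory, you could try the ff package. It would let you store the data on disk in a more compact form. The ffdf class appears to have the properties you need: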

By default, creating an 'ffdf' object will NOT create new ff files, instead existing files are referenced. This differs from data.frame, which always creates copies of the input objects, most notably in data.frame(matrix()), where an input matrix is converted to single columns. ffdf by contrast, will store an input matrix physically as the same matrix and virtually map it to columns.

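In addition, the ff package is optimized for fast access.

Note that I have not used this package myself, so I cannot guarantee that it will solve your problem.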
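If the model can score rows in batches, another option (not part of the original answer) might be to skip the row-bind entirely: keep only the pair names and build each column-bound block when it is needed, since `cbind` was shown above not to copy the columns. The sketch below uses tiny stand-in data frames. `make_block` and `score_block` are hypothetical helpers, and `score_block` merely stands in for something like `predict(model, newdata = block)`.

```r
# Small stand-ins for the data frames in the question
dfs <- list(
  df1 = data.frame(a = 1:5 + 0, b = 1:5 + 1),
  df2 = data.frame(a = 1:5 + 2, b = 1:5 + 3),
  df3 = data.frame(a = 1:5 + 4, b = 1:5 + 5)
)

# Unique pairs of data frame names, as in the question
pairs <- combn(names(dfs), 2, simplify = FALSE)

# Hypothetical helper: build one column-bound block on demand
make_block <- function(pair, dfs) {
  block <- cbind(dfs[[pair[1]]], dfs[[pair[2]]])
  names(block) <- c("a1", "b1", "a2", "b2")  # avoid duplicated column names
  block
}

# Hypothetical stand-in for predict(model, newdata = block)
score_block <- function(block) rowSums(block)

# Score one block at a time; only the predictions are concatenated
preds <- unlist(
  lapply(pairs, function(p) score_block(make_block(p, dfs))),
  use.names = FALSE
)

length(preds)  # one prediction per row of the would-be combined data frame
```

Only one block exists at a time, and each block reuses the columns of its two inputs, so the 96 MB combined data frame is never materialized. This only helps when predictions for one row do not depend on other rows.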

For "r - Row-binding data.frames in R without creating copies", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36870416/
