r - Apply a custom function to an entire column of a data.table?

I have a large data.table with two columns, and I want to apply custom functions to specific columns. The code that reproduces the problem is:

require(data.table)
X <- rep("This is just random text", 1e5)
data <- data.frame(1:1e5, replicate(1, X, simplify=FALSE), stringsAsFactors=FALSE)
colnames(data) <- paste("X", seq_len(ncol(data)), sep="")
DT <- as.data.table(data)

Now we have a large data.table that looks like this:

| X1 |            X2           |
|----|-------------------------|
| 1  | This is just random text|
| 2  | This is just random text|
| 3  | This is just random text|
| 4  | This is just random text|
| .. |            ...          |

Given that this data.table is going to be very large (around 100M rows), how should I perform a vectorized operation on any one of these columns?

Take column X1 as an example. Suppose I want to apply the following function to it:

Fun4X1 <- function(x){return(x+x*2)}

And there is a very complex NLP function for column X2 that looks like this:

Fun4X2 <- function(x){
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}

How should I do this for a large dataset? Please suggest the least time-consuming approach, since my function itself is quite complex.

P.S. I have tried foreach, sapply, and of course a for-loop, and all of them are very slow even on fairly good hardware.

Best answer

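The approach should be no different from applying any other built-in (or package-loaded) function to specific columns of a data.table: use a list(fun(variable), otherfun(othervariable)) type of construct. You can also name the resulting columns if you need to; otherwise they will be named "V1", "V2", and so on.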

In other words, for your problem, you can do the following:

DT[, list(X1 = Fun4X1(X1), X2 = Fun4X2(X2))]
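To make the naming behavior concrete, here is a minimal sketch on a three-row stand-in table. The helper `first_word()` is a hypothetical vectorized replacement for Fun4X2, and the column names Y1 and Y2 are made up for illustration. The last line uses data.table's `:=` to add the results by reference, which may be worth considering for a table with ~100M rows because it avoids building a second table; treat that as a suggestion rather than part of the original answer.

```r
library(data.table)

# Small stand-in for the large table in the question
DT <- data.table(X1 = 1:3, X2 = c("a b c", "d e f", "g h i"))

Fun4X1 <- function(x) x + x * 2

# Hypothetical vectorized first-word extractor (base R only)
first_word <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}

# Unnamed list elements receive the default names V1 and V2
res_default <- DT[, list(Fun4X1(X1), first_word(X2))]
names(res_default)

# Named list elements keep the names you give them
res_named <- DT[, list(X1 = Fun4X1(X1), X2 = first_word(X2))]
names(res_named)

# Alternative: add the results as new columns by reference
DT[, c("Y1", "Y2") := list(Fun4X1(X1), first_word(X2))]
```

The first two calls return new data.tables and leave DT untouched; only the `:=` call modifies DT itself.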

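However, I suspect that much of your slowdown comes from the function you are actually using. Compare the following incremental improvements: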

Fun4X2.old <- function(x){
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}

Fun4X2.new1 <- function(x) {
  vapply(strsplit(x, " "), 
         function(y) y[1], character(1))
} 

Fun4X2.new2 <- function(x) {
  vapply(strsplit(x, " ", fixed=TRUE), 
         function(y) y[1], character(1))
} 

Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)

X <- rep("This is just random text", 1e5)    

system.time(out1 <- Fun4X2.old(X))
#    user  system elapsed 
#  18.838   0.000  18.659 
system.time(out2 <- Fun4X2.new1(X))
#    user  system elapsed 
#   0.000   0.000   0.944 
system.time(out3 <- Fun4X2.new2(X))
#    user  system elapsed 
#   1.584   0.000   0.270 
system.time(out4 <- Fun4X2.sub(X))
#    user  system elapsed 
#   0.000   0.000   0.222
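Before swapping one variant for another, it can be worth checking that they agree on inputs that differ from row to row. The sketch below re-declares two of the variants under hypothetical names (`first_vapply`, `first_sub`) so that it runs on its own; the sample strings are invented, and each contains at least one space.

```r
# Same logic as Fun4X2.new2 above: split on a literal space, keep the first piece
first_vapply <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}

# Same logic as Fun4X2.sub above: capture everything before the first space
first_sub <- function(x) sub("(.+?) .*", "\\1", x)

# Made-up sample with a different first word in each element
x <- c("This is just random text", "another sample sentence", "two words")

res_vapply <- first_vapply(x)
res_sub <- first_sub(x)

# TRUE when both approaches return the same character vector
same <- identical(res_vapply, res_sub)
same
```

This only confirms agreement on the sample shown; inputs such as empty strings or NA values would need their own check.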

As a final note, regarding your comment here:

@AnandaMahto I am looking for something similar to this but if I use your solution then the output on text column in not vectorized and I get same output even if I have different text in each row

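Incidentally, that is the same behavior your original Fun4X2() (renamed Fun4X2.old() above) exhibits: its [[1]][1] indexing keeps only the first word of the first element, and that single value is then recycled across every row.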

DT2 <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))
DT2[, list(Fun4X1(X1), Fun4X2.old(X2))]
#    V1 V2
# 1:  3  a
# 2:  6  a
# 3:  9  a
# 4: 12  a

DT2[, list(Fun4X1(X1), Fun4X2.new1(X2))]
#    V1 V2
# 1:  3  a
# 2:  6  d
# 3:  9  g
# 4: 12  j
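The difference between the two outputs comes down to the length of what each function returns. Here is a minimal base R sketch of that, using `strsplit()` as a stand-in for `str_split()` (an assumption made so the example needs no extra packages); the variable names are illustrative.

```r
x <- c("a b c", "d e f", "g h i", "j k l")

# One list element per input string, each holding that string's words
parts <- strsplit(x, " ", fixed = TRUE)

# Indexing with [[1]][1] looks only at the first string: a single value
first_only <- parts[[1]][1]

# Taking the first word of every element: one value per input string
per_row <- vapply(parts, function(y) y[1], character(1))

length(first_only)  # 1
length(per_row)     # 4
```

When a length-1 result is placed next to a length-4 column inside list(), it is recycled to fill all four rows, which is why every row shows "a" in the first output above.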

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/21286717/
