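I am trying to process multiple list items in parallel.
My goal: run a labelling function on the values of each column, then return a data frame with the node name (frame.name below), the column name, and the resulting label.
The workflow works fine with a regular for loop. However, when I try the same thing inside a foreach loop, every value in the result is <NA>. (Please note: the following is only an abstraction of the original dataset.)
I am not sure what is happening in between. It would be great if you could help me figure this out :-)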
set.seed(12345)
options(stringsAsFactors = FALSE)
# I. Random data generation (Original data is in data frame format)
random.data = list()
random.data[["one"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["two"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["three"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
# II. Some function applied to each column to label/classify the values
valslabel = function(DataColumn) {
  if (mean(DataColumn) < 0.5) return("low")
  return("high")
}
# III. Generating the desired output in a regular for loop :
desiredOutput = list()
for (frame.i in seq_along(random.data)) {
  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0),
                             mappedField = character(0),
                             label = character(0))
  for (col.i in seq_len(ncol(frame))) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[, col.i])
  }
  desiredOutput[[frame.name]] = frame.results
}
print(desiredOutput)
# $one
# frame.name mappedField label
# 1 one V1 high
# 2 one V2 high
# 3 one V3 low
#
# $two
# frame.name mappedField label
# 1 two V1 low
# 2 two V2 high
# 3 two V3 low
#
# $three
# frame.name mappedField label
# 1 three V1 low
# 2 three V2 high
# 3 three V3 high
# IV. Using the "foreach" parallel execution
library(foreach)
library(doParallel)
cl = makeCluster(6)
registerDoParallel(cl)
output = foreach(frame.i = seq_along(random.data), .verbose = TRUE) %dopar% {
  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0),
                             mappedField = character(0),
                             label = character(0))
  for (col.i in seq_len(ncol(frame))) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[, col.i])
  }
  return(frame.results)
}
print(output)
# [[1]]
# frame.name mappedField label
# 1 <NA> <NA> <NA>
# 2 <NA> <NA> <NA>
# 3 <NA> <NA> <NA>
#
# [[2]]
# frame.name mappedField label
# 1 <NA> <NA> <NA>
# 2 <NA> <NA> <NA>
# 3 <NA> <NA> <NA>
#
# [[3]]
# frame.name mappedField label
# 1 <NA> <NA> <NA>
# 2 <NA> <NA> <NA>
# 3 <NA> <NA> <NA>
Thanks!
Best answer
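The problem is related to the way the data frame is initialised, combined with the fact that the option stringsAsFactors is not set to FALSE in the foreach worker environment. This is what happens in each foreach iteration: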
options(stringsAsFactors = TRUE)  # the default in R < 4.0.0
d <- data.frame(x = character(0))
d[1, "x"] <- "a"
#Warning message:
#In `[<-.factor`(`*tmp*`, iseq, value = "a") :
# invalid factor level, NA generated
d
# x
#1 <NA>
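Note that this only raises a warning, not an error, so the loop does not stop. If you set stringsAsFactors to FALSE first, there is no problem (which is what you did when not running things in parallel):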
options(stringsAsFactors = FALSE)
d <- data.frame(x = character(0))
d[1, "x"] <- "a"
d
# x
#1 a
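In your global environment you have set options(stringsAsFactors = FALSE), so the sequential loop (a plain for loop or %do%) works. However, this option is not passed on to the separate R session that runs each parallel job, so the %dopar% loop runs into the problem above.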
Look at the output of the following example:
options(stringsAsFactors = FALSE)
.Options$stringsAsFactors
#[1] FALSE
foreach(i = 1:3) %dopar% .Options$stringsAsFactors
#[[1]]
#[1] TRUE
#
#[[2]]
#[1] TRUE
#
#[[3]]
#[1] TRUE
So the solution is to set the option stringsAsFactors = FALSE inside the foreach loop.
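As a minimal sketch of that fix, the loop below sets the option as the first statement of the foreach body, so it is applied on the worker itself. The `toy.data` list is a hypothetical stand-in for `random.data`, the labelling rule is inlined rather than calling `valslabel`, and a two-worker cluster is assumed. On R 4.0.0 and later the default for stringsAsFactors is already FALSE, so the original failure should only reproduce on older versions.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)  # assumption: two local workers are enough for the demo
registerDoParallel(cl)

# Hypothetical toy input standing in for random.data from the question
toy.data <- list(one = data.frame(V1 = c(0.9, 0.8), V2 = c(0.1, 0.2)))

output <- foreach(frame.i = seq_along(toy.data)) %dopar% {
  # Set the option on the worker; options() from the master session are not copied over
  options(stringsAsFactors = FALSE)
  frame <- toy.data[[frame.i]]
  frame.results <- data.frame(mappedField = character(0), label = character(0))
  for (col.i in seq_len(ncol(frame))) {
    frame.results[col.i, "mappedField"] <- colnames(frame)[col.i]
    # Same rule as valslabel in the question, inlined to keep the sketch self-contained
    frame.results[col.i, "label"] <- if (mean(frame[, col.i]) < 0.5) "low" else "high"
  }
  frame.results
}

stopCluster(cl)
print(output)
```

Two alternatives that should have the same effect: pass `stringsAsFactors = FALSE` directly to the `data.frame()` call, which does not depend on any global option, or run `clusterEvalQ(cl, options(stringsAsFactors = FALSE))` once after creating the cluster.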
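As an aside, where possible it is much better to build a data frame from whole column vectors than to fill it row by row. In your example, you could replace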
frame.results = data.frame(frame.name = character(0), mappedField = character(0), label = character(0))
for (col.i in 1:ncol(frame)) {
  frame.results[col.i, "frame.name"] = frame.name
  frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
  frame.results[col.i, "label"] = valslabel(frame[, col.i])
}
with
frame.results <- data.frame(
  frame.name = frame.name,
  mappedField = colnames(frame),
  label = valslabel1(colMeans(frame)))
where the valslabel function has been replaced by a vectorized version:
valslabel1 <- function(x) {
  ifelse(x < 0.5, "low", "high")
}
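To see the column-wise construction end to end, here is a small sequential sketch. The helper `build_labels` and the `toy.data` list are hypothetical names introduced only for illustration; `stringsAsFactors = FALSE` is passed explicitly so the result does not depend on the session's options, and `row.names = NULL` is supplied so the rows are numbered rather than named after the columns.

```r
# Vectorised labelling function, as defined in the answer
valslabel1 <- function(x) {
  ifelse(x < 0.5, "low", "high")
}

# Hypothetical toy input standing in for random.data
toy.data <- list(
  one = data.frame(V1 = c(0.9, 0.8), V2 = c(0.1, 0.2)),
  two = data.frame(V1 = c(0.3, 0.1), V2 = c(0.7, 0.6))
)

# Hypothetical helper: builds the whole result frame from column vectors in one call
build_labels <- function(frame.name, frame) {
  data.frame(
    frame.name = frame.name,
    mappedField = colnames(frame),
    label = valslabel1(colMeans(frame)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# One result frame per input frame, named after the input list
results <- Map(build_labels, names(toy.data), toy.data)
print(results)
```

Because no value is ever assigned into an existing column here, this version never hits the "invalid factor level" warning; without the explicit `stringsAsFactors = FALSE`, older R versions would return factor columns, but they would still hold the correct labels.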
Regarding r - "foreach" parallel loop returns <NA>s, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33081342/