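I would like to know how to set a condition in a Hadoop reducer function to filter key-value pairs. For example, in the word count example below, how can I get only the words whose count is greater than some threshold (say 3)?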
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}
## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')
head(results.df)
Best answer
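reduce <- function(word, counts) {
  total <- sum(counts)
  # Emit the pair only when the count exceeds the threshold (3 here);
  # otherwise the function returns NULL and nothing is emitted for this word
  if (total > 3)
    keyval(word, total)
}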
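To see why this works without running a Hadoop job, here is a minimal sketch in base R that simulates the grouping step and applies a thresholded reducer. It rests on some assumptions: `keyval` below is a plain-list stand-in for `rmr2::keyval`, `make_reduce` is a hypothetical helper that makes the threshold a parameter, and the sample `words` vector is made up for illustration.

```r
# Stand-in for rmr2::keyval so the sketch runs without Hadoop
# (assumption: a plain list is enough to illustrate the logic).
keyval <- function(key, val) list(key = key, val = val)

# Hypothetical helper: builds a reducer that keeps only words whose
# total count is strictly greater than `threshold`.
make_reduce <- function(threshold) {
  function(word, counts) {
    total <- sum(counts)
    # No else branch: the function returns NULL when the test fails.
    if (total > threshold) keyval(word, total)
  }
}

# Simulate the shuffle step: group the 1s emitted by the mapper by word.
words <- c("a", "b", "a", "a", "c", "b", "a", "b", "b", "b")
grouped <- split(rep(1, length(words)), words)

# Apply the reducer once per distinct word, as the framework would.
reduce <- make_reduce(3)
results <- Map(reduce, names(grouped), grouped)

# Drop the NULLs, mimicking a reducer call that emits nothing.
kept <- Filter(Negate(is.null), results)
print(names(kept))
```

The key point is that an `if` without an `else` evaluates to `NULL` when the condition is false, so the reducer simply produces no pair for words at or below the threshold. If the full counts are needed elsewhere, an alternative is to keep the original reducer and filter afterwards, for example with `subset(results.df, count > 3)` on the data frame built in the question.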
Regarding r - filtering key-value pairs in a hadoop reducer function in R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33077684/