Hadoop: Reducer writes Mapper output into the output file

Tags: hadoop mapreduce reduce

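I've run into a very, very strange problem. The reducer does run, but when I check the output files, I only find the mapper's output. While trying to debug this, I found that the word count example shows the same problem once I change the mapper's output value type from LongWritable to Text.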

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

   public static class Map
       extends Mapper<LongWritable, Text, Text, Text> {
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text wtf, Context context)
         throws IOException, InterruptedException {
       String line = wtf.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, new Text("frommapper"));
       }
     }
   }

   public static class Reduce
       extends Reducer<Text, Text, Text, Text> {
     public void reduce(Text key, Text wtfs,
         Context context) throws IOException, InterruptedException {
/*
       int sum = 0;
       for (IntWritable val : wtfs) {
         sum += val.get();
       }
       context.write(key, new IntWritable(sum));*/
       context.write(key, new Text("can't output"));
     }
   }

   public int run(String [] args) throws Exception {
     Job job = new Job(getConf());
     job.setJarByClass(WordCount.class);
     job.setJobName("wordcount");


     job.setOutputKeyClass(Text.class);
     job.setMapOutputValueClass(Text.class);
     job.setOutputValueClass(Text.class);
     job.setMapperClass(Map.class);
     //job.setCombinerClass(Reduce.class);
     job.setReducerClass(Reduce.class);

     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);

     FileInputFormat.setInputPaths(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));

     boolean success = job.waitForCompletion(true);
     return success ? 0 : 1;
   }

   public static void main(String[] args) throws Exception {
     int ret = ToolRunner.run(new WordCount(), args);
     System.exit(ret);
   }
}

Here is the result:

JobClient:     Combine output records=0
12/06/13 17:37:46 INFO mapred.JobClient:     Map input records=7
12/06/13 17:37:46 INFO mapred.JobClient:     Reduce shuffle bytes=116
12/06/13 17:37:46 INFO mapred.JobClient:     Reduce output records=7
12/06/13 17:37:46 INFO mapred.JobClient:     Spilled Records=14
12/06/13 17:37:46 INFO mapred.JobClient:     Map output bytes=96
12/06/13 17:37:46 INFO mapred.JobClient:     Combine input records=0
12/06/13 17:37:46 INFO mapred.JobClient:     Map output records=7
12/06/13 17:37:46 INFO mapred.JobClient:     Reduce input records=7

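Then I found this strange result in the output file. The problem appears once I change the map output value type and the reducer's input value type to Text, whether or not I change the reduce output value type. I was also forced to change the job setup to job.setOutputValueClass(Text.class).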

a   frommapper
a   frommapper
a   frommapper
gg  frommapper
h   frommapper
sss frommapper
sss frommapper

Help!

Best answer

Your reduce function's parameters should look like this:

public void reduce(Text key, Iterable<Text> wtfs,
     Context context) throws IOException, InterruptedException {

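With the parameters defined the way you have them, your method never receives the list of values. Its second parameter is a single Text rather than an Iterable<Text>, so it does not override Reducer.reduce(); it is only an overload that the framework never calls. Hadoop runs the inherited default reduce() instead, which writes every <key, value> pair it receives straight through, so the output file contains exactly what the map function emitted. That is also why "can't output" never appears. Annotating the method with @Override would have turned this mistake into a compile error.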
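The override-versus-overload behaviour can be reproduced without a cluster. The sketch below is only an illustration: BaseReducer is a hypothetical stand-in whose default reduce() passes values through unchanged, and it is not Hadoop's Reducer class. WrongReduce repeats the signature from the question, RightReduce uses the Iterable form, and run() plays the part of the framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverloadPitfall {

    // Hypothetical stand-in for a framework base class; NOT Hadoop's Reducer.
    // Its default reduce() is an identity: it emits every value unchanged.
    public static class BaseReducer {
        public void reduce(String key, Iterable<String> values, List<String> out) {
            for (String value : values) {
                out.add(key + "\t" + value);
            }
        }
    }

    // Same mistake as in the question: the second parameter is a single value,
    // so this is a new overload and the caller in run() never reaches it.
    public static class WrongReduce extends BaseReducer {
        public void reduce(String key, String value, List<String> out) {
            out.add(key + "\tcan't output");
        }
    }

    // Matching signature: @Override makes the compiler verify the match.
    public static class RightReduce extends BaseReducer {
        @Override
        public void reduce(String key, Iterable<String> values, List<String> out) {
            out.add(key + "\tfromreducer");
        }
    }

    // Plays the role of the framework: it only knows the base-class signature.
    public static List<String> run(BaseReducer reducer) {
        List<String> out = new ArrayList<>();
        reducer.reduce("a", Arrays.asList("frommapper", "frommapper"), out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new WrongReduce()));
        System.out.println(run(new RightReduce()));
    }
}
```

With WrongReduce the two "frommapper" records come back untouched, while RightReduce produces a single record for the key. Adding @Override to WrongReduce.reduce would make the class fail to compile, which is the cheapest way to catch this kind of mistake.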

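Also, in a job that has a reduce phase, mapper functions do not write to the output files; it is always the reducers that do. (Only a map-only job, with the number of reduce tasks set to zero, writes mapper output directly.) Mapper output is intermediate data that Hadoop handles transparently. So whatever you see in the output file was written during the reduce phase, even when, as here, it is identical to what the mappers emitted. If you want to verify this, go to the logs of the job you ran and check what happened in each mapper and reducer individually.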
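The job counters in the question give a quick check as well: "Reduce output records=7" equals "Reduce input records=7", which is what a pass-through reduce produces. A reducer that writes once per key would emit only 4 records for the keys a, gg, h and sss. The sketch below imitates the grouping step in memory to show that count; it is a simplification under that assumption, not Hadoop's actual shuffle, and the names ShuffleSketch, groupByKey and reduceAll are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    // Groups map output by key, roughly the way values reach reduce():
    // one call per distinct key, with all of that key's values together.
    public static Map<String, List<String>> groupByKey(List<String> words) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add("frommapper");
        }
        return grouped;
    }

    // A reducer that writes one record per key (here: the number of values).
    public static List<String> reduceAll(Map<String, List<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            out.add(entry.getKey() + "\t" + entry.getValue().size());
        }
        return out;
    }

    public static void main(String[] args) {
        // The seven keys that appear in the question's output file.
        List<String> words = Arrays.asList("a", "a", "a", "gg", "h", "sss", "sss");
        List<String> out = reduceAll(groupByKey(words));
        System.out.println(words.size() + " map output records");
        System.out.println(out.size() + " reduce output records");
        out.forEach(System.out::println);
    }
}
```

Seven map output records collapse into four reduce output records once reduce() is really called per key, so a reduce output count equal to the reduce input count is a hint that the custom reduce() was never invoked.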

Hope this helps.

Regarding "Hadoop: Reducer writes Mapper output into the output file", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/11025390/
