java - Mapper in Hadoop's MapReduce reads my input file twice

Tags: java hadoop mapreduce

I have a problem with a MapReduce program I wrote: it reads my input file twice. The question why is my sequence file being read twice in my hadoop mapper class? already covers this, but unfortunately it did not help.

My Mapper class is:

package com.siddu.mapreduce.csv;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SidduCSVMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Reused count of 1, emitted once per input record.
    private final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text line,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        // Split the semicolon-separated line and emit (thirdField, 1).
        String lineCSV = line.toString();
        String[] tokens = lineCSV.split(";");
        output.collect(new Text(tokens[2]), one);
    }
}

My Reducer class is:
package com.siddu.mapreduce.csv;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SidduCSVReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterator<IntWritable> inputFrmMapper,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        System.out.println("In reducer the key is: " + key.toString());

        // Sum the 1s emitted by the mapper for this key.
        int relationOccurrence = 0;
        while (inputFrmMapper.hasNext())
        {
            relationOccurrence += inputFrmMapper.next().get();
        }

        output.collect(key, new IntWritable(relationOccurrence));
    }
}

Finally, my driver class is:
package com.siddu.mapreduce.csv;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SidduCSVMapReduceDriver
{
    public static void main(String[] args)
    {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(SidduCSVMapReduceDriver.class);

        conf.setJobName("Siddu CSV Reader 1.0");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(SidduCSVMapper.class);
        conf.setReducerClass(SidduCSVReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] = input path, args[1] = output path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        client.setConf(conf);

        try
        {
            JobClient.runJob(conf);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

Best answer

You should be aware that Hadoop spawns multiple attempts of each task, usually two per mapper. If you are seeing your log output twice, that is probably the reason.
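Those extra attempts typically come from speculative execution, which the old mapred API lets you switch off on the JobConf. Below is a minimal sketch of how you could test this theory, assuming the duplication really does come from speculative attempts; the configure() override logs "mapred.task.id", which holds the attempt ID in the old (Hadoop 1.x) API:

// In the driver, before JobClient.runJob(conf): run each split in a
// single attempt only, so speculative duplicates go away.
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);

// In the mapper (MapReduceBase already provides a no-op configure()):
// log which attempt is running. Seeing the same split under two
// different attempt IDs confirms the task is re-run, not the file
// re-read. Requires: import org.apache.hadoop.mapred.JobConf;
@Override
public void configure(JobConf job)
{
    System.out.println("Task attempt: " + job.get("mapred.task.id"));
}

Note that even when speculative attempts run, the output committer should commit only one attempt's output, so if the duplicates appear in the final result rather than just in the logs, check first whether each split really shows up under two distinct attempt IDs.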

Regarding "java - Mapper in Hadoop's MapReduce reads my input file twice", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22464537/
