java - Min Max Count using MapReduce

Tags: java hadoop mapreduce

I have developed a MapReduce application, following the book by Donald Miner, that determines when a user first and last posted a comment and the total number of comments posted by that user.

The problem with my algorithm is in the reducer. I group the comments by user ID. My test data contains two user IDs, each of which posted 3 comments on different dates, so there are 6 rows in total.

So my reducer output should print two records, each showing the user's first and last comment time and the total number of comments for that user ID.

However, my reducer is printing 6 records. Can anyone point out what is wrong with the following code?

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.arjun.mapreduce.patterns.mapreducepatterns.MRDPUtils;

import com.sun.el.parser.ParseException;

public class MinMaxCount {

    public static class MinMaxCountMapper extends 
            Mapper<Object, Text, Text, MinMaxCountTuple> {

        private Text outuserId = new Text();
        private MinMaxCountTuple outTuple = new MinMaxCountTuple();

        private final static SimpleDateFormat sdf = 
                     new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSS");

        @Override
        protected void map(Object key, Text value,
                org.apache.hadoop.mapreduce.Mapper.Context context)
                throws IOException, InterruptedException {

            Map<String, String> parsed = 
                     MRDPUtils.transformXMLtoMap(value.toString());

            String date = parsed.get("CreationDate");
            String userId = parsed.get("UserId");

            try {
                Date creationDate = sdf.parse(date);
                outTuple.setMin(creationDate);
                outTuple.setMax(creationDate);
            } catch (java.text.ParseException e) {
                System.err.println("Unable to parse Date in XML");
                System.exit(3);
            }

            outTuple.setCount(1);
            outuserId.set(userId);

            context.write(outuserId, outTuple);

        }

    }

    public static class MinMaxCountReducer extends 
            Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {

        private MinMaxCountTuple result = new MinMaxCountTuple();


        protected void reduce(Text userId, Iterable<MinMaxCountTuple> values,
                org.apache.hadoop.mapreduce.Reducer.Context context)
                throws IOException, InterruptedException {

            result.setMin(null);
            result.setMax(null);
            result.setCount(0);
            int sum = 0;
            int count = 0;
            for(MinMaxCountTuple tuple: values )
            {
                if(result.getMin() == null || 
                        tuple.getMin().compareTo(result.getMin()) < 0) 
                {
                    result.setMin(tuple.getMin());
                }

                if(result.getMax() == null ||
                        tuple.getMax().compareTo(result.getMax()) > 0)  {
                    result.setMax(tuple.getMax());
                }

                System.err.println(count++);

                sum += tuple.getCount();
            }

            result.setCount(sum);
            context.write(userId, result);
        }

    }

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String [] otherArgs = new GenericOptionsParser(conf, args)
                            .getRemainingArgs();
        if(otherArgs.length < 2 )
        {
            System.err.println("Usage MinMaxCout input output");
            System.exit(2);
        }


        Job job = new Job(conf, "Summarization min max count");
        job.setJarByClass(MinMaxCount.class);
        job.setMapperClass(MinMaxCountMapper.class);
        //job.setCombinerClass(MinMaxCountReducer.class);
        job.setReducerClass(MinMaxCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MinMaxCountTuple.class);

        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        boolean result = job.waitForCompletion(true);
        if(result)
        {
            System.exit(0);
        }else {
            System.exit(1);
        }

    }

}
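
The MinMaxCountTuple class is not shown in the question. For completeness, here is a minimal sketch of what such a custom Writable might look like, following the min/max/count summarization pattern from Donald Miner's book (the field types and the toString format below are assumptions, not the asker's actual code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.Writable;

public class MinMaxCountTuple implements Writable {

    private Date min = new Date();
    private Date max = new Date();
    private long count = 0;

    private final static SimpleDateFormat frmt =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    public Date getMin() { return min; }
    public void setMin(Date min) { this.min = min; }
    public Date getMax() { return max; }
    public void setMax(Date max) { this.max = max; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }

    @Override
    public void readFields(DataInput in) throws IOException {
        // dates are serialized as epoch milliseconds
        min = new Date(in.readLong());
        max = new Date(in.readLong());
        count = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(min.getTime());
        out.writeLong(max.getTime());
        out.writeLong(count);
    }

    @Override
    public String toString() {
        // tab-separated: min date, max date, count
        return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
    }
}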

Input: 
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-30T07:29:33.343" UserId="831878" />
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-01T07:29:33.343" UserId="831878" />
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="831878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-06-30T07:29:33.343" UserId="931878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-01T07:29:33.343" UserId="931878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="931878" />

output file contents part-r-00000:

831878  2011-07-30T07:29:33.343 2011-07-30T07:29:33.343 1
831878  2011-08-01T07:29:33.343 2011-08-01T07:29:33.343 1
831878  2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1
931878  2011-06-30T07:29:33.343 2011-06-30T07:29:33.343 1
931878  2011-07-01T07:29:33.343 2011-07-01T07:29:33.343 1
931878  2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1

job submission output:


12/12/16 11:13:52 INFO input.FileInputFormat: Total input paths to process : 1
12/12/16 11:13:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/16 11:13:52 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/16 11:13:52 INFO mapred.JobClient: Running job: job_201212161107_0001
12/12/16 11:13:53 INFO mapred.JobClient:  map 0% reduce 0%
12/12/16 11:14:06 INFO mapred.JobClient:  map 100% reduce 0%
12/12/16 11:14:18 INFO mapred.JobClient:  map 100% reduce 100%
12/12/16 11:14:23 INFO mapred.JobClient: Job complete: job_201212161107_0001
12/12/16 11:14:23 INFO mapred.JobClient: Counters: 26
12/12/16 11:14:23 INFO mapred.JobClient:   Job Counters 
12/12/16 11:14:23 INFO mapred.JobClient:     Launched reduce tasks=1
12/12/16 11:14:23 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12264
12/12/16 11:14:23 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/16 11:14:23 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/12/16 11:14:23 INFO mapred.JobClient:     Launched map tasks=1
12/12/16 11:14:23 INFO mapred.JobClient:     Data-local map tasks=1
12/12/16 11:14:23 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10124
12/12/16 11:14:23 INFO mapred.JobClient:   File Output Format Counters 
12/12/16 11:14:23 INFO mapred.JobClient:     Bytes Written=342
12/12/16 11:14:23 INFO mapred.JobClient:   FileSystemCounters
12/12/16 11:14:23 INFO mapred.JobClient:     FILE_BYTES_READ=204
12/12/16 11:14:23 INFO mapred.JobClient:     HDFS_BYTES_READ=888
12/12/16 11:14:23 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=43479
12/12/16 11:14:23 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=342
12/12/16 11:14:23 INFO mapred.JobClient:   File Input Format Counters 
12/12/16 11:14:23 INFO mapred.JobClient:     Bytes Read=761
12/12/16 11:14:23 INFO mapred.JobClient:   Map-Reduce Framework
12/12/16 11:14:23 INFO mapred.JobClient:     Map output materialized bytes=204
12/12/16 11:14:23 INFO mapred.JobClient:     Map input records=6
12/12/16 11:14:23 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/12/16 11:14:23 INFO mapred.JobClient:     Spilled Records=12
12/12/16 11:14:23 INFO mapred.JobClient:     Map output bytes=186
12/12/16 11:14:23 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/12/16 11:14:23 INFO mapred.JobClient:     Combine input records=0
12/12/16 11:14:23 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
12/12/16 11:14:23 INFO mapred.JobClient:     Reduce input records=6
12/12/16 11:14:23 INFO mapred.JobClient:     Reduce input groups=2
12/12/16 11:14:23 INFO mapred.JobClient:     Combine output records=0
12/12/16 11:14:23 INFO mapred.JobClient:     Reduce output records=6
12/12/16 11:14:23 INFO mapred.JobClient:     Map output records=6

Best Answer

Ah, caught the culprit. Just change the signature of your reduce method to the following:

protected void reduce(Text userId, Iterable<MinMaxCountTuple> values, Context context) throws IOException, InterruptedException {

Basically, you just need to use Context instead of org.apache.hadoop.mapreduce.Reducer.Context.

The output now looks like this:

831878  2011-07-30T07:29:33.343 2011-08-02T07:29:33.343 3
931878  2011-06-30T07:29:33.343 2011-08-02T07:29:33.343 3

I tested it locally for you and this change did the trick. It is odd behavior, though, and it would be great if someone could shed some light on it. It has to do with generics: when org.apache.hadoop.mapreduce.Reducer.Context is used, the IDE warns:

"Reducer.Context is a raw type. References to generic type Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context should be parameterized"

But when just 'Context' is used, there is no warning.
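
The most likely explanation: with the raw org.apache.hadoop.mapreduce.Reducer.Context parameter, the method's signature no longer matches the generic reduce(KEYIN, Iterable<VALUEIN>, Context) declared in Reducer, so it does not override it. The framework then calls the default reduce(), which simply writes every input value straight through, which is why 6 records come out even though the counters show Reduce input groups=2. Annotating the method with @Override would have turned this mistake into a compile error. A minimal sketch of the corrected reducer declaration (body unchanged from the question):

    public static class MinMaxCountReducer extends
            Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {

        private MinMaxCountTuple result = new MinMaxCountTuple();

        // @Override turns a non-overriding signature (such as one using the
        // raw Reducer.Context parameter type) into a compile-time error
        @Override
        protected void reduce(Text userId, Iterable<MinMaxCountTuple> values,
                Context context) throws IOException, InterruptedException {
            // ... same body as in the question ...
        }
    }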

The original question, java - Min Max Count using MapReduce, can be found on Stack Overflow: https://stackoverflow.com/questions/13904061/
