Hadoop Secondary Sort - To Use or Not to Use

Tags: hadoop, bigdata, secondary-sort

I have accident input data from a traffic data analysis. Some of the columns are:

Accident ID, Accident Date, Day of Week

1, 1/1/1979, 5 (Thursday)

2, 1/2/1979, 6 (Friday)

...

3, 1/1/1980, 0 (Sunday)

I am trying to solve the following problem:

Find the number of accidents per day of the week for each year.

So the output should look like this:

where the key is (year, day of week)

and the value is the number of accidents on that day.
Here the first row represents year = 1979, day = Sunday, number of accidents = 500, and so on.

1979,1     500

1979,2    1500

1979,3    2500

1979,4    3500

1979,5    4500

1979,6    5500

1979,7    6500

1980,1     500

1980,2    1500

1980,3    2500

1980,4    3500

1980,5    4500

In this case I tried to solve it using the secondary sort approach. Is that the right way to solve this problem?

If secondary sort is the right approach, it is not working for me. Below are the key class, the mapper, and the reducer, but my output is not as expected. Please help.
public class DOW implements WritableComparable<DOW> {
    private Text year;
    private Text day;

    public DOW() {
        this.year = new Text();
        this.day = new Text();
    }

    public DOW(Text year, Text day) {
        this.year = year;
        this.day = day;
    }

    public Text getYear() {
        return this.year;
    }

    public void setYear(Text year) {
        this.year = year;
    }

    public Text getDay() {
        return this.day;
    }

    public void setDay(Text day) {
        this.day = day;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year.readFields(in);
        day.readFields(in);

    }

    @Override
    public void write(DataOutput out) throws IOException {
        year.write(out);
        day.write(out);
    }

    @Override
    public int compareTo(DOW o) {
        int cmp = year.compareTo(o.year);
        if (cmp != 0) {
            return cmp;
        }
        // note: operands are swapped here, so day compares in descending order
        return o.day.compareTo(this.day);
    }

    @Override
    public String toString() {
        return year + "," + day;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof DOW) {
            DOW tp = (DOW) o;
            return year.equals(tp.year) && day.equals(tp.day);
        }
        return false;
    }

    @Override
    public int hashCode() {
        return year.hashCode() * 163 + day.hashCode();
    }
}

public class AccidentDowDemo extends Configured implements Tool {

    public static class DOWMapper extends Mapper<LongWritable, Text, DOW, IntWritable> {
        private static final Logger sLogger = Logger.getLogger(DOWMapper.class);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {

            if (value.toString().contains(",")) {
                String[] array = value.toString().split(",");
                if (!array[9].equals("Date")) {
                    Date dt = null;
                    try {
                        dt = new SimpleDateFormat("dd/mm/yyyy").parse(array[9]);

                    } catch (ParseException e) {
                        e.printStackTrace();
                    }

                    int year = dt.getYear();

                    int day = Integer.parseInt(array[10]);
                    context.write(
                            new DOW(new Text(Integer.toString(year)),
                                    new Text(Integer.toString(day))),
                            new IntWritable(1));
                }
            }
        }
    }

    public static class DOWReducer extends Reducer<DOW, IntWritable, DOW, IntWritable> {
        private static final Logger sLogger = Logger
                .getLogger(DOWReducer.class);

        @Override
        protected void reduce(DOW key, Iterable<IntWritable> values,
                Context context) throws java.io.IOException,
                InterruptedException {
            int count = 0;
            sLogger.info("key =" + key);
            for (IntWritable x : values) {
                int val = x.get();
                count = count + val;
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static class FirstPartitioner extends Partitioner<DOW, IntWritable> {

        @Override
        public int getPartition(DOW key, IntWritable value, int numPartitions) {
            return Math.abs(Integer.parseInt(key.getYear().toString()) * 127)
                    % numPartitions;
        }
    }

    public static class KeyComparator extends WritableComparator {
        protected KeyComparator() {
            super(DOW.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            DOW ip1 = (DOW) w1;
            DOW ip2 = (DOW) w2;
            int cmp = ip1.getYear().compareTo(ip2.getYear());
            if (cmp == 0) {
                cmp = -1 * ip1.getDay().compareTo(ip2.getDay());
            }
            return cmp;
        }
    }

    public static class GroupComparator extends WritableComparator {
        protected GroupComparator() {
            super(DOW.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {

            DOW ip1 = (DOW) w1;
            DOW ip2 = (DOW) w2;
            return ip1.getYear().compareTo(ip2.getYear());
        }
    }
}
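As an aside, independent of the secondary-sort question, DOWMapper's date handling has two subtle pitfalls that can corrupt the keys: in `SimpleDateFormat`, lowercase `mm` means minutes (the month field is uppercase `MM`), and the deprecated `Date.getYear()` returns the year minus 1900. A standalone sketch of the corrected parsing (plain Java, no Hadoop dependencies, so it can be verified outside a cluster):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateParseDemo {
    public static int yearOf(String raw) throws ParseException {
        // "MM" (uppercase) parses the month; "mm" would parse minutes
        Date dt = new SimpleDateFormat("dd/MM/yyyy").parse(raw);
        Calendar cal = Calendar.getInstance();
        cal.setTime(dt);
        // Calendar.YEAR gives the actual year, avoiding the deprecated
        // Date.getYear(), which returns year - 1900
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(yearOf("1/1/1979"));  // 1979
    }
}
```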

Best answer

If all you need is the basic equivalent of

select year, day, count(*) as totalPerDay from DATA group by year, day

then you do not need secondary sort.
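For that simple GROUP BY, a composite (year, day) key with a summing reducer is enough: the normal shuffle already brings equal composite keys together, with no custom sort or grouping comparators. The whole map/shuffle/reduce flow can be simulated in plain Java (illustrative names, no Hadoop dependencies):

```java
import java.util.Map;
import java.util.TreeMap;

public class GroupByDemo {
    // Simulates map -> shuffle -> reduce for
    // select year, day, count(*) ... group by year, day:
    // each record contributes 1 to its (year, day) composite key.
    public static Map<String, Integer> countPerYearDay(String[][] records) {
        Map<String, Integer> counts = new TreeMap<>();  // sorted, like shuffle output
        for (String[] rec : records) {
            String key = rec[0] + "," + rec[1];         // composite key "year,day"
            counts.merge(key, 1, Integer::sum);         // the reducer's sum
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"1979", "5"}, {"1979", "5"}, {"1979", "6"}, {"1980", "0"}
        };
        countPerYearDay(records).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```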

However, if you need to produce something CUBE-like, where both the per-year totals and the per-weekday counts must be computed in a single MR job, then secondary sort is the way to go.
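To see why secondary sort pays off in that case: with the partitioner and grouping comparator keyed on year but the sort comparator on (year, day), each reduce call receives one whole year with the days already in order, so a single pass can emit every per-day subtotal plus the year total. A plain-Java simulation of that reduce pass (illustrative names, assumes the input is already sorted the way the framework would deliver it):

```java
import java.util.ArrayList;
import java.util.List;

public class SecondarySortDemo {
    // Simulates one reduce call under secondary sort: values are sorted by
    // (year, day) but grouped by year only, so one pass can emit both the
    // per-day subtotals and a year total row.
    public static List<String> reduceYear(String year, int[] sortedDays) {
        List<String> out = new ArrayList<>();
        int dayCount = 0, yearTotal = 0, current = sortedDays[0];
        for (int day : sortedDays) {
            if (day != current) {                 // day boundary inside the group
                out.add(year + "," + current + "\t" + dayCount);
                current = day;
                dayCount = 0;
            }
            dayCount++;
            yearTotal++;
        }
        out.add(year + "," + current + "\t" + dayCount);
        out.add(year + ",total\t" + yearTotal);   // extra row a plain GROUP BY can't emit
        return out;
    }

    public static void main(String[] args) {
        // days arrive pre-sorted, as the sort comparator would deliver them
        reduceYear("1979", new int[]{0, 0, 0, 1, 1, 5}).forEach(System.out::println);
    }
}
```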

Regarding "Hadoop Secondary Sort - To Use or Not to Use", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32470773/
