java - MapReduce count difference

I am trying to write a program that outputs the difference between counts in 2 columns. My data looks like this:

2,1
2,3
1,2
3,1
4,2

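I want to count the occurrences of each key in col1 and the occurrences of each key in col2, then take the difference (col1 count minus col2 count). The output should look like this: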

1,-1
2,0
3,0
4,1

Can this be done in a single MapReduce program (mapper, reducer)?

Best answer

In the mapper, emit two keys for each line, one for col1 and one for col2, where each value is a pair of counts {col1 count, col2 count}, like this:

2,1 -> 2:{1, 0} and 1:{0, 1}

2,3 -> 2:{1, 0} and 3:{0, 1}

1,2 -> 1:{1, 0} and 2:{0, 1}

3,1 -> 3:{1, 0} and 1:{0, 1}

4,2 -> 4:{1, 0} and 2:{0, 1}
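
To make the emission rule concrete, here is a minimal plain-Java sketch of the map step only, with no Hadoop dependency. The names `MapStepSketch` and `Emission` are hypothetical and exist only for this illustration; the sketch assumes every input line has exactly two comma-separated fields.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the map step (no Hadoop involved).
// MapStepSketch and Emission are hypothetical names, not part of any library.
class MapStepSketch {

    // One emitted record: a key plus its {col1 count, col2 count} pair.
    static final class Emission {
        final String key;
        final int[] counts;

        Emission(String key, int col1Count, int col2Count) {
            this.key = key;
            this.counts = new int[] {col1Count, col2Count};
        }

        @Override
        public String toString() {
            return key + ":{" + counts[0] + ", " + counts[1] + "}";
        }
    }

    // Turns one input line such as "2,1" into two emissions.
    // Assumes the line has exactly two comma-separated fields.
    static List<Emission> map(String line) {
        String[] cols = line.split(",");
        List<Emission> out = new ArrayList<>();
        out.add(new Emission(cols[0].trim(), 1, 0)); // key seen in col1
        out.add(new Emission(cols[1].trim(), 0, 1)); // key seen in col2
        return out;
    }

    public static void main(String[] args) {
        String[] input = {"2,1", "2,3", "1,2", "3,1", "4,2"};
        for (String line : input) {
            System.out.println(line + " -> " + map(line));
        }
    }
}
```

Carrying both counts in one value keeps the reducer free of any need to know which column a record came from.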

Then in the reducer you receive the following, where each line shows the key and the values passed to a single reduce call:

1 -> {0, 1}, {1, 0}, {0, 1} (adding the first entries and subtracting the second entries gives -1)

2 -> {1, 0}, {1, 0}, {0, 1}, {0, 1} (gives 0)

3 -> {0, 1}, {1, 0} (gives 0)

4 -> {1, 0} (gives 1)
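
The grouping and summing can be checked without a cluster. The following is a rough plain-Java sketch that imitates the shuffle with a sorted map and then applies the same add-and-subtract rule per key. `ReduceStepSketch`, `shuffle`, `reduce`, and `run` are hypothetical names chosen for this illustration, and the comma-separated output format is an assumption taken from the question's desired output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the shuffle and reduce steps (no Hadoop involved).
// ReduceStepSketch and its methods are hypothetical names, not a library API.
class ReduceStepSketch {

    // Groups {col1 count, col2 count} pairs by key, imitating the shuffle phase.
    // TreeMap orders keys as strings, which matches numeric order only for
    // single-digit keys like the ones in the sample data.
    static Map<String, List<int[]>> shuffle(String[] lines) {
        Map<String, List<int[]>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            grouped.computeIfAbsent(cols[0].trim(), k -> new ArrayList<>())
                   .add(new int[] {1, 0}); // key seen in col1
            grouped.computeIfAbsent(cols[1].trim(), k -> new ArrayList<>())
                   .add(new int[] {0, 1}); // key seen in col2
        }
        return grouped;
    }

    // Reduces one key's pairs to (col1 occurrences) - (col2 occurrences).
    static int reduce(List<int[]> pairs) {
        int diff = 0;
        for (int[] pair : pairs) {
            diff += pair[0];
            diff -= pair[1];
        }
        return diff;
    }

    // Runs shuffle and reduce, returning one "key,difference" row per key.
    static List<String> run(String[] lines) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, List<int[]>> entry : shuffle(lines).entrySet()) {
            rows.add(entry.getKey() + "," + reduce(entry.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] input = {"2,1", "2,3", "1,2", "3,1", "4,2"};
        for (String row : run(input)) {
            System.out.println(row);
        }
    }
}
```

For the sample data from the question, `run` yields the rows `1,-1`, `2,0`, `3,0`, and `4,1`, which matches the desired output.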

Update:

Here is a Hadoop example (it is untested and may need some tweaking to work):

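import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.ArrayPrimitiveWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TheMapper extends Mapper<LongWritable, Text, Text, ArrayPrimitiveWritable> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {

        StringTokenizer tok = new StringTokenizer(value.toString(), ",");

        // Key from col1: counts once toward col1, zero toward col2.
        Text col1 = new Text(tok.nextToken());
        context.write(col1, toArray(1, 0));

        // Key from col2: zero toward col1, counts once toward col2.
        Text col2 = new Text(tok.nextToken());
        context.write(col2, toArray(0, 1));
    }

    // Wraps the {col1 count, col2 count} pair in a Writable.
    private ArrayPrimitiveWritable toArray(int v1, int v2) {
        return new ArrayPrimitiveWritable(new int[] {v1, v2});
    }
}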

import java.io.IOException;

import org.apache.hadoop.io.ArrayPrimitiveWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TheReducer extends Reducer<Text, ArrayPrimitiveWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<ArrayPrimitiveWritable> values, Context context)
            throws IOException, InterruptedException {

        int count = 0;
        for (ArrayPrimitiveWritable value : values) {
            int[] counts = (int[]) value.get();
            count += counts[0]; // occurrences in col1
            count -= counts[1]; // occurrences in col2
        }

        context.write(key, new Text(Integer.toString(count)));
    }
}

Regarding java - MapReduce count difference, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47083481/
