java - Sending multiple parameters to the reducer - MapReduce

Tags: java hadoop mapreduce

I have written some code whose function is similar to a SQL GROUP BY.

The dataset I am working with looks like this:


250788681419,20090906,200937,200909,619,Sunday,Weekend,Online,Morning,Outgoing,Voice,25078,PAY_AS_YOU_GO_PER_SECOND_PSB,Successful service release,17,0,1,21.25,635-10 -112-30455


import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String line = value.toString();
        String[] attribute = line.split(",");
        // attribute[17] is the value to aggregate
        double rs = Double.parseDouble(attribute[17]);

        // Composite grouping key built from attributes 5, 8 and 10
        String comb = attribute[5] + attribute[8] + attribute[10];

        context.write(new Text(comb), new DoubleWritable(rs));
    }
}
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    // The new-API reduce() receives an Iterable, not an Iterator
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {

        double sum = 0;
        for (DoubleWritable val : values) {
            sum += val.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}

In the Mapper, attribute 17 is sent to the reducer as the value to be summed. Now I also want to sum attribute 14. How can I send both to the reducer?

Best Answer

If your values are of the same type, creating an ArrayWritable subclass should solve the problem. The class would look like:

public class DblArrayWritable extends ArrayWritable 
{ 
    public DblArrayWritable() 
    { 
        super(DoubleWritable.class); 
    }
}

Your mapper class would then look like this:

public class MyMap extends Mapper<LongWritable, Text, Text, DblArrayWritable> 
{
  public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException 
  {

    String line = value.toString();
    String[] attribute = line.split(",");

    // Wrap both fields in DoubleWritable instances
    DoubleWritable[] values = new DoubleWritable[2];
    values[0] = new DoubleWritable(Double.parseDouble(attribute[14]));
    values[1] = new DoubleWritable(Double.parseDouble(attribute[17]));

    String comb = attribute[5] + attribute[8] + attribute[10];

    DblArrayWritable out = new DblArrayWritable();
    out.set(values);
    context.write(new Text(comb), out);

  }
}

In your reducer, you should now be able to iterate over the values of the DblArrayWritable.
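The per-key aggregation the reducer would perform can be sketched without the Hadoop harness. The class and method names below are illustrative, not part of the Hadoop API: each value carries the pair (attribute 14, attribute 17), and the reducer keeps one running sum per position.

```java
import java.util.Arrays;
import java.util.List;

public class PairSum {
    // Sum each position across all pairs for one key, mirroring what the
    // reduce() loop over DblArrayWritable values would do.
    static double[] sumPairs(List<double[]> values) {
        double[] sums = new double[2];
        for (double[] pair : values) {
            sums[0] += pair[0]; // running total of attribute 14
            sums[1] += pair[1]; // running total of attribute 17
        }
        return sums;
    }

    public static void main(String[] args) {
        List<double[]> values = Arrays.asList(
                new double[]{17, 21.25},
                new double[]{3, 10.75});
        System.out.println(Arrays.toString(sumPairs(values))); // [20.0, 32.0]
    }
}
```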

From your sample data, however, it looks like the fields may be of different types. You might be tempted to write an ObjectArrayWritable along the same lines, but note that ArrayWritable's constructor takes a Class<? extends Writable>, so passing Object.class will not compile:

public class ObjArrayWritable extends ArrayWritable 
{ 
    public ObjArrayWritable() 
    { 
        super(Object.class); // does not compile: Object is not a Writable
    }
}

Alternatively, you can handle mixed types by simply concatenating the values into a single delimited Text value, passing that to the reducer, and splitting it apart again on the reduce side.
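A minimal sketch of that encode/decode step (the delimiter choice and helper names are illustrative, not part of any API):

```java
public class DelimitedPair {
    // Pack the two attribute values into one delimited string on the map
    // side; unpack them again on the reduce side.
    static String pack(double a14, double a17) {
        return a14 + "|" + a17;
    }

    static double[] unpack(String packed) {
        String[] parts = packed.split("\\|");
        return new double[]{Double.parseDouble(parts[0]),
                            Double.parseDouble(parts[1])};
    }

    public static void main(String[] args) {
        String v = pack(17, 21.25);
        double[] back = unpack(v);
        System.out.println(v);                        // 17.0|21.25
        System.out.println(back[0] + " " + back[1]);  // 17.0 21.25
    }
}
```

On the reduce side, each Text value would go through unpack() before the two sums are accumulated.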

Another option is to implement your own Writable class. Here is an example of how that works:

public static class PairWritable implements Writable 
{
    private Double myDouble;
    private String myString;

    // Hadoop serialization/Writable interface methods; fields must be
    // read back in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
            myDouble = in.readDouble();
            myString = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
            out.writeDouble(myDouble);
            out.writeUTF(myString);
    }

    //End of implementation

    //Getter and setter methods for the myDouble and myString fields
    public void set(Double d, String s) {
        myDouble = d;
        myString = s;
    }

    public Double getDouble() {
        return myDouble;
    }
    public String getString() {
        return myString;
    }

}
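The write/readFields contract above can be exercised outside Hadoop with plain java.io streams. This standalone sketch (it does not implement the actual Writable interface) mimics the same serialization round trip:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PairRoundTrip {
    // Standalone analogue of PairWritable: a double and a string serialized
    // in a fixed order, then read back in the same order.
    static byte[] write(double d, String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeDouble(d);   // same order as PairWritable.write()
        out.writeUTF(s);
        return bytes.toByteArray();
    }

    static Object[] read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        double d = in.readDouble(); // same order as PairWritable.readFields()
        String s = in.readUTF();
        return new Object[]{d, s};
    }

    public static void main(String[] args) throws IOException {
        Object[] back = read(write(21.25, "SundayMorningVoice"));
        System.out.println(back[0] + " " + back[1]); // 21.25 SundayMorningVoice
    }
}
```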

Regarding java - sending multiple parameters to the reducer in MapReduce, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14516029/
