java - hadoop mapreduce : handling a text file with a header

I am playing with and learning Hadoop MapReduce.

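I am trying to map data from a VCF file ( http://en.wikipedia.org/wiki/Variant_Call_Format ): VCF is a tab-delimited file that starts with a (possibly large) header. This header is required to get the semantics of the records in the body.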

I would like to create a Mapper that uses this data. The header must be accessible from this Mapper in order to decode the lines.

Following http://jayunit100.blogspot.fr/2013/07/hadoop-processing-headers-in-mappers.html , I created this InputFormat with a custom reader:

  public static class VcfInputFormat extends FileInputFormat<LongWritable, Text>
    {
    /* the VCF header is stored here */
    private List<String> headerLines=new ArrayList<String>();

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new VcfRecordReader();
        }  
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
        }

     private class VcfRecordReader extends LineRecordReader
        {
        /* reads all lines starting with '#' */
         @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            super.initialize(genericSplit, context);
            headerLines.clear(); // fill the outer field instead of shadowing it with a local list
            while( super.nextKeyValue())
                {
                String row = super.getCurrentValue().toString();
                if(!row.startsWith("#")) throw new IOException("Bad VCF header");
                headerLines.add(row);
                if(row.startsWith("#CHROM")) break;
                }
            }
        }
    }

Now, in the Mapper, is there a way to get a pointer to VcfInputFormat.this.headerLines in order to decode the lines?

  public static class VcfMapper
       extends Mapper<LongWritable, Text, Text, IntWritable>{

    public void map(LongWritable key, Text value, Context context ) throws IOException, InterruptedException {
      my.VcfCodec codec=new my.VcfCodec(???????.headerLines);
      my.Variant variant =codec.decode(value.toString());
      //(....)
    }
  }

Best answer

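I think your case is different from the example you linked to. In that example, the header is used inside the custom RecordReader class to build a single "current value", a line composed of all the filtered words, which is then passed to the mapper. In your case, however, you want to use the header information outside of your RecordReader, i.e. in your Mapper, and that cannot be achieved.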

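I also think you can mimic the behavior of the linked example by handing the mapper already-processed information: read the header lines, store them, and then, when the current value is requested, have the mapper receive a my.VcfCodec object instead of a Text object (i.e. the getCurrentValue method returns a my.VcfCodec object). Your mapper could look something like...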

public static class VcfMapper extends Mapper<LongWritable, my.VcfCodec, Text, IntWritable>{
    @Override
    public void map(LongWritable key, my.VcfCodec value, Context context ) throws IOException, InterruptedException {
        // whatever you may want to do with the encoded data...
    }
}
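
To make the "read the header, store it, then hand out decoded values" idea more concrete, here is a minimal sketch in plain Java, deliberately without any Hadoop dependency. The names VcfHeaderSketch and HeaderAwareReader are hypothetical and only stand in for the role a custom RecordReader would play; decoding a row into a column-to-value map is likewise an assumption for illustration, not what my.VcfCodec actually does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, Hadoop-free sketch: consume the header once, then return decoded rows. */
final class VcfHeaderSketch {

    /** Stands in for a custom RecordReader; this is not a Hadoop class. */
    static final class HeaderAwareReader {
        private final BufferedReader in;
        private final List<String> headerLines = new ArrayList<>();
        private final String[] columns;

        HeaderAwareReader(BufferedReader in) {
            this.in = in;
            String[] cols = null;
            String row;
            // Read every '#' line up to and including the #CHROM column line.
            while ((row = readLine()) != null) {
                if (!row.startsWith("#")) {
                    throw new IllegalStateException("Bad VCF header");
                }
                headerLines.add(row);
                if (row.startsWith("#CHROM")) {
                    cols = row.substring(1).split("\t");
                    break;
                }
            }
            if (cols == null) {
                throw new IllegalStateException("Missing #CHROM line");
            }
            this.columns = cols;
        }

        /** Header lines collected while the reader was being set up. */
        List<String> headerLines() {
            return Collections.unmodifiableList(headerLines);
        }

        /** Next body row decoded as column name to value, or null at end of input. */
        Map<String, String> next() {
            String row = readLine();
            if (row == null) {
                return null;
            }
            String[] fields = row.split("\t", -1);
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < columns.length && i < fields.length; i++) {
                record.put(columns[i], fields[i]);
            }
            return record;
        }

        private String readLine() {
            try {
                return in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        String vcf = "##fileformat=VCFv4.1\n"
                + "#CHROM\tPOS\tID\tREF\tALT\n"
                + "1\t100\trs1\tA\tG\n";
        HeaderAwareReader reader =
                new HeaderAwareReader(new BufferedReader(new StringReader(vcf)));
        Map<String, String> record;
        while ((record = reader.next()) != null) {
            // The consumer never sees the header, only rows already decoded with it.
            System.out.println(record.get("CHROM") + ":" + record.get("POS")
                    + " " + record.get("REF") + ">" + record.get("ALT"));
        }
    }
}
```

In a real job the same logic would presumably sit inside the custom RecordReader: initialize would consume the header, and getCurrentValue would return the decoded value, so the mapper never needs a reference to the InputFormat.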

Regarding java - hadoop mapreduce : handling a text file with a header, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30052859/
