java - Converting text to a sequence file with MapReduce produces junk characters

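I am using MapReduce to convert a text file into a sequence file and then back into text.
Each line of the output now starts with a number. How can I remove these numbers, or prevent them from appearing in my output?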

Example text:

d001    Marketing

d002    Finance

d003    Human Resources

The converted sequence file:

0   d001    Marketing

15  d002    Finance

28  d003    Human Resources

Text converted back from the sequence file:

0   d001    Marketing

15  d002    Finance

28  d003    Human Resources

I would like to remove the values 0, 15, and 28.

I am using the following code:

public class FormatConverterTextToSequenceDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Two parameters are required for FormatConverterTextToSequenceDriver-<input dir> <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(FormatConverterTextToSequenceDriver.class);
    job.setJobName("Create Sequence File, from text file");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(FormatConverterMapper.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setNumReduceTasks(0);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}
 -----------------------------------------------------------------
public class FormatConverterSequenceToTextDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.out
          .printf("Two parameters need to be supplied - <input dir> and <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(FormatConverterSequenceToTextDriver.class);
    job.setJobName("Convert Sequence File and Output as Text");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(FormatConverterMapper.class);
    job.setNumReduceTasks(0);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}
 -----------------------------------------------------------------
public class FormatConverterMapper extends
    Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }
}

Any help is appreciated.

Best answer

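When you convert back from the sequence file to text, you do not want to write out the LongWritable key, which is what puts the number at the start of each line. Just change your map method to: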

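 // Use this mapper only in the sequence-to-text job, and declare it as
 // Mapper<LongWritable, Text, Text, NullWritable> so that this call compiles.
 @Override
 public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the line as the key with an empty value; the offset key is dropped.
    context.write(value, NullWritable.get());
  }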

The output should then contain only the values themselves.
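
To see where the numbers come from: with Hadoop's default TextInputFormat, the key handed to the mapper is the byte offset at which the line starts in the input file, and the mapper in the question writes that key through unchanged. The plain-Java sketch below (no Hadoop required) recomputes those offsets for the sample data. It assumes the two columns are separated by a single tab and that every line ends with a single `\n`; the class and method names are hypothetical and only serve the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineOffsets {

    // Returns the byte offset at which each line starts, assuming UTF-8
    // text in which every line is terminated by a single '\n'.
    static List<Long> lineStartOffsets(List<String> lines) {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        for (String line : lines) {
            offsets.add(position);
            // Advance past the line's bytes plus one byte for the '\n'.
            position += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Sample rows from the question, assuming a tab between the columns.
        List<String> lines = Arrays.asList(
            "d001\tMarketing",
            "d002\tFinance",
            "d003\tHuman Resources");
        System.out.println(lineStartOffsets(lines)); // prints [0, 15, 28]
    }
}
```

Under those assumptions the computed offsets match the 0, 15, and 28 seen in the output, which suggests the numbers are map input keys rather than corrupted data.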

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/27859556/
