java - Converting text to a sequence file with MapReduce creates garbage characters

Tags: java hadoop mapreduce hadoop2 sequencefile

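I am using MapReduce to convert a text file to a sequence file and then back to text.
I get a number at the start of every line. How can I remove these numbers, or stop them from appearing in my output?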

Example text:

d001    Marketing

d002    Finance

d003    Human Resources

The converted sequence file:
0   d001    Marketing

15  d002    Finance

28  d003    Human Resources

The text converted back from the sequence file:
0   d001    Marketing

15  d002    Finance

28  d003    Human Resources

I want to remove the values 0, 15 and 28.

I am using the following code:
public class FormatConverterTextToSequenceDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Two parameters are required for FormatConverterTextToSequenceDriver-<input dir> <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(FormatConverterTextToSequenceDriver.class);
    job.setJobName("Create Sequence File, from text file");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(FormatConverterMapper.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setNumReduceTasks(0);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}
 -----------------------------------------------------------------
public class FormatConverterSequenceToTextDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {

    if (args.length != 2) {
      System.out
          .printf("Two parameters need to be supplied - <input dir> and <output dir>\n");
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(FormatConverterSequenceToTextDriver.class);
    job.setJobName("Convert Sequence File and Output as Text");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(FormatConverterMapper.class);
    job.setNumReduceTasks(0);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}
 -----------------------------------------------------------------
public class FormatConverterMapper extends
    Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }
}

Any help is appreciated.

Best answer

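When you convert from the sequence file back to text, you don't want to write out the LongWritable key again. So for that job, just adjust your map method to: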

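 // For this to compile, the mapper used by the sequence-to-text job must be
 // declared as Mapper<LongWritable, Text, Text, NullWritable>.
 @Override
 public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit only the line itself; the numeric key is dropped.
    context.write(value, NullWritable.get());
  }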

The output should then contain only the values themselves.
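
For background on where the numbers come from: a job that does not set an input format reads its input with TextInputFormat, whose key is the byte offset at which each line starts, and the default text output writes the key, a tab, and then the value. The sketch below reproduces that in plain Java without any Hadoop dependency. The class and method names (OffsetKeyDemo, lineOffsets, formatLine) are made up for illustration, and it assumes the sample file separates its columns with a single tab and ends each line with "\n", which is consistent with the 0, 15 and 28 shown in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; it only imitates the behaviour and is not Hadoop code.
class OffsetKeyDemo {

    // Byte offset at which each '\n'-terminated line starts. This imitates
    // the LongWritable key that TextInputFormat hands to the mapper.
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        for (String line : content.split("\n")) {
            offsets.add(position);
            // +1 accounts for the newline byte that split() removed.
            position += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // Imitates how a text output line is formed: "key<TAB>value" when a key
    // is written, and just the value when the key is left out (null here).
    static String formatLine(Long key, String value) {
        return key == null ? value : key + "\t" + value;
    }

    public static void main(String[] args) {
        // Assumed file content: single tab between columns, "\n" line endings.
        String content = "d001\tMarketing\nd002\tFinance\nd003\tHuman Resources\n";
        String[] lines = content.split("\n");
        List<Long> offsets = lineOffsets(content);
        for (int i = 0; i < lines.length; i++) {
            // What an identity mapper produces: offset key plus the line.
            System.out.println(formatLine(offsets.get(i), lines[i]));
            // What a value-only mapper produces: the line alone.
            System.out.println(formatLine(null, lines[i]));
        }
    }
}
```

Because the first job (text to sequence file) passes the offset through as the key, the offsets are stored in the sequence file; they only become visible as text when the second job writes both key and value, which is why dropping the key in the second job's mapper is enough.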

Regarding "java - Converting text to a sequence file with MapReduce creates garbage characters", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27859556/
