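I am using MapReduce to convert a text file to a sequence file and then back to text.
I get a number at the beginning of every line. How can I remove these numbers or prevent them from appearing in my output?
For example, the text: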
d001 Marketing
d002 Finance
d003 Human Resources
Sequence file after conversion:
0 d001 Marketing
15 d002 Finance
28 d003 Human Resources
Text converted back from the sequence file:
0 d001 Marketing
15 d002 Finance
28 d003 Human Resources
I want to remove the values 0, 15 and 28.
I am using the following code:
public class FormatConverterTextToSequenceDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Two parameters are required for FormatConverterTextToSequenceDriver-<input dir> <output dir>\n");
            return -1;
        }
        Job job = new Job(getConf());
        job.setJarByClass(FormatConverterTextToSequenceDriver.class);
        job.setJobName("Create Sequence File, from text file");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(FormatConverterMapper.class);
        // Text in (default input format), sequence file out
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Map-only job: mapper output goes straight to the output format
        job.setNumReduceTasks(0);
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}
-----------------------------------------------------------------
public class FormatConverterSequenceToTextDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Two parameters need to be supplied - <input dir> and <output dir>\n");
            return -1;
        }
        Job job = new Job(getConf());
        job.setJarByClass(FormatConverterSequenceToTextDriver.class);
        job.setJobName("Convert Sequence File and Output as Text");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Sequence file in, text out (default output format)
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(FormatConverterMapper.class);
        // Map-only job: mapper output goes straight to the output format
        job.setNumReduceTasks(0);
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}
-----------------------------------------------------------------
public class FormatConverterMapper extends
        Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}
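Any help is appreciated.
Best answer
When you convert back from the sequence file to text, you do not want to write out the LongWritable key that you stored alongside each line. So just change the write call in your map method to: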
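// Use this in a separate mapper for the sequence-to-text job. For the call to
// compile, the output types must change, for example (an assumed signature,
// not shown in the original answer) Mapper<LongWritable, Text, Text, NullWritable>.
@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Drop the offset key; TextOutputFormat then writes just the line.
    context.write(value, null);
}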
The output should then contain only the values themselves.
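The numbers are not junk: with the default text input format, the key passed to the mapper is the byte offset at which each line starts, and the original mapper writes that key back out. The following is a minimal plain-Java sketch, with no Hadoop dependency, that reproduces the offsets for the sample data and shows the effect of writing the value alone. The class and method names (`OffsetKeyDemo`, `lineOffsets`, `render`) are hypothetical, and the sketch assumes the input uses single-byte `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class OffsetKeyDemo {

    // Starting byte offset of every line. A line-oriented reader such as
    // Hadoop's TextInputFormat passes this offset to the mapper as the key.
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            offsets.add(offset);
            // +1 for the newline byte that terminates the line (assumed "\n")
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // Renders each record as "key<TAB>value" or, when includeKey is false,
    // as the value alone. This only imitates the shape of text output.
    static String render(String content, boolean includeKey) {
        String[] lines = content.split("\n");
        List<Long> offsets = lineOffsets(content);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (includeKey) {
                out.append(offsets.get(i)).append('\t');
            }
            out.append(lines[i]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "d001 Marketing\nd002 Finance\nd003 Human Resources\n";
        System.out.println(lineOffsets(input)); // prints [0, 15, 28]
        System.out.print(render(input, false)); // prints the three original lines
    }
}
```

The offsets 0, 15 and 28 match the prefixes in the question, which is why dropping the key in the sequence-to-text mapper is enough to restore the original lines.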
Regarding "java - Converting text to a sequence file with MapReduce creates junk characters", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27859556/