目标:
- 我希望能够指定输入文件中使用的映射器数量
- 同样,我想指定每个映射器将占用的文件行数
简单示例:
对于 10 行的输入文件(长度不等;下面的示例),我希望有 2 个映射器——因此每个映射器将处理 5 行。
This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words
这是我的:
(我有它,以便每个映射器生成一个“
package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;
public class Test {
// prduce one "<map,1>" pair per mapper
public static class Map extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text("map"), one);
}
}
// reduce by taking a sum
public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job1 = Job.getInstance(conf, "pass01");
job1.setJarByClass(Test.class);
job1.setMapperClass(Map.class);
job1.setCombinerClass(Red.class);
job1.setReducerClass(Red.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
// // Attempt#1
// conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#2
// NLineInputFormat.setNumLinesPerSplit(job1, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#3
// conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#4
// conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
// conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);
System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
}
上面的代码,使用上面的示例数据,会产生
map 10
我想要的输出是
map 2
第一个映射器将在前 5 行执行操作,第二个映射器将在后 5 行执行操作。
最佳答案
你可以使用 NLineInputFormat .
使用 NLineInputFormat
功能,您可以准确指定应将多少行发送到映射器。
例如。如果您的文件有 500 行,并且您将每个映射器的行数设置为 10,则您有 50 个映射器
(而不是一个 - 假设文件小于 HDFS block 大小)。
编辑:
下面是一个使用 NLineInputFormat 的例子:
映射器类:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
驱动类:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}
使用您提供的输入,上述示例映射器的输出将写入两个文件,因为 2 个映射器已初始化:
部分-m-00001
0 This is
8 an arbitrary example file
34 of 10 lines.
47 Each line does
62 not have to be
部分-m-00002
77 of
80 the same
89 length or contain
107 the same
116 number of words
关于java - MapReduce:如何让映射器处理多行?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/26821652/