java - Hadoop, MapReduce - Multiple input/output paths

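When I build the Jar for my MapReduce job and supply its input files, I run it with the Hadoop-local command. Is there a way to pass every file in my input folder to the job, instead of specifying the path of each file individually? Because of the nature of the job I am trying to configure, both the number and the contents of the files can change, so I do not know in advance how many files there will be. Can I pass all the files in the input folder to my MapReduce program, iterate over each file to compute some function, and then send the results to the Reducer? I am using a single Map/Reduce program and I am coding in Java. I could use the hadoop-moonshot command, but at the moment I am using hadoop-local.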

Thanks.

Best answer

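You do not have to list individual files as the input to a MapReduce job. An input path can also be a directory, in which case the files directly inside it are used as input.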

The FileInputFormat class already provides an API that accepts a list of multiple paths as input to a MapReduce program.

public static void setInputPaths(Job job,
                 Path... inputPaths)
                          throws IOException

Set the array of Paths as the list of inputs for the map-reduce job. Parameters:

job - The job to modify

inputPaths - The Paths of the input directories/files for the map-reduce job.
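A minimal sketch of how setInputPaths could be used when the number of inputs is not known in advance. The class name MultiPathDriver and the helper configure are hypothetical and exist only for illustration; the sketch assumes every string passed in is an input file or directory.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

class MultiPathDriver {

    // Hypothetical helper: treats every given string as one input path.
    static Job configure(Configuration conf, String[] inputs) throws IOException {
        Job job = Job.getInstance(conf, "multi-path job");

        // Convert the strings to Path objects, however many there are.
        Path[] paths = new Path[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            paths[i] = new Path(inputs[i]);
        }

        // Replaces any previously configured inputs with exactly these paths.
        FileInputFormat.setInputPaths(job, paths);
        return job;
    }
}
```

Note that setInputPaths overwrites the current input list, whereas addInputPath appends a single path to it.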

Sample code from the Apache tutorial:

Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
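For the case in the question, the path passed to addInputPath can simply be the input folder itself. The sketch below assumes that; InputDirDriver, configure, and the parameter names are hypothetical. With the default settings, files directly inside the folder are read and files whose names start with "_" or "." are skipped; nested subdirectories may need extra configuration such as FileInputFormat.setInputDirRecursive.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

class InputDirDriver {

    // Hypothetical helper: inputDir is a folder, not a single file.
    static Job configure(Configuration conf, String inputDir, String outputDir)
            throws IOException {
        Job job = Job.getInstance(conf, "all files in a folder");

        // One path is enough: the files inside the folder become the job input,
        // no matter how many there are when the job starts.
        FileInputFormat.addInputPath(job, new Path(inputDir));

        // The output directory must not exist yet when the job is submitted.
        FileOutputFormat.setOutputPath(job, new Path(outputDir));
        return job;
    }
}
```

Input paths may also contain glob patterns, for example input/*.txt, if only some of the files in the folder should be processed.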

MultipleInputs provides the following API.

public static void addInputPath(Job job,
                Path path,
                Class<? extends InputFormat> inputFormatClass,
                Class<? extends Mapper> mapperClass)

Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
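As an illustration, the following sketch registers two input folders, each with its own mapper. All class names (TwoSourceDriver, SourceAMapper, SourceBMapper) and the tagging logic are hypothetical; the only assumption about the data is that both folders contain plain text.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

class TwoSourceDriver {

    // Hypothetical mapper for the first folder: tags each line with "A".
    public static class SourceAMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("A"), line);
        }
    }

    // Hypothetical mapper for the second folder: tags each line with "B".
    public static class SourceBMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("B"), line);
        }
    }

    static Job configure(Configuration conf, String dirA, String dirB) throws IOException {
        Job job = Job.getInstance(conf, "two sources");

        // Each path gets its own InputFormat and Mapper.
        MultipleInputs.addInputPath(job, new Path(dirA), TextInputFormat.class, SourceAMapper.class);
        MultipleInputs.addInputPath(job, new Path(dirB), TextInputFormat.class, SourceBMapper.class);

        // Both mappers must emit the same key/value types for the shared reducer.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        return job;
    }
}
```

MultipleInputs is only needed when different inputs require different parsing or mapping logic; if every file is handled the same way, FileInputFormat alone is sufficient.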

For your second question, about multiple output paths, refer to the MultipleOutputs API.

FileOutputFormat.setOutputPath(job, outDir);

// Defines an additional text-based named output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
    LongWritable.class, Text.class);

// Defines an additional sequence-file-based named output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq",
    SequenceFileOutputFormat.class,
    LongWritable.class, Text.class);
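The named outputs defined above are written from inside the mapper or reducer. The reducer below is a rough sketch of that step; the class name RoutingReducer and the even/odd routing rule in namedOutputFor are hypothetical and chosen only to show the calls involved.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

class RoutingReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> outputs;

    // Hypothetical routing rule: even keys go to "text", odd keys to "seq".
    static String namedOutputFor(long key) {
        return (key % 2 == 0) ? "text" : "seq";
    }

    @Override
    protected void setup(Context context) {
        // Wrap the task context so records can be sent to the named outputs.
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = namedOutputFor(key.get());
        for (Text value : values) {
            // The name must match one registered with addNamedOutput in the driver.
            outputs.write(name, key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes and closes the writers of all named outputs.
        outputs.close();
    }
}
```

MultipleOutputs also offers a write(key, value, baseOutputPath) overload, which is one way to place results in different subfolders of the job output directory.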

See these related SE questions about multiple output files:

Writing to multiple folders in hadoop?

hadoop method to send output to multiple directories

The original question, "java - Hadoop, MapReduce - Multiple input/output paths", is on Stack Overflow: https://stackoverflow.com/questions/37229646/
