hadoop - Can I set my Mapper's input to a HashMap instead of an input file?

Tags: hadoop mapreduce mapper

I'm trying to set up a MapReduce job that takes advantage of DynamoDB's parallel scan feature.

Basically, I want each Mapper class to take a tuple as its input value.

Every example I've seen so far sets the job up like this:

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Can I set the job's input format to a HashMap?

Best Answer

I think what you want is to read the file as key-value pairs, rather than the standard way, in which each InputSplit is read with the byte offset as the key and the line as the value. If that's the case, you can use KeyValueTextInputFormat. The following description comes from Hadoop: The Definitive Guide:
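As a minimal sketch, a driver using KeyValueTextInputFormat might look like the following. The class and path names are placeholders; note that in the newer MapReduce API the separator property is named mapreduce.input.keyvaluelinerecordreader.key.value.separator rather than the older key.value.separator.in.input.line mentioned in the book:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Separator between key and value on each input line (tab is the default,
        // shown here explicitly). Property name is for the new (mapreduce.*) API.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "key-value input example");
        job.setJarByClass(KeyValueDriver.class);

        // Each record reaches the mapper as (Text key, Text value) instead of
        // (LongWritable offset, Text line).
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // job.setMapperClass(...); // your mapper taking Text/Text input

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this input format, a mapper declared as `Mapper<Text, Text, ...>` receives each line's pre-tab portion as the key, which is closer to the tuple-per-mapper shape you describe.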

KeyValueTextInputFormat
TextInputFormat's keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the key.value.separator.in.input.line property. It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
Like in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
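The record reader's behavior on the quoted input can be illustrated with a small, self-contained sketch (this simulates the tab split in plain Java; it is not Hadoop's actual KeyValueLineRecordReader, which splits at the first occurrence of the configured separator):

```java
import java.util.AbstractMap;
import java.util.Map;

public class KeyValueSplitDemo {
    // Split a line at the first tab: text before it is the key,
    // everything after it is the value (mirroring KeyValueTextInputFormat).
    public static Map.Entry<String, String> split(String line) {
        int i = line.indexOf('\t');
        if (i < 0) {
            // No separator: the whole line becomes the key, value is empty.
            return new AbstractMap.SimpleEntry<>(line, "");
        }
        return new AbstractMap.SimpleEntry<>(line.substring(0, i), line.substring(i + 1));
    }

    public static void main(String[] args) {
        String[] lines = {
            "line1\tOn the top of the Crumpetty Tree",
            "line2\tThe Quangle Wangle sat,",
            "line3\tBut his face you could not see,",
            "line4\tOn account of his Beaver Hat."
        };
        for (String line : lines) {
            Map.Entry<String, String> kv = split(line);
            System.out.println("(" + kv.getKey() + ", " + kv.getValue() + ")");
        }
        // Prints:
        // (line1, On the top of the Crumpetty Tree)
        // (line2, The Quangle Wangle sat,)
        // (line3, But his face you could not see,)
        // (line4, On account of his Beaver Hat.)
    }
}
```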

The original question, "hadoop - Can I set my Mapper's input to a HashMap instead of an input file?", can be found on Stack Overflow: https://stackoverflow.com/questions/16995617/
