hadoop - MultipleOutputFormat-Hadoop

Tags: hadoop, mapreduce

I'm fairly new to MapReduce, so it would be great if someone could guide me through the following questions.

  • I am using MultipleOutputFormat to split the output into separate files in MapReduce. Say my input file contains fruits and vegetables, so the output is split into two files, fruits and vegetables, as below.

    fruits-r-00000, vegetables-r-00000, part-r-00000

    I'm confused about how many reducers will run. I know the number of reducers defaults to 1, and since the numeric part of the file names is the same, I believe only one reducer ran. Is my understanding correct?
    Also, why is the part-r-00000 file created at all? I write all of my output to either the fruits file or the vegetables file. (A minimal sketch of this setup follows the questions below.)
  • If I have 1 GB of data to process, how should I decide the optimal number of reducers to use?
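    A minimal sketch of the setup described above, using the newer MultipleOutputs API (the question mentions MultipleOutputFormat from the old mapred API; the effect is the same). It assumes the mapper emits the category name ("fruits" or "vegetables") as the key and the record as the value; the class name is illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Assumed reducer: routes each record to the "fruits" or "vegetables"
    // named output instead of the default part-r-* file.
    public class CategoryReducer extends Reducer<Text, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text category, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            for (Text record : records) {
                // The named output becomes the file name prefix,
                // e.g. fruits-r-00000 or vegetables-r-00000.
                mos.write(category.toString(), NullWritable.get(), record);
            }
            // Nothing is written through context.write(), so the default
            // part-r-00000 is still created by the job's output format but stays empty.
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }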
  • Best answer

    One reducer will run; it has nothing to do with the "part" portion of the file name. The number of reducers is either specified by the user or, by default, worked out from the size of the input file and the amount of work that needs to be done in the reducers.
    
    part-r-00000: this is related to partitioning. Since we have only one reducer, all partitions point to this file.
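    The empty part-r-00000 exists because the job's default output format opens a part file for the reducer even when context.write() is never called. Below is a sketch of a driver that goes with the reducer above; it pins the reducer count with setNumReduceTasks, declares the two named outputs, and additionally (not part of the answer) uses LazyOutputFormat so that part files are only created when something is actually written to them. Class names and paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class CategoryDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split fruits and vegetables");
            job.setJarByClass(CategoryDriver.class);
            job.setReducerClass(CategoryReducer.class);
            job.setNumReduceTasks(1); // explicit here; 1 is also the default
            // job.setMapperClass(...) and the map output key/value classes
            // are omitted in this sketch.

            // Every mos.write("fruits", ...) / mos.write("vegetables", ...) call
            // in the reducer needs a matching named output declared here.
            MultipleOutputs.addNamedOutput(job, "fruits", TextOutputFormat.class,
                    NullWritable.class, Text.class);
            MultipleOutputs.addNamedOutput(job, "vegetables", TextOutputFormat.class,
                    NullWritable.class, Text.class);

            // Create part-r-* files lazily, so the empty part-r-00000 never appears.
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }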
    
    In most cases the number of reducers is specified by the user. It mostly depends on the amount of work that needs to be done in the reducers, but it should not be very large, because of the algorithm the mapper uses to distribute data among reducers. Some frameworks, such as Hive, can calculate the number of reducers using an empirical rule of about 1 GB of output per reducer.
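    As a rough illustration of that rule of thumb (the 1 GB-per-reducer target and the cap below are illustrative values, not Hadoop defaults; Hive exposes a similar knob through hive.exec.reducers.bytes.per.reducer): 1 GB of input with a ~1 GB-per-reducer target needs only a single reducer, while 10 GB would call for about ten.

    // Back-of-the-envelope estimate of the "bytes per reducer" heuristic.
    public class ReducerEstimate {

        static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
            long needed = (inputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
            return (int) Math.max(1, Math.min(needed, maxReducers));
        }

        public static void main(String[] args) {
            long oneGiB = 1024L * 1024 * 1024;
            System.out.println(estimateReducers(oneGiB, oneGiB, 999));      // 1 GB  -> 1 reducer
            System.out.println(estimateReducers(10 * oneGiB, oneGiB, 999)); // 10 GB -> 10 reducers
        }
    }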
    

    Regarding hadoop - MultipleOutputFormat-Hadoop, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26232216/
