hadoop - How is ${mapred.local.dir} chosen?

Tags: hadoop dictionary mapreduce hadoop-streaming

I have configured several ${mapred.local.dir} directories, mounted on different disks, to store the intermediate results of map tasks. My questions are:

1. Is LocalDirAllocator.java used to manage the ${mapred.local.dir} directories?

2. Is the getLocalPathForWrite() method of LocalDirAllocator.java used to select a ${mapred.local.dir} directory?

Best Answer

1. Is LocalDirAllocator.java used to manage the ${mapred.local.dir} directories?

Yes, the tasktracker uses LocalDirAllocator to manage the local directories/disks where intermediate data is stored. (The way it allocates space is given in the explanation below.)

2. Is the getLocalPathForWrite() method of LocalDirAllocator.java used to select a ${mapred.local.dir} directory?

LocalDirAllocator has three overloads of getLocalPathForWrite(). They round-robin through the set of disks (via the configured directories) and return the first path, in full, that has enough space; see the usage sketch below.
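For illustration, here is a minimal sketch of calling the size-aware overload. The directory paths and the relative file name are made-up examples, and the snippet assumes the classic (MRv1) org.apache.hadoop.fs.LocalDirAllocator API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.LocalDirAllocator;
    import org.apache.hadoop.fs.Path;

    public class AllocatorDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical local dirs mounted on two different disks.
            conf.set("mapred.local.dir", "/disk1/mapred/local,/disk2/mapred/local");

            // The context string is the config item whose dirs we allocate over.
            LocalDirAllocator allocator = new LocalDirAllocator("mapred.local.dir");

            // Ask for a path with at least 1 MB free; the allocator round-robins
            // the configured dirs and returns a path under the first one found
            // with sufficient space.
            Path spill = allocator.getLocalPathForWrite("intermediate/spill0.out",
                                                        1024L * 1024L, conf);
            System.out.println("Allocated: " + spill);
        }
    }

The overload without a size argument only checks that the chosen dir is writable, which matches the Javadoc note below about files whose size is not known a priori.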

Explanation, from the Javadoc of LocalDirAllocator.java:

An implementation of a round-robin scheme for disk allocation for creating files. The way it works is that it is kept track what disk was last allocated for a file write. For the current request, the next disk from the set of disks would be allocated if the free space on the disk is sufficient enough to accommodate the file that is being considered for creation. If the space requirements cannot be met, the next disk in order would be tried and so on till a disk is found with sufficient capacity. Once a disk with sufficient space is identified, a check is done to make sure that the disk is writable. Also, there is an API provided that doesn't take the space requirements into consideration but just checks whether the disk under consideration is writable (this should be used for cases where the file size is not known apriori). An API is provided to read a path that was created earlier. That API works by doing a scan of all the disks for the input pathname. This implementation also provides the functionality of having multiple allocators per JVM (one for each unique functionality or context, like mapred, dfs-client, etc.). It ensures that there is only one instance of an allocator per context per JVM.

Note:

  1. The contexts referred above are actually the configuration items defined in the Configuration class like "mapred.local.dir" (for which we want to control the dir allocations). The context-strings are exactly those configuration items.

  2. This implementation does not take into consideration cases where a disk becomes read-only or goes out of space while a file is being written to (disks are shared between multiple processes, and so the latter situation is probable).

  3. In the class implementation, "Disk" is referred to as "Dir", which actually points to the configured directory on the Disk which will be the parent for all file write/read allocations.
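To make that scheme concrete, here is a simplified, self-contained sketch of the round-robin selection the Javadoc describes; it is an illustration of the documented behavior, not Hadoop's actual code:

    import java.io.File;

    // Simplified illustration of round-robin disk selection over the
    // configured ${mapred.local.dir} entries. NOT Hadoop's implementation.
    public class RoundRobinDirPicker {
        private final String[] dirs; // the configured local dirs
        private int last = -1;       // dir used for the previous write

        public RoundRobinDirPicker(String[] dirs) {
            this.dirs = dirs;
        }

        // Starting from the dir after the last one used, return the first
        // writable dir with at least 'size' bytes free, or null if none.
        public synchronized File pick(long size) {
            for (int i = 1; i <= dirs.length; i++) {
                int candidate = (last + i) % dirs.length;
                File dir = new File(dirs[candidate]);
                if (dir.getUsableSpace() >= size && dir.canWrite()) {
                    last = candidate;
                    return dir;
                }
            }
            return null; // every configured dir is full or read-only
        }
    }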

I don't think we can directly override its behavior, unless we override the behavior of its dependents!

Regarding "hadoop - How is ${mapred.local.dir} chosen?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27100864/
