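I am running a 16-node Hadoop cluster (version 1.2.0). One node, the master, has a public IP; the other 15 nodes (the slaves) are connected over a private network.
Is it possible to use a remote server (other than these 16 nodes) to store the mappers' output? The problem is that the nodes run out of disk space during the map phase, and I cannot compress the map output any further.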
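I know that mapred.local.dir in mapred-site.xml sets a comma-separated list of directories where temporary files are stored. Ideally, I would like to have one local directory (the default one) and one directory on the remote server. When the local disk fills up, I want to use the remote disk.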
Best answer
I am not sure about this, but the documentation ( http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml ) says:
The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
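As a rough sketch of what that could look like for the setup in the question: the fragment below lists two directories under mapred.local.dir, the property name the question uses for Hadoop 1.2.0. Both paths are hypothetical. It assumes the remote server's disk has already been mounted on every slave (for example over NFS) at /mnt/remote, since the property takes paths on the node's own filesystem. As far as I know, Hadoop spreads files across all listed directories rather than filling them in order, so this probably would not give a strict "use the remote disk only when the local one is full" behavior.

```xml
<!-- mapred-site.xml fragment; a sketch, not a tested configuration -->
<configuration>
  <property>
    <name>mapred.local.dir</name>
    <!-- Hypothetical paths: the first is a local disk, the second is
         assumed to be a mount point backed by the remote server. -->
    <value>/data/mapred/local,/mnt/remote/mapred/local</value>
  </property>
</configuration>
```

Intermediate data written to the second directory would travel over the network, so map and shuffle performance would likely suffer.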
You should also look at some other properties. These might help:
mapreduce.tasktracker.local.dir.minspacestart: If the space in mapreduce.cluster.local.dir drops under this, do not ask for more tasks. Value in bytes
mapreduce.tasktracker.local.dir.minspacekill: If the space in mapreduce.cluster.local.dir drops under this, do not ask more tasks until all the current ones have finished and cleaned up. Also, to save the rest of the tasks we have running, kill one of them, to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes.
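A minimal sketch of setting these two thresholds follows. The names quoted above come from the current documentation; on Hadoop 1.x I believe the equivalent properties are mapred.local.dir.minspacestart and mapred.local.dir.minspacekill, so those are used here because the cluster runs 1.2.0. The byte values are arbitrary placeholders, not recommendations.

```xml
<!-- mapred-site.xml fragment; a sketch with placeholder values -->
<configuration>
  <property>
    <!-- Assumed Hadoop 1.x name for
         mapreduce.tasktracker.local.dir.minspacestart -->
    <name>mapred.local.dir.minspacestart</name>
    <!-- 1 GiB: below this, the TaskTracker stops asking for new tasks -->
    <value>1073741824</value>
  </property>
  <property>
    <!-- Assumed Hadoop 1.x name for
         mapreduce.tasktracker.local.dir.minspacekill -->
    <name>mapred.local.dir.minspacekill</name>
    <!-- 512 MiB: below this, running tasks may be killed to free space -->
    <value>536870912</value>
  </property>
</configuration>
```

These settings do not add capacity. They only stop a node from accepting more work once its local directories are nearly full, which may keep tasks from failing partway through.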
Regarding hadoop - storing MapReduce intermediate output on a remote server, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26648825/