hadoop - Are getCacheFiles() and getLocalCacheFiles() the same?

Tags: hadoop mapreduce hdfs distributed-cache

Since getLocalCacheFiles() is deprecated, I am trying to find an alternative. getCacheFiles() seems to be a candidate, but I doubt whether the two are the same.

When you call addCacheFile(), the file in HDFS is downloaded to every node. With getLocalCacheFiles() you get the localized file paths, so you can read the copies from the local filesystem. getCacheFiles(), however, returns the URIs of the files in HDFS; if you open a file through such a URI, I suspect you are still reading from HDFS rather than from the local filesystem.
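To make the contrast concrete, here is a minimal sketch of the two accessors inside a mapper's setup(); the class name and the example paths in the comments are placeholders, not anything from the original question:

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheDemoMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // getCacheFiles(): URIs of the original files in HDFS,
            // e.g. hdfs://namenode:8020/data/lookup.txt (placeholder path).
            // Opening these goes back to HDFS, not to the node-local copy.
            URI[] hdfsUris = context.getCacheFiles();

            // getLocalCacheFiles() (deprecated): paths of the localized
            // copies on this node's local disk,
            // e.g. somewhere under the NodeManager's local dirs.
            @SuppressWarnings("deprecation")
            Path[] localPaths = context.getLocalCacheFiles();
        }
    }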

That is my understanding, but I am not sure whether it is correct. If it is, what is the replacement for getLocalCacheFiles(), and why did Hadoop deprecate it in the first place?

Best answer

It is open source, so you can always git blame the change that introduced @Deprecated: commit 735b50e8bd23f7fbeff3a08cf8f3fff8cbff7449, which belongs to MAPREDUCE-4493. At the end of that JIRA ticket you will find this discussion:

Omkar Vinit Joshi added a comment - 13/Jul/13 00:18
Robert Joseph Evans if we are deprecating getLocalCacheFiles and getCacheFiles in jobContext() then how the user is going to get local cached files in map task? YARN-916 is the related issue.. Thanks.

Robert Joseph Evans added a comment - 19/Jul/13 15:27
Omkar Vinit Joshi By opening the symbolic link in the current working directory. Prior to YARN the default behavior was to not create symlinks in the current working directory pointing to the items in the distributed cache. If you wanted links you had to specifically turn that option on and provide the name of the symlink you wanted. The only way to get to files without symlinks was to call getLocalCacheFiles and getCacheFiles. In YARN all files will have a symlink created. The name of the file/directory will be the name of the symlink. However, it is possible to have a name collision where I wanted hdfs://foo/bar.zip and hdfs://bar/bar.zip. In 1.0 both of these would have been downloaded and accessible through the deprecated APIs, but in YARN a warning will be output and only one of them will be downloaded. Also because of the way these APIs were written the mapper code may not know that only one of them was downloaded and will not be able to find the missing one and blow up. That is why I deprecated them in favor of nudging people to always use the symlinks so the behavior is always consistent.

Omkar Vinit Joshi added a comment - 19/Jul/13 16:56
Robert Joseph Evans sounds good.. however by this we will be putting limitation based on file name ..but that sounds reasonable considering the fact that this will stop potential bugs in map code and users can definitely version them to avoid it... Thanks...

So you should just open the file by name, and it will be there. There is no dedicated API.
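A minimal sketch of that pattern, assuming the mapreduce Job API (the HDFS path and the symlink name "lookup" are placeholders): in the driver, the URI fragment after '#' names the symlink that YARN creates in each task's working directory, and in the task you open that name with plain java.io:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.net.URISyntaxException;

    import org.apache.hadoop.mapreduce.Job;

    public class SymlinkCacheExample {

        // Driver side: the '#lookup' fragment sets the symlink name.
        public static void configure(Job job) throws URISyntaxException {
            job.addCacheFile(new URI("hdfs://namenode:8020/data/lookup.txt#lookup"));
        }

        // Task side (e.g. inside Mapper.setup()): open the symlink by
        // name -- the localized copy sits behind it, no Hadoop API needed.
        public static void readCachedFile() throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process each line of the cached file
                }
            }
        }
    }

Giving each cache file its own fragment also sidesteps the name collision Robert describes: hdfs://foo/bar.zip and hdfs://bar/bar.zip can be registered under distinct symlink names instead of both competing for bar.zip.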

The original question, "hadoop - Are getCacheFiles() and getLocalCacheFiles() the same?", can be found on Stack Overflow: https://stackoverflow.com/questions/26492964/
