hadoop - Are getCacheFiles() and getLocalCacheFiles() the same?

Tags: hadoop mapreduce hdfs distributed-cache

Since getLocalCacheFiles() is deprecated, I am trying to find an alternative. getCacheFiles() seems to be a candidate, but I doubt whether the two are the same.

When you call addCacheFile(), the file in HDFS is downloaded to every node. With getLocalCacheFiles() you get the localized file paths, so you can read the copies from the local filesystem. getCacheFiles(), however, returns the URIs of the files in HDFS; if you open a file through such a URI, I suspect you are still reading from HDFS rather than from the local filesystem.
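To make the contrast concrete, here is a minimal sketch of the two accessors inside a mapper's setup(); the class name and the example paths in the comments are placeholders, not anything from the original question:

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheDemoMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // getCacheFiles(): URIs of the original files in HDFS,
            // e.g. hdfs://namenode:8020/data/lookup.txt (placeholder path).
            // Opening these goes back to HDFS, not to the node-local copy.
            URI[] hdfsUris = context.getCacheFiles();

            // getLocalCacheFiles() (deprecated): paths of the localized
            // copies on this node's local disk,
            // e.g. somewhere under the NodeManager's local dirs.
            @SuppressWarnings("deprecation")
            Path[] localPaths = context.getLocalCacheFiles();
        }
    }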

That is my understanding, but I am not sure whether it is correct. If it is, what is the replacement for getLocalCacheFiles(), and why did Hadoop deprecate it in the first place?

Best answer

It is open source, so you can always git blame the change that introduced @Deprecated: commit 735b50e8bd23f7fbeff3a08cf8f3fff8cbff7449, which belongs to MAPREDUCE-4493. At the end of that JIRA ticket you will find this discussion:

Omkar Vinit Joshi added a comment - 13/Jul/13 00:18
Robert Joseph Evans if we are deprecating getLocalCacheFiles and getCacheFiles in jobContext() then how the user is going to get local cached files in map task? YARN-916 is the related issue.. Thanks.

Robert Joseph Evans added a comment - 19/Jul/13 15:27
Omkar Vinit Joshi By opening the symbolic link in the current working directory. Prior to YARN the default behavior was to not create symlinks in the current working directory pointing to the items in the distributed cache. If you wanted links you had to specifically turn that option on and provide the name of the symlink you wanted. The only way to get to files without symlinks was to call getLocalCacheFiles and getCacheFiles. In YARN all files will have a symlink created. The name of the file/directory will be the name of the symlink. However, it is possible to have a name collision where I wanted hdfs://foo/bar.zip and hdfs://bar/bar.zip. In 1.0 both of these would have been downloaded and accessible through the deprecated APIs, but in YARN a warning will be output and only one of them will be downloaded. Also because of the way these APIs were written the mapper code may not know that only one of them was downloaded and will not be able to find the missing one and blow up. That is why I deprecated them in favor of nudging people to always use the symlinks so the behavior is always consistent.

Omkar Vinit Joshi added a comment - 19/Jul/13 16:56
Robert Joseph Evans sounds good.. however by this we will be putting limitation based on file name ..but that sounds reasonable considering the fact that this will stop potential bugs in map code and users can definitely version them to avoid it... Thanks...

So you should just open the file by name, and it will be there. There is no dedicated API.
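A minimal sketch of that pattern, assuming the mapreduce Job API (the HDFS path and the symlink name "lookup" are placeholders): in the driver, the URI fragment after '#' names the symlink that YARN creates in each task's working directory, and in the task you open that name with plain java.io:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.net.URISyntaxException;

    import org.apache.hadoop.mapreduce.Job;

    public class SymlinkCacheExample {

        // Driver side: the '#lookup' fragment sets the symlink name.
        public static void configure(Job job) throws URISyntaxException {
            job.addCacheFile(new URI("hdfs://namenode:8020/data/lookup.txt#lookup"));
        }

        // Task side (e.g. inside Mapper.setup()): open the symlink by
        // name -- the localized copy sits behind it, no Hadoop API needed.
        public static void readCachedFile() throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process each line of the cached file
                }
            }
        }
    }

Giving each cache file its own fragment also sidesteps the name collision Robert describes: hdfs://foo/bar.zip and hdfs://bar/bar.zip can be registered under distinct symlink names instead of both competing for bar.zip.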

The original question, "hadoop - Are getCacheFiles() and getLocalCacheFiles() the same?", can be found on Stack Overflow: https://stackoverflow.com/questions/26492964/
