hadoop - Accessing WebHDFS on Hortonworks Hadoop (AWS EC2)

I am having trouble accessing WebHDFS on an Amazon EC2 machine. I have installed Hortonworks HDP 2.3.

From my local machine, I can retrieve the file status in a browser (Chrome) with the following HTTP request:

http://<serverip>:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS

This works fine, but if I try to open the file with ?op=OPEN, I am redirected to the machine's private DNS name, which I cannot reach:

http://<privatedns>:50075/webhdfs/v1/user/admin/file.csv?op=OPEN&namenoderpcaddress=<privatedns>:8020&offset=0
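To see exactly where such a redirect points, the redirect URL can be broken into its parts. The following is a minimal sketch using only the Python standard library; the function name `describe_redirect` and the host name `ip-10-0-0-1.ec2.internal` are hypothetical placeholders, not values from the cluster in question.

```python
from urllib.parse import urlsplit, parse_qs


def describe_redirect(location):
    """Split a WebHDFS redirect URL into the pieces relevant for debugging."""
    parts = urlsplit(location)
    query = parse_qs(parts.query)
    return {
        # Host and port the client is told to contact next (the DataNode).
        "datanode_host": parts.hostname,
        "datanode_port": parts.port,
        "path": parts.path,
        # First value of each query parameter, or None if it is absent.
        "op": query.get("op", [None])[0],
        "namenode_rpc": query.get("namenoderpcaddress", [None])[0],
    }


if __name__ == "__main__":
    # Hypothetical private DNS name standing in for <privatedns>.
    example = (
        "http://ip-10-0-0-1.ec2.internal:50075/webhdfs/v1/user/admin/file.csv"
        "?op=OPEN&namenoderpcaddress=ip-10-0-0-1.ec2.internal:8020&offset=0"
    )
    for key, value in describe_redirect(example).items():
        print(f"{key}: {value}")
```

If `datanode_host` is a name that only resolves inside the VPC, a client outside the VPC cannot follow the redirect, which matches the behavior described above.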

I also tried to access WebHDFS from the AWS machine itself with the following command:

[ec2-user@<ip> conf]$ curl -i http://localhost:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS
curl: (7) couldn't connect to host

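Does anyone know why I cannot connect to localhost, or why OPEN does not work from my local machine? Unfortunately, I could not find any tutorial on configuring WebHDFS for Amazon machines.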

Thanks in advance

Best answer

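What is happening is that the NameNode redirects you to the DataNode. It looks like you installed a single-node cluster, but conceptually the NameNode and the DataNode are distinct, and in your configuration the DataNode runs and listens on the private side of the EC2 VPC.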

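You could reconfigure the cluster to host the DataNode on the public IP/DNS (see HDFS Support for Multihomed Networks), but I would not do that. I think the proper solution is to add a Knox gateway, a component built specifically for accessing a private cluster through a public API. Specifically, you have to configure the DataNode URLs; see Chapter 5. Mapping the Internal Nodes to External URLs. The example there seems to fit your case: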

For example, when uploading a file with WebHDFS service:

The external client sends a request to the gateway WebHDFS service.

The gateway proxies the request to WebHDFS using the service URL.

WebHDFS determines which DataNodes to create the file on and returns the path for the upload as a Location header in a HTTP redirect, which contains the datanode host information.

The gateway augments the routing policy based on the datanode hostname in the redirect by mapping it to the externally resolvable hostname.

The external client continues to upload the file through the gateway.

The gateway proxies the request to the datanode by using the augmented routing policy.

The datanode returns the status of the upload and the gateway again translates the information without exposing any internal cluster details.
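The hostname-mapping step in the flow above can be illustrated with a small sketch. This is not Knox's implementation, and a real gateway also routes the request through its own URL; the sketch only shows the idea of swapping an internal DataNode host name for an externally resolvable one in a redirect. The function name `rewrite_location` and both host names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_location(location, host_map):
    """Replace an internal host in a redirect URL with its external name.

    host_map maps internal host names to externally resolvable ones.
    URLs whose host is not in the map are returned unchanged.
    """
    parts = urlsplit(location)
    external = host_map.get(parts.hostname)
    if external is None:
        return location
    # Keep the original port, if one was given.
    netloc = external if parts.port is None else f"{external}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Hypothetical internal-to-external mapping for a single DataNode.
    mapping = {
        "ip-10-0-0-1.ec2.internal": "ec2-203-0-113-10.compute-1.amazonaws.com"
    }
    print(
        rewrite_location(
            "http://ip-10-0-0-1.ec2.internal:50075/webhdfs/v1/user/admin/file.csv"
            "?op=OPEN&offset=0",
            mapping,
        )
    )
```

Only the host part of the URL is rewritten here; query parameters are passed through untouched.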

Source: the original question, "hadoop - Accessing WebHDFS on Hortonworks Hadoop (AWS EC2)", on Stack Overflow: https://stackoverflow.com/questions/35311308/
