apache - Hadoop 2.6.4 and large files

Tags: apache hadoop hdfs

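I'm new to Apache Hadoop, and there is one thing I don't understand. I have a simple cluster (3 nodes). Each node has about 30 GB of free space. On Hadoop's Overview web page I see DFS Remaining: 90.96 GB. I set the replication factor to 1.

Then I created a 50 GB file and tried to upload it to HDFS, but it ran out of space. Why? Can't I upload a file that is larger than the free space on a single node of the cluster?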

Best answer

According to Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack. This logic makes sense as it decreases the network chatter between the different nodes.
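The placement rules in the quote can be sketched as a small simulation. This is only an illustration of the quoted text, not HDFS code: the function name `place_replicas`, the `racks` mapping, and the node names are hypothetical, and the fallbacks used when no node satisfies a rule (for example, a cluster with a single rack) are assumptions of the sketch.

```python
import random

def place_replicas(client, racks, replication, rng):
    """Pick nodes for one block, following the quoted strategy (simplified)."""
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    nodes = sorted(node_rack)
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the node count")

    # First replica: the client's own node if it is in the cluster, else random.
    first = client if client in node_rack else rng.choice(nodes)
    chosen = [first]

    if replication >= 2:
        # Second replica: a random node on a different rack (off-rack).
        off_rack = [n for n in nodes if node_rack[n] != node_rack[first]]
        others = [n for n in nodes if n != first]
        # Assumption: with only one rack, fall back to any other node.
        chosen.append(rng.choice(off_rack or others))

    if replication >= 3:
        # Third replica: same rack as the second, but a different node.
        unused = [n for n in nodes if n not in chosen]
        same_rack = [n for n in unused if node_rack[n] == node_rack[chosen[1]]]
        chosen.append(rng.choice(same_rack or unused))

    # Further replicas: random nodes that do not hold the block yet.
    while len(chosen) < replication:
        chosen.append(rng.choice([n for n in nodes if n not in chosen]))
    return chosen

racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
rng = random.Random(0)
print(place_replicas("node1", racks, 1, rng))  # ['node1']
print(place_replicas("node1", racks, 3, rng))  # node1, then two distinct nodes on rack2
```

With a replication factor of 1 and a client running on a cluster node, the sketch always returns the client's node, which is the case the question describes.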

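I think it depends on whether the client is itself a Hadoop node. If the client is a Hadoop node, all the splits (blocks) will be placed on that same node, so having several nodes in the cluster does not improve read/write throughput. With a replication factor of 1, that would also explain the question: a 50 GB file written from a cluster node is aimed at that single node, which has only about 30 GB free. If the client is not a Hadoop node, a node is chosen at random for each split, so the splits are spread across the nodes of the cluster, and that does give better read/write throughput.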
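The arithmetic behind this can be checked with a rough simulation of the numbers from the question. It is a sketch under stated assumptions: a 128 MB block size (the Hadoop 2.x default), exactly 30 GB free on each of the three nodes, and strictly local placement when the client is a DataNode. The names `upload`, `node1`, `node2`, and `node3` are hypothetical, and a real HDFS cluster may behave differently, for instance by choosing another DataNode when the local one is full; that fallback is not modelled here.

```python
import random

BLOCK_MB = 128             # assumed block size (Hadoop 2.x default)
FILE_MB = 50 * 1024        # the 50 GB file from the question
NODE_FREE_MB = 30 * 1024   # about 30 GB free per node

def upload(client_is_datanode, seed=0):
    """Return (blocks_written, blocks_total) for a replication-factor-1 upload."""
    rng = random.Random(seed)
    free = {"node1": NODE_FREE_MB, "node2": NODE_FREE_MB, "node3": NODE_FREE_MB}
    total = -(-FILE_MB // BLOCK_MB)  # ceiling division
    written = 0
    for _ in range(total):
        if client_is_datanode:
            # Strictly local placement, as the answer describes (no fallback).
            target = "node1"
        else:
            # External client: a random node that still has room for a block.
            candidates = [n for n, mb in free.items() if mb >= BLOCK_MB]
            if not candidates:
                break
            target = rng.choice(candidates)
        if free[target] < BLOCK_MB:
            break  # the chosen node is full, so the upload stops here
        free[target] -= BLOCK_MB
        written += 1
    return written, total

print(upload(client_is_datanode=True))   # (240, 400)
print(upload(client_is_datanode=False))  # (400, 400)
```

Under these assumptions the local upload stops after 240 of 400 blocks (30 GB), while the upload from an external client fits, because the three nodes together can hold 720 blocks.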

Source: this question about Hadoop 2.6.4 and large files comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/36567568/
