hadoop - Difference between Amazon S3 and S3n in Hadoop

Tags: hadoop amazon-s3 hdfs

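When I connected my Hadoop cluster to Amazon storage and tried to download files to HDFS, I found that s3:// did not work. Searching the Internet for help, I found that I could use S3n instead, and when I used S3n it worked. I do not understand the difference between using S3 and S3n with my Hadoop cluster. Can someone explain?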

Best answer

The two filesystems for using Amazon S3 are documented in the respective Hadoop wiki page addressing Amazon S3:

  • S3 Native FileSystem (URI scheme: s3n)
    A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

  • S3 Block FileSystem (URI scheme: s3)
    A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

There are two ways that S3 can be used with Hadoop's Map/Reduce, either as a replacement for HDFS using the S3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files) or as a convenient repository for data input to and output from MapReduce, using either S3 filesystem. In the second case HDFS is still used for the Map/Reduce phase. [...]

[emphasis mine]

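So the difference mainly comes down to how the 5GB limit is handled. That limit is the largest object that can be uploaded in a single PUT, even though objects can range in size from 1 byte to 5TB (see How much data can I store?). Using the S3 Block FileSystem (URI scheme: s3) works around the 5GB limit and lets you store files of up to 5TB, but in turn it replaces HDFS.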
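To make the size rule concrete, here is a minimal Python sketch of the decision described above. It is only an illustration: the function name `usable_schemes` is hypothetical, the limits are the ones quoted in the answer, and treating GB and TB as binary units (1024-based) is an assumption.

```python
# Limits as stated in the answer; binary units are an assumption.
GB = 1024 ** 3
TB = 1024 ** 4

SINGLE_PUT_LIMIT = 5 * GB  # largest object uploadable in a single PUT
MAX_OBJECT_SIZE = 5 * TB   # upper bound the answer gives for stored files


def usable_schemes(file_size_bytes):
    """Return the URI schemes that could hold a file of this size,
    following the answer's description (hypothetical helper)."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    schemes = []
    # s3n stores each file as one regular S3 object, so the 5GB limit applies.
    if file_size_bytes <= SINGLE_PUT_LIMIT:
        schemes.append("s3n")
    # s3 splits files into blocks, so larger files fit, up to 5TB.
    if file_size_bytes <= MAX_OBJECT_SIZE:
        schemes.append("s3")
    return schemes


if __name__ == "__main__":
    for size in (1 * GB, 6 * GB, 6 * TB):
        print(size, usable_schemes(size))
```

Size is only one criterion. Per the quoted wiki text, files written through s3 are not readable by other S3 tools and need a dedicated bucket, so s3n remains the natural choice for data that other tools produce or consume.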

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/10569455/
