hadoop - Difference between Amazon S3 and S3n in Hadoop

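When I connected my Hadoop cluster to Amazon storage and downloaded files to HDFS, I found that s3:// did not work. While looking for help on the Internet, I found that I could use S3n, and when I used S3n it worked. I do not understand the difference between using S3 and S3n with my Hadoop cluster. Can someone explain?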

Best answer

The two filesystems for using Amazon S3 are documented in the respective Hadoop wiki page addressing Amazon S3:

S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
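
The trade-offs quoted above can be summarised as a small lookup keyed by URI scheme. This is only an illustrative sketch: `S3FileSystemTraits`, `TRAITS_BY_SCHEME`, and `traits_for` are hypothetical names invented here and are not part of Hadoop; the values simply restate the wiki text.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class S3FileSystemTraits:
    """Properties of a Hadoop S3 filesystem, as described in the quoted wiki text."""
    name: str
    interoperable_with_other_tools: bool
    limited_to_5gb_files: bool
    needs_dedicated_bucket: bool


# Hypothetical lookup table; the values restate the wiki description only.
TRAITS_BY_SCHEME = {
    "s3n": S3FileSystemTraits("S3 Native FileSystem", True, True, False),
    "s3": S3FileSystemTraits("S3 Block FileSystem", False, False, True),
}


def traits_for(uri: str) -> S3FileSystemTraits:
    """Return the traits implied by the URI scheme (s3 or s3n)."""
    scheme = urlparse(uri).scheme
    if scheme not in TRAITS_BY_SCHEME:
        raise ValueError(f"not an s3:// or s3n:// URI: {uri!r}")
    return TRAITS_BY_SCHEME[scheme]
```

For example, `traits_for("s3n://my-bucket/logs/part-00000")` describes the native filesystem, whose files remain readable by other S3 tools; the bucket and path are made up for illustration.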

There are two ways that S3 can be used with Hadoop's Map/Reduce, either as a replacement for HDFS using the S3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files) or as a convenient repository for data input to and output from MapReduce, using either S3 filesystem. In the second case HDFS is still used for the Map/Reduce phase. [...]

[emphasis mine]

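So the difference is mainly related to how the 5GB limit is handled (5GB is the largest object that can be uploaded in a single PUT, even though objects can range in size from 1 byte to 5TB, see How much data can I store?): the S3 Block FileSystem (URI scheme: s3) works around the 5GB limit and can store files of up to 5TB, but in turn it replaces HDFS.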
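
The size rule in this answer can be sketched as a simple check. This is a rough illustration, not Hadoop code: `scheme_for_file` is a hypothetical helper, the limits are assumed to be binary units (5 × 1024³ and 5 × 1024⁴ bytes), and preferring s3n whenever the file fits is an assumption made here for the sake of interoperability.

```python
GIB = 1024 ** 3

# Assumption: the 5GB and 5TB figures are interpreted in binary units.
SINGLE_PUT_LIMIT = 5 * GIB          # largest object uploadable in one PUT
MAX_OBJECT_SIZE = 5 * 1024 * GIB    # largest file size mentioned in the answer


def scheme_for_file(size_bytes: int) -> str:
    """Suggest a URI scheme for a file of the given size.

    Hypothetical rule: use s3n while the file fits the single-PUT limit
    (other tools can still read it), otherwise fall back to the block
    filesystem (s3).
    """
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= SINGLE_PUT_LIMIT:
        return "s3n"
    if size_bytes <= MAX_OBJECT_SIZE:
        return "s3"
    raise ValueError("file exceeds the 5TB maximum mentioned above")
```

A 1 GiB file would map to s3n under this rule, while a 6 GiB file would need the block filesystem and therefore a bucket dedicated to it.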

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/10569455/
