amazon-web-services - Amazon EMR: best compression/file format

Tags: amazon-web-services hadoop compression apache-pig amazon-emr

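We currently have some files stored on S3. They are log files (.log extension, but plain-text content) that were gzipped to save disk space. However, gzip is not splittable, so we are now looking for good alternatives for storing and processing our files on Amazon EMR.

So what is the best compression or file format to use for log files? I have come across Avro, SequenceFile, bzip2, LZO and Snappy. That is a lot of options, and I am a bit overwhelmed.

I would appreciate any insight you have on this.

The data will be used in Pig jobs (map/reduce jobs).

Kind regards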

Best answer

If you check Best Practices for Amazon EMR, there is a section that discusses compressing outputs:

Compress mapper outputs–Compression means less data written to disk, which improves disk I/O. You can monitor how much data written to disk by looking at FILE_BYTES_WRITTEN Hadoop metric. Compression can also help with the shuffle phase where reducers pull data. Compression can benefit your cluster HDFS data replication as well. Enable compression by setting mapred.compress.map.output to true. When you enable compression, you can also choose the compression algorithm. LZO has better performance and is faster to compress and decompress.
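Since the data is going to be processed with Pig, the setting quoted above could be applied at the top of a Pig script roughly as follows. This is only a sketch: it assumes the hadoop-lzo library (which provides com.hadoop.compression.lzo.LzoCodec) is installed on the cluster, and it uses the older mapred.* property names from the quote. On newer Hadoop versions these names are deprecated in favor of mapreduce.map.output.compress and mapreduce.map.output.compress.codec.

```pig
-- Compress intermediate map output, as the quoted best practice suggests.
SET mapred.compress.map.output true;

-- Choose the codec for the compressed map output.
-- Assumption: hadoop-lzo is available on the cluster; otherwise substitute
-- a codec that is installed, e.g. org.apache.hadoop.io.compress.SnappyCodec.
SET mapred.map.output.compression.codec com.hadoop.compression.lzo.LzoCodec;
```

Note that this only affects the data shuffled between mappers and reducers. It does not change how the gzipped input files on S3 are split.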

Regarding amazon-web-services - Amazon EMR: best compression/file format, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23251118/
