amazon-web-services - Why does my LZO indexing take so long on Amazon EMR when reading from S3?

Tags: amazon-web-services amazon-s3 amazon-emr lzo hadoop-lzo

I have a 30 GB LZO file on S3, and I am indexing it with hadoop-lzo on Amazon EMR (AMI v2.4.2) in the us-east-1 region.

elastic-mapreduce --create --enable-debugging \
    --ami-version "latest" \
    --log-uri s3n://mybucket/mylogs \
    --name "lzo index test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg s3://mybucket/my-30gb-file.lzo \
      --step-name "Index LZO files"

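Each 1% of progress takes about 10 minutes, so a single file would take about 16 hours to finish. The progress display shows that only 80 MB have been read.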

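By contrast, on the same cluster (while the job above is still running), I can copy the file from S3 to the local disk, then into HDFS, and finally run the indexer, all in about 10 minutes total. Likewise, my local cluster can process this file in about 7 minutes.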

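In the past, I believe I ran LZO indexing directly against S3 without this kind of delay, although that was on an earlier AMI version. I don't know which AMI I was using, because I always use "latest". (Update: I tried AMI v2.2.4 with the same result, so either I am misremembering or something else is causing the slowness.)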

Any ideas what might be going on?

Here is a copy of the step's log output:

Task Logs: 'attempt_201401011330_0001_m_000000_0'


stdout logs



stderr logs



syslog logs

2014-01-01 13:32:39,764 INFO org.apache.hadoop.util.NativeCodeLoader (main): Loaded the native-hadoop library
2014-01-01 13:32:40,043 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
2014-01-01 13:32:40,120 INFO org.apache.hadoop.mapred.MapTask (main): Host name: ip-10-7-132-249.ec2.internal
2014-01-01 13:32:40,134 INFO org.apache.hadoop.util.ProcessTree (main): setsid exited with exit code 0
2014-01-01 13:32:40,138 INFO org.apache.hadoop.mapred.Task (main):  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5c785f0b
2014-01-01 13:32:40,943 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2014-01-01 13:32:41,104 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2014-01-01 13:32:41,104 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2014-01-01 13:32:41,121 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2014-01-01 13:32:41,121 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2014-01-01 13:32:41,314 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://mybucket/my-30gb-file.lzo' for reading
2014-01-01 13:32:41,478 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '63624'
2014-01-01 13:32:41,773 INFO com.hadoop.mapreduce.LzoIndexRecordWriter (main): Setting up output stream to write index file for s3://mybucket/my-30gb-file.lzo
2014-01-01 13:32:41,885 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index.tmp' but file does not exist, so returning false
2014-01-01 13:32:41,928 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index' but file does not exist, so returning false
2014-01-01 13:32:41,967 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://mybucket/my-30gb-file.lzo.index.tmp' in S3
2014-01-01 13:32:42,017 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '125908'
2014-01-01 13:32:42,227 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '187143'
2014-01-01 13:32:42,516 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '249733'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:34:14,991 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62004474'
2014-01-01 13:34:15,077 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 1000 at pos 61941702 of 39082185217. Read is 0.15865149907767773% done. 
2014-01-01 13:34:15,077 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62067843'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:35:37,849 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '123946504'
2014-01-01 13:35:37,911 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 2000 at pos 123882460 of 39082185217. Read is 0.31714322976768017% done. 
2014-01-01 13:35:37,911 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '124008849'
  ... (repeat of same "Stream for key" message)

My workaround

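FWIW, my workaround is to copy the file to HDFS via distcp (see below). In my view, this slowness is something AWS could improve. In the job below, copying from S3 to HDFS takes 17 minutes, while indexing takes only 1 minute.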

elastic-mapreduce --create --enable-debugging --alive \
    --ami-version "latest" \
    --log-uri s3n://mybucket/logs/dailyUpdater \
    --name "daily updater test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg s3://mybucket/my-30gb-file.lzo \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Upload input file to HDFS" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Index LZO files" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg hdfs:///my-30gb-file.lzo.index \
      --arg s3://mybucket/my-30gb-file.lzo.index \
      --step-name "Upload index to S3"

Best answer

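A seek within a stream on S3 is implemented as a GET with a byte-range header field. It is entirely plausible for such a call to take a few hundred milliseconds. Because the indexing process appears to require a great many seeks, even though they are all forward seeks, you are effectively reopening the file thousands of times.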
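To make the "seek as a ranged GET" idea concrete, here is a minimal Python sketch that builds (but does not send) the kind of HTTP request such a seek would translate to. The object URL is a hypothetical placeholder, a real request would also need AWS authentication, and this is only an illustration of the HTTP mechanism, not the actual NativeS3FileSystem code. The seek positions are the ones reported in the task log.

```python
from urllib.request import Request

# Hypothetical object URL (assumption); a real request would also need AWS authentication.
OBJECT_URL = "https://mybucket.s3.amazonaws.com/my-30gb-file.lzo"


def ranged_get(position: int) -> Request:
    """Build, without sending, a GET that starts reading at byte `position`."""
    return Request(OBJECT_URL, headers={"Range": f"bytes={position}-"})


# Seek positions copied from the "seeking to position" lines in the task log.
seek_positions = [63624, 125908, 187143, 249733]
requests = [ranged_get(p) for p in seek_positions]

for req in requests:
    # Each seek becomes a separate request, and so a separate network round trip.
    print(req.get_method(), req.get_header("Range"))
```

Each entry in the list stands for one new request, which is why a forward seek of only about 60 KB still pays the full per-request latency.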

Your workaround is the right approach. S3 is optimized for sequential access, not random access.
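The figures in the task log are enough for a back-of-the-envelope check of this explanation. The sketch below uses only the timestamps, block positions, and file size from the two "Reading block" log lines plus the 17-minute copy time from the workaround. It assumes the per-block cost stays constant over the whole file, which the log does not guarantee.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Values copied from the "Reading block 1000" and "Reading block 2000" log lines.
file_size = 39082185217
t_block_1000 = datetime.strptime("2014-01-01 13:34:15,077", FMT)
t_block_2000 = datetime.strptime("2014-01-01 13:35:37,911", FMT)
pos_block_1000 = 61941702
pos_block_2000 = 123882460

# Average cost and size of one LZO block over those 1000 blocks.
seconds_per_block = (t_block_2000 - t_block_1000).total_seconds() / 1000
bytes_per_block = (pos_block_2000 - pos_block_1000) / 1000

# Extrapolate to the whole file (assumes a constant per-block cost).
total_blocks = file_size / bytes_per_block
estimated_hours = total_blocks * seconds_per_block / 3600

# Compare with the 17-minute sequential copy reported in the workaround.
copy_minutes = 17
slowdown = estimated_hours * 60 / copy_minutes

print(f"{seconds_per_block * 1000:.0f} ms per block, {total_blocks:,.0f} blocks")
print(f"estimated indexing time: {estimated_hours:.1f} h ({slowdown:.0f}x the copy)")
```

Under that assumption, each block costs tens of milliseconds and there are several hundred thousand blocks, so the total lands in the same range as the 16-hour estimate in the question, while one sequential pass over the same bytes finishes in minutes.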

Regarding "amazon-web-services - Why does my LZO indexing take so long on Amazon EMR when reading from S3?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20865478/
