amazon-web-services - 当从 S3 读取时，为什么我的 LZO 索引在 Amazon EMR 上需要很长时间？

我在 S3 上有一个 30GB lzo 文件，我正在使用 hadoop-lzo 通过 Amazon EMR (AMI v2.4.2) 并使用区域 us-east1 为其建立索引。

elastic-mapreduce --create --enable-debugging \
    --ami-version "latest" \
    --log-uri s3n://mybucket/mylogs \
    --name "lzo index test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg s3://mybucket/my-30gb-file.lzo \
      --step-name "Index LZO files"

1% 的进度大约需要 10 分钟，因此一个文件完成大约需要 16 小时。进度显示仅读取了 80mb。

相比之下，使用相同的集群(当上述作业正在运行时)，我可以将文件从 S3 复制到本地硬盘，然后复制到 HDFS，最后运行索引器，总时间约为 10 分钟。同样，我的本地集群可以在大约 7 分钟内处理此问题。

过去，我相信我直接在 S3 上运行 LZO 索引，没有出现这么严重的延迟，尽管它是在早期的 AMI 版本上。我不知道我使用的是什么 AMI，因为我总是使用“最新”。 (更新:我尝试了 AMI v2.2.4，结果相同，所以可能是我记错了，或者是其他原因导致速度缓慢)

有什么想法会发生什么吗？

这是步骤日志输出的副本:

Task Logs: 'attempt_201401011330_0001_m_000000_0'


stdout logs



stderr logs



syslog logs

2014-01-01 13:32:39,764 INFO org.apache.hadoop.util.NativeCodeLoader (main): Loaded the native-hadoop library
2014-01-01 13:32:40,043 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
2014-01-01 13:32:40,120 INFO org.apache.hadoop.mapred.MapTask (main): Host name: ip-10-7-132-249.ec2.internal
2014-01-01 13:32:40,134 INFO org.apache.hadoop.util.ProcessTree (main): setsid exited with exit code 0
2014-01-01 13:32:40,138 INFO org.apache.hadoop.mapred.Task (main):  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5c785f0b
2014-01-01 13:32:40,943 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2014-01-01 13:32:41,104 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2014-01-01 13:32:41,104 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2014-01-01 13:32:41,121 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2014-01-01 13:32:41,121 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2014-01-01 13:32:41,314 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://mybucket/my-30gb-file.lzo' for reading
2014-01-01 13:32:41,478 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '63624'
2014-01-01 13:32:41,773 INFO com.hadoop.mapreduce.LzoIndexRecordWriter (main): Setting up output stream to write index file for s3://mybucket/my-30gb-file.lzo
2014-01-01 13:32:41,885 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index.tmp' but file does not exist, so returning false
2014-01-01 13:32:41,928 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index' but file does not exist, so returning false
2014-01-01 13:32:41,967 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://mybucket/my-30gb-file.lzo.index.tmp' in S3
2014-01-01 13:32:42,017 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '125908'
2014-01-01 13:32:42,227 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '187143'
2014-01-01 13:32:42,516 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '249733'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:34:14,991 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62004474'
2014-01-01 13:34:15,077 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 1000 at pos 61941702 of 39082185217. Read is 0.15865149907767773% done. 
2014-01-01 13:34:15,077 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62067843'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:35:37,849 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '123946504'
2014-01-01 13:35:37,911 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 2000 at pos 123882460 of 39082185217. Read is 0.31714322976768017% done. 
2014-01-01 13:35:37,911 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '124008849'
  ... (repeat of same "Stream for key" message)

我的解决方法

FWIW，我的解决方法是通过 distcp 将文件复制到 HDFS(见下文)。在我看来，这种缓慢似乎是 AWS 可以改进的问题。在下面的作业中，从 S3 复制到 HDFS 需要 17 分钟，而索引只需要 1 分钟。

elastic-mapreduce --create --enable-debugging --alive \
    --ami-version "latest" \
    --log-uri s3n://mybucket/logs/dailyUpdater \
    --name "daily updater test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg s3://mybucket/my-30gb-file.lzo \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Upload input file to HDFS" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Index LZO files" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg hdfs:///my-30gb-file.lzo.index \
      --arg s3://mybucket/my-30gb-file.lzo.index \
      --step-name "Upload index to S3"

最佳答案

在 s3 上的流中查找是作为带有字节范围 header 字段的 GET 实现的。这样的调用需要几百毫秒是非常合理的。由于索引过程似乎需要大量搜索，即使它们都是正向的，您实际上会重新打开文件数千次。

您的解决方法是正确的方法。 S3 针对顺序访问进行了优化，而不是随机访问。

关于amazon-web-services - 当从 S3 读取时，为什么我的 LZO 索引在 Amazon EMR 上需要很长时间？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/20865478/

amazon-web-services - 当从 S3 读取时，为什么我的 LZO 索引在 Amazon EMR 上需要很长时间？

我的解决方法

上一篇：scala - 在函数式编程中查找列表列表中元素的位置

下一篇：php - WordPress php函数没有返回值