Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

Tags: amazon-s3 hive amazon-dynamodb amazon-emr

I need to do an initial upload of roughly 130 million items (5+ GB in total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.

Long story short, importing even a very modest (by EMR standards) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and a 700 MB test file did not finish within 12 hours).

I have already contacted Amazon Premium Support, but so far they have only told me that "for some reason the DynamoDB import is slow".

I tried the following statements in an interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

None of the various flags seem to have any visible effect. I tried the following settings instead of the defaults:
SET dynamodb.throughput.write.percent = 1.0;
SET dynamodb.throughput.read.percent = 1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks = 100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max = 100;
SET hive.exec.reducers.min = 50;

The same commands run against an HDFS target instead of DynamoDB complete in seconds.
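
For reference, the HDFS comparison can be reproduced with a sketch along these lines (the table name hdfs_target is illustrative, not the one I actually used; the column layout mirrors ddb_target):

-- Plain HDFS-backed table with the same layout as the DynamoDB target
CREATE TABLE hdfs_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Same load statement, pointed at HDFS instead of DynamoDB
INSERT OVERWRITE TABLE hdfs_target SELECT * FROM test_medium;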

This seems like a simple task and a very basic use case, so I am really wondering what I am doing wrong here.

Best Answer

Here is the answer I recently got from AWS Support. Hope it helps someone facing a similar situation:

EMR workers are currently implemented as single-threaded workers, where each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes 1 write capacity unit (IOP).

This means that you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be less costly performance-wise (but the same price, if I understand it right). This is something we are aware of and will probably implement in the future in EMR. We can't offer a timeline, though.

As stated before, the main problem here is that your table in DynamoDB is reaching its provisioned throughput, so try to increase it temporarily for the import and then feel free to decrease it to whatever level you need afterwards.

This may sound a bit convenient, but there was a problem with the alerts while you were doing this, which is why you never received an alert. The problem has since been fixed.
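
Following that advice, the import session can be re-run along these lines. This is only a sketch and assumes the provisioned write capacity of my_ddb_table has already been raised temporarily (e.g. through the DynamoDB console) before the job starts:

-- Assumes my_ddb_table's provisioned write capacity was raised beforehand;
-- remember to lower it again once the load has finished.
SET dynamodb.endpoint = dynamodb.eu-west-1.amazonaws.com;
SET dynamodb.throughput.write.percent = 1.0;  -- let the job consume all of the provisioned writes

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;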

The original question, "Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow", is on Stack Overflow: https://stackoverflow.com/questions/10683136/
