Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

Tags: amazon-s3 hive amazon-dynamodb amazon-emr

I need to do an initial upload of roughly 130 million items (5+ GB in total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.

Long story short, importing even a very modest (by EMR standards) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and a 700 MB test file did not finish within 12 hours).

I have already contacted Amazon Premium Support, but so far they have only told me that "for some reason the DynamoDB import is slow".

I tried the following statements in an interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

None of the various flags seem to have any visible effect. I tried the following settings instead of the defaults:
SET dynamodb.throughput.write.percent = 1.0;
SET dynamodb.throughput.read.percent = 1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks = 100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max = 100;
SET hive.exec.reducers.min = 50;

The same commands run against an HDFS target instead of DynamoDB complete in seconds.
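
For reference, the HDFS comparison can be reproduced with a sketch along these lines (the table name hdfs_target is illustrative, not the one I actually used; the column layout mirrors ddb_target):

-- Plain HDFS-backed table with the same layout as the DynamoDB target
CREATE TABLE hdfs_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Same load statement, pointed at HDFS instead of DynamoDB
INSERT OVERWRITE TABLE hdfs_target SELECT * FROM test_medium;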

This seems like a simple task and a very basic use case, so I am really wondering what I am doing wrong here.

Best Answer

Here is the answer I recently got from AWS Support. Hope it helps someone facing a similar situation:

EMR workers are currently implemented as single-threaded workers, where each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes 1 write capacity unit (IOP).

This means that you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be less costly performance-wise (but the same price, if I understand it right). This is something we are aware of and will probably implement in the future in EMR. We can't offer a timeline, though.

As stated before, the main problem here is that your table in DynamoDB is reaching its provisioned throughput, so try to increase it temporarily for the import and then feel free to decrease it to whatever level you need afterwards.

This may sound a bit convenient, but there was a problem with the alerts while you were doing this, which is why you never received an alert. The problem has since been fixed.
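
Following that advice, the import session can be re-run along these lines. This is only a sketch and assumes the provisioned write capacity of my_ddb_table has already been raised temporarily (e.g. through the DynamoDB console) before the job starts:

-- Assumes my_ddb_table's provisioned write capacity was raised beforehand;
-- remember to lower it again once the load has finished.
SET dynamodb.endpoint = dynamodb.eu-west-1.amazonaws.com;
SET dynamodb.throughput.write.percent = 1.0;  -- let the job consume all of the provisioned writes

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;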

The original question, "Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow", is on Stack Overflow: https://stackoverflow.com/questions/10683136/
