amazon-web-services - How do I import JSON data from S3 using AWS Glue?

Tags: amazon-web-services amazon-s3 etl aws-glue

I have a large amount of data stored in JSON format in AWS S3. It is laid out like this:

s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...

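All of the files share the same schema. I want to load all of the JSON data into a single database table. I can't find a good tutorial that explains how to set this up.

Ideally, I would also be able to apply small "normalization" transformations to some columns.

I think Glue is the right choice, but I'm open to other options!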

Best answer

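If you need to process the data with Glue and do not need the table registered in the Glue Catalog, you don't need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing at the root folder (in your case ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). Inside the job you can also apply any transformations you need and then write the result to another S3 bucket, a relational DB, or the Glue Catalog.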
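The job described above could be sketched roughly as follows. The answer names getSourceWithFormat(), which is the Scala GlueContext method; this sketch assumes a PySpark job instead and uses create_dynamic_frame.from_options, which takes the same paths and recurse options. The store_name column, the normalize_record helper, and the Parquet output format are hypothetical choices made for illustration and do not come from the question. The sketch also assumes that the records handed to Map.apply behave like dicts.

```python
from typing import Any, Dict, Iterable, MutableMapping


def build_source_options(paths: Iterable[str]) -> Dict[str, Any]:
    """S3 connection options that read every object under the given prefixes."""
    return {"paths": list(paths), "recurse": True}


def normalize_record(record: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
    """Hypothetical clean-up: trim and lower-case a 'store_name' column."""
    name = record.get("store_name")  # assumes a dict-like record
    if isinstance(name, str):
        record["store_name"] = name.strip().lower()
    return record


def run_job(source_paths: Iterable[str], output_path: str) -> None:
    # Imported lazily: these modules are only available inside a Glue job.
    from awsglue.context import GlueContext
    from awsglue.transforms import Map
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read all JSON files under the given S3 prefixes, without a crawler.
    sales = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=build_source_options(source_paths),
        format="json",
    )

    # Apply the per-record normalization.
    cleaned = Map.apply(frame=sales, f=normalize_record)

    # Write the result to another S3 location (Parquet is an assumed choice).
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
```

In a Glue job script you would call something like run_job(["s3://my-bucket/"], "s3://my-output-bucket/sales/") at the end, where the output bucket name is a placeholder. Writing to a relational DB or the Glue Catalog instead would use a different sink call, which is not shown here.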

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/55262557/
