amazon-web-services - Partitioning an Athena table from a Glue CloudFormation template

Tags: amazon-web-services partitioning amazon-athena aws-glue

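With AWS::Glue::Table you can set up an Athena table. Athena supports partitioning data based on the folder structure in S3. I would like to partition my Athena table from my Glue template.

From the AWS Glue Table TableInput documentation, it looks like I can use PartitionKeys to partition my data, but when I try the following template, Athena fails and does not return any data.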

Resources:
  ...

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref MyGlueDatabase
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: my-glue-table
        Parameters: { "classification" : "json" }
        PartitionKeys:
          - {Name: dt, Type: string}
        StorageDescriptor:
          Location: "s3://elasticmapreduce/samples/hive-ads/tables/impressions/"
          InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
          OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
          SerdeInfo:
            Parameters: { "separatorChar" : "," }
            SerializationLibrary: "org.apache.hive.hcatalog.data.JsonSerDe"
          StoredAsSubDirectories: false
          Columns:
            - {Name: requestBeginTime, Type: string}
            - {Name: adId, Type: string}
            - {Name: impressionId, Type: string}
            - {Name: referrer, Type: string}
            - {Name: userAgent, Type: string}
            - {Name: userCookie, Type: string}
            - {Name: ip, Type: string}
            - {Name: number, Type: string}
            - {Name: processId, Type: string}
            - {Name: browserCookie, Type: string}
            - {Name: requestEndTime, Type: string}
            - {Name: timers, Type: "struct<modellookup:string,requesttime:string>"}
            - {Name: threadId, Type: string}
            - {Name: hostname, Type: string}
            - {Name: sessionId, Type: string}


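How do I partition data in AWS Glue?

Best answer

Got it! It was quite painful: I had to run a Glue Crawler, which correctly creates the table with partitions, and then use the CLI to extract the right template parameters. Here is the template: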

AWSTemplateFormatVersion: 2010-09-09
Description: A partitioned Glue Table

Resources:
  MyGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      DatabaseInput:
        Name: my_glue_database
        Description: "Glue beats tape"
      CatalogId: !Ref AWS::AccountId

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref MyGlueDatabase
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: my_glue_table
        TableType: EXTERNAL_TABLE
        Parameters:
          CrawlerSchemaDeserializerVersion: "1.0"
          CrawlerSchemaSerializerVersion: "1.0"
          classification: json
          compressionType: none
          typeOfData: file
        PartitionKeys:
          - {Name: dt, Type: string}
        StorageDescriptor:
          BucketColumns: []
          Columns:
          - {Name: number, Type: string}
          - {Name: referrer, Type: string}
          - {Name: processid, Type: string}
          - {Name: adid, Type: string}
          - {Name: browsercookie, Type: string}
          - {Name: usercookie, Type: string}
          - {Name: requestendtime, Type: string}
          - {Name: impressionid, Type: string}
          - {Name: useragent, Type: string}
          - {Name: timers, Type: 'struct<modelLookup:string,requestTime:string>'}
          - {Name: threadid, Type: string}
          - {Name: ip, Type: string}
          - {Name: modelid, Type: string}
          - {Name: hostname, Type: string}
          - {Name: sessionid, Type: string}
          - {Name: requestbegintime, Type: string}
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://elasticmapreduce/samples/hive-ads/tables/impressions/
          NumberOfBuckets: -1
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Parameters: {CrawlerSchemaDeserializerVersion: '1.0', CrawlerSchemaSerializerVersion: '1.0',
            UPDATED_BY_CRAWLER: test, averageRecordSize: '644', classification: json,
            compressionType: none, objectCount: '241', recordCount: '1000109', sizeKey: '648533598',
            typeOfData: file}
          SerdeInfo:
            Parameters: {paths: 'adId,browserCookie,hostname,impressionId,ip,modelId,number,processId,referrer,requestBeginTime,requestEndTime,sessionId,threadId,timers,userAgent,userCookie'}
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
          SortColumns: []
          StoredAsSubDirectories: false


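CloudFormation will then deploy the table. At that point no partitions are registered yet, so you need to run the following in Athena: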

MSCK REPAIR TABLE my_glue_table;


This adds all the partitions, and you will see output like the following:

Repair: Added partition to metastore my_glue_table:dt=2009-04-12-13-00
Repair: Added partition to metastore my_glue_table:dt=2009-04-12-13-05
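
If only a few partitions are needed, or new dt folders arrive after the repair has run, a partition can also be registered explicitly instead of rescanning the whole table location. The following is a minimal sketch, not part of the original answer. It assumes each partition lives in a Hive-style dt=<value>/ folder directly under the table location, which is what the repair output above suggests; the exact S3 path is an assumption.

```sql
-- Register one partition explicitly.
-- Assumed folder layout: <table location>/dt=<value>/
ALTER TABLE my_glue_table ADD IF NOT EXISTS
  PARTITION (dt = '2009-04-12-13-00')
  LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-12-13-00/';

-- List the partitions currently registered for the table.
SHOW PARTITIONS my_glue_table;
```

Athena runs one statement per query, so submit the two statements separately, with my_glue_database selected as the current database.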


You can then run SQL against a partition, for example:

SELECT * FROM "my_glue_database"."my_glue_table" WHERE dt = '2009-04-14-13-00' LIMIT 10;
1   7663    cartoonnetwork.com  1178    SxRBJCmJBCLcfTS545t6qD1M8L64SC  nsdfvfvger  3VCLfFfF75BDgHgDoowHegOpkCivMJ  1239714024000   RTM6Vtrc1O3KX2FlUghUSiAQHiix8F  Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0) {modellookup=0.3538, requesttime=0.7532}    15  37.215.88.35    bxxiuxduad  ec2-50-32-48-14.amazon.com  BIBIlA7dgXc2eWekUJ6hSXa7p6dQEx  1239714024000   2009-04-14-13-00
2   17646   coursera.org    1255    Fskm4W6JKX6vf7UMaW55KObTJCtm1E  xftjotkexc  jH6DRWtkeH3tVg6c4mcLW36UW3LvqX  1239714027000   uQqO1fNoeM8KdesiVg86o4iK7FkqLt  Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322)  {modellookup=0.2986, requesttime=0.9616}    21  37.218.101.204  bxxiuxduad  ec2-50-32-48-14.amazon.com  OjgTQWOqHJopoWf9LpJ4We1UE7uJao  1239714026000   2009-04-14-13-00

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/50279209/
