amazon-web-services - Amazon Athena不会解析Cloudfront日志

标签 amazon-web-services amazon-athena

我正在关注Athena getting started guide并尝试解析自己的Cloudfront日志。但是,没有解析这些字段。

我使用了一个小的测试文件,如下所示:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02  07:57:45    LHR5    5001    86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /foo    404 -   Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Error   -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA==    d3g47gpj5mj0b.cloudfront.net    https421    0.076   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02  07:57:45    LHR5    1158241 86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo    Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Miss    oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA==    d3g47gpj5mj0b.cloudfront.net    https   419 0.440   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss

并使用此SQL创建表:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
  ) LOCATION 's3://test/athena-csv/'

但是没有数据返回:

athena screen shot with no data

我可以看到它返回了4行,但是前2行应排除在外,因为它们以#开头,因此就像正则表达式未正确解析一样。

难道我做错了什么?还是正则表达式是错误的(似乎不太可能,因为它在文档中,并且对我来说很好)?

最佳答案

这就是我最终得到的结果:

CREATE EXTERNAL TABLE logs (
  `date` date,
  `time` string,
  `location` string,
  `bytes` int,
  `request_ip` string,
  `method` string,
  `host` string,
  `uri` string,
  `status` int,
  `referer` string,
  `useragent` string,
  `uri_query` string,
  `cookie` string,
  `edge_type` string,
  `edget_requiest_id` string,
  `host_header` string,
  `cs_protocol` string,
  `cs_bytes` int,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `result_type` string,
  `protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
) LOCATION 's3://logs'

请注意,双反斜杠是有意的。

Cloudfront日志的格式在某些时候进行了更改,以添加protocol。这可以处理旧文件和新文件。

关于amazon-web-services - Amazon Athena不会解析Cloudfront日志,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42601590/

相关文章:

amazon-web-services - AWS : Automating queries in redshift

hadoop - 如何查找我的 hadoop 集群的集群 ID?

python - 如何覆盖 AWS Athena 表

amazon-web-services - AWS Elastic Beanstalk - 在环境之间切换环境变量

ruby-on-rails - 雾警告和 AWS : unable to load the 'unf' gem

git - AWS CodeDeploy 找不到私有(private)存储库 "could not download bundle"的 github tar 链接

amazon-redshift - 以Parquet格式将数据文件从Amazon Redshift卸载到Amazon S3

sql - 新时间戳上的分区日期以获取上一个时间戳

amazon-web-services - 如何解决 AWS 爬虫在双引号内拆分逗号并破坏数据目录的问题?

amazon-web-services - Boto3雅典娜查询而无需将数据保存到S3