avro - Parquet data timestamp column INT96 not yet implemented in Druid overlord Hadoop task

Tags: avro, emr, parquet, druid

Context:

I am able to submit MapReduce jobs from the Druid overlord to EMR. My data source is in S3, in Parquet format. The Parquet data has a timestamp column (INT96), which the Avro schema does not support.

Error while parsing the timestamp

The stack trace of the problem is:

Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
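The failure happens inside parquet-avro's schema converter as soon as it encounters the INT96 physical type. To confirm which columns are stored as INT96, the Parquet-level schema can be inspected directly; a quick sketch with pyarrow (the file name is a placeholder):

import pyarrow.parquet as pq

# Print the Parquet physical schema; INT96 columns are listed as such
print(pq.ParquetFile("part-00000.parquet").schema)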

Environment:

Druid version: 0.11
EMR version : emr-5.11.0
Hadoop version: Amazon 2.7.3

Druid input JSON:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_test1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2017-08-01T00:00:00/2017-08-02T00:00:00"]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "yyyy-MM-dd HH:mm:ss:SSS zzz"            
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1","dim2","dim3"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      },{
        "type": "count",
        "name": "pid",
        "fieldName": "pid"
      }]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3.awsAccessKeyId" : "KEYID",
        "fs.s3.awsSecretAccessKey" : "AccessKey",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId" : "KEYID",
        "fs.s3n.awsSecretAccessKey" : "AccessKey",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "leaveIntermediate": true
    }
  }, "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
}

Possible solutions

 1. Save the data in Parquet in a form that avoids the Avro conversion (e.g. with INT64 rather than INT96 timestamps), removing the dependency on the Avro schema converter (see the sketch after this list).

 2. Fix the Avro schema converter (AvroSchemaConverter) to support Parquet's INT96 timestamp format.
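For option 1, a minimal sketch of rewriting the data with INT64 timestamps, assuming the files are produced with Spark 2.3+ (where the spark.sql.parquet.outputTimestampType option exists) and using placeholder S3 paths:

from pyspark.sql import SparkSession

# Sketch: rewrite INT96 timestamps as INT64 (TIMESTAMP_MILLIS), which
# parquet-avro can convert. The paths below are placeholders.
spark = (
    SparkSession.builder
    .appName("rewrite-int96-timestamps")
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
    .getOrCreate()
)

df = spark.read.parquet("s3://s3_path")
df.write.mode("overwrite").parquet("s3://s3_path_int64")

The inputSpec paths in the ingestion spec would then point at the rewritten copy.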

Best answer

Druid 0.17.0 and later supports the Parquet INT96 type via the Parquet Hadoop Parser.

The Parquet Hadoop Parser supports int96 Parquet values, while the Parquet Avro Hadoop Parser does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of flattenSpec.

https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#parquet-hadoop-parser
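Per the linked docs, moving to the Parquet Hadoop Parser means loading the druid-parquet-extensions extension and changing two things in the spec above: the inputFormat class and the parser type. A sketch of the affected fragments (the timestampSpec format may need adjusting for the actual column encoding):

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "paths": "s3://s3_path"
  }
},
"dataSchema": {
  "parser": {
    "type": "parquet",
    "parseSpec": {
      "format": "timeAndDims",
      "timestampSpec": { "column": "t", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["dim1", "dim2", "dim3"] }
    }
  }
}

The rest of the spec (granularitySpec, metricsSpec, tuningConfig) carries over unchanged.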

A similar question about "avro - Parquet data timestamp column INT96 not yet implemented in Druid overlord Hadoop task" can be found on Stack Overflow: https://stackoverflow.com/questions/48366196/
