Context:
I am able to submit MapReduce jobs from the Druid overlord to EMR. My data source is Parquet files on S3. The Parquet data has a timestamp column stored as INT96, which the Avro schema does not support.
An error occurs while parsing the timestamp.
The stack trace of the problem is:
Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
Environment:
Druid version: 0.11
EMR version : emr-5.11.0
Hadoop version: Amazon 2.7.3
Druid input JSON:
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_test1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2017-08-01T00:00:00/2017-08-02T00:00:00"]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "yyyy-MM-dd HH:mm:ss:SSS zzz"
          },
          "dimensionsSpec": {
            "dimensions": ["dim1", "dim2", "dim3"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "count",
          "name": "pid",
          "fieldName": "pid"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3.awsAccessKeyId": "KEYID",
        "fs.s3.awsSecretAccessKey": "AccessKey",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "KEYID",
        "fs.s3n.awsSecretAccessKey": "AccessKey",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "leaveIntermediate": true
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
}
Possible solutions:
1. Save the data in Parquet in a form that does not require the Avro conversion, removing that dependency.
2. Fix the Avro schema converter to support Parquet's INT96 timestamp format.
Best Answer
Druid 0.17.0 and later supports the Parquet INT96 type via the Parquet Hadoop Parser.
The Parquet Hadoop Parser supports int96 Parquet values, while the Parquet Avro Hadoop Parser does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of flattenSpec.
https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#parquet-hadoop-parser
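Against 0.17.0, a sketch of the relevant changes to the question's spec, per the linked docs (note the inputFormat package is org.apache.druid rather than io.druid after the project moved to Apache; the paths, column name, and "auto" timestamp format here are assumptions carried over from or substituted into the original spec):

```
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "paths": "s3://s3_path"
  }
},
"parser": {
  "type": "parquet",
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": { "column": "t", "format": "auto" }
  }
}
```

The parser type stays "parquet"; it is the Parquet Avro Hadoop Parser ("parquet-avro") that routes through the Avro converter and therefore cannot read INT96.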
Regarding "avro - Parquet Data timestamp column INT96 not yet implemented in Druid Overlord Hadoop task", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48366196/