amazon-web-services - Unable to deserialize Protobuf (2.6.1) data with Elephant-Bird and Hive on AWS

Tags: amazon-web-services hadoop hive protocol-buffers elephantbird

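I cannot deserialize protobuf data that contains repeated strings using Elephant-Bird 4.14 with Hive. This seems to be because the repeated-string feature is only available in Protobuf 2.6 and not in Protobuf 2.5. When I run my Hive query on an AWS EMR cluster, it uses the Protobuf 2.5 that is bundled with AWS Hive. Even after explicitly adding the Protobuf 2.6 jar, I cannot get rid of this error. I would like to know how to make Hive use the Protobuf 2.6 jar that I added explicitly.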

Below is the Hive query I used:

    add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
    add jar s3://gam.test/hive-jars/GAMDataModel-1.0.jar;
    add jar s3://gam.test/hive-jars/GAMCoreModel-1.0.jar;
    add jar s3://gam.test/hive-jars/GAMAccessLayer-1.1.jar;
    add jar s3://gam.test/hive-jars/RodbHiveStorageHandler-0.12.0-jarjar-final.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-core-4.14.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-hive-4.14.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-hadoop-compat-4.14.jar;
    add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
    add jar s3://gam.test/hive-jars/GamProtoBufHiveDeserializer-1.0-jarjar.jar;
    drop table GamRelationRodb;

    CREATE EXTERNAL TABLE GamRelationRodb
    row format serde "com.amazon.hive.serde.GamProtobufDeserializer"
    with serdeproperties("serialization.class"= 
 "com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper")
    STORED BY 'com.amazon.rodb.hadoop.hive.RodbHiveStorageHandler' TBLPROPERTIES 
    ("file.name" = 'GAM_Relationship',"file.path" ='s3://pathtofile/');

    select * from GamRelationRodb limit 10;

Below is the format of the Protobuf file:

    message RepeatedRelationshipWrapper {
        repeated relationship.Relationship relationships = 1;
    }

    message Relationship {
        required RelationshipType type = 1;
        repeated string ids = 2;
    }

    enum RelationshipType {
        UKNOWN_RELATIONSHIP_TYPE = 0;
        PARENT = 1;
        CHILD = 2;
    }

Below is the runtime exception thrown when the query runs:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:215)
    at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:137)
    at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:239)
    at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:234)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:126)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:72)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:162)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:157)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:495)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:355)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:337)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
    at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:170)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:882)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
    at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.fromBytes(ProtobufConverter.java:66)
    at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:59)
    at com.amazon.hive.serde.GamProtobufDeserializer.deserialize(GamProtobufDeserializer.java:63)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:502)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Best answer

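Protobuf is a brittle library. It may be wire-format compatible across the 2.x versions, but the classes generated by protoc will only link against a protobuf JAR of exactly the same version as the protoc compiler that produced them.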
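One way to see this link failure directly is to ask the JVM which protobuf JAR it actually loaded and whether that JAR has the method named in the stack trace. The following is only a minimal sketch: the class name `ProtobufLinkCheck` and its helper methods are made up for illustration, and it assumes you can run it with the same classpath that the Hive CLI uses.

```java
import java.security.CodeSource;

public class ProtobufLinkCheck {

    /** True if the class can be loaded and has a public no-argument method with this name. */
    public static boolean hasNoArgMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    /** Location (JAR or directory) the class was loaded from, or a placeholder string. */
    public static String loadedFrom(String className) {
        try {
            CodeSource src = Class.forName(className).getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap class loader)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class and method taken from the NoSuchMethodError in the question.
        String cls = "com.google.protobuf.LazyStringList";
        System.out.println(cls + " loaded from: " + loadedFrom(cls));
        System.out.println("getUnmodifiableView() present: "
                + hasNoArgMethod(cls, "getUnmodifiableView"));
    }
}
```

If the reported location is the protobuf 2.5 JAR shipped with Hive rather than the 2.6.1 JAR added with `add jar`, the bundled copy is taking precedence on the classpath, and code generated by the 2.6 protoc cannot link against it.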

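Fundamentally, this means you cannot update protobuf except by choreographing the change across all dependencies. The Great Protobuf upgrade of 2013 was when Hadoop, Hbase, Hive &c upgraded together. After that, everyone froze at v2.5, probably for the entire life of the Hadoop 2.x code line, unless it all gets shaded away or Java 9 hides the problem.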

We are more scared of protobuf updates than of upgrades to Guava and Jackson, because the latter only break every individual library, not the wire format.

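Watch HADOOP-13363 for the topic of an upgrade within 2.x, and HDFS-11010 for the question of upgrading to protobuf 3 in Hadoop trunk. The latter is messy because it does change the wire format, protobuf-json marshalling breaks, and so on.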

It is best to conclude that "binary compatibility of protobuf code has been found lacking" and stick with protobuf 2.5. Sorry.

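You could take the entire stack of libraries you want to use and rebuild them with an updated protoc compiler, a matching protobuf.jar, and any other patches you need to apply. I would only recommend that to the bold, but I am curious about the outcome. If you do try it, let us know how it worked out.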

Further reading: fear of dependencies

Regarding "amazon-web-services - Unable to deserialize Protobuf (2.6.1) data with Elephant-Bird and Hive on AWS", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43140288/
