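I am not able to deserialize protobuf data that contains a repeated string field using Elephant-Bird 4.14 with Hive. This seems to be because the repeated string feature is only available in Protobuf 2.6 and not in Protobuf 2.5. When I run my Hive queries on an AWS EMR cluster, Hive uses the Protobuf 2.5 that is bundled with AWS Hive. Even after explicitly adding the Protobuf 2.6 jar, I cannot get rid of this error. I want to know how to make Hive use the Protobuf 2.6 jar that I add explicitly.
Below is the Hive query used: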
add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
add jar s3://gam.test/hive-jars/GAMDataModel-1.0.jar;
add jar s3://gam.test/hive-jars/GAMCoreModel-1.0.jar;
add jar s3://gam.test/hive-jars/GAMAccessLayer-1.1.jar;
add jar s3://gam.test/hive-jars/RodbHiveStorageHandler-0.12.0-jarjar-final.jar;
add jar s3://gam.test/hive-jars/elephant-bird-core-4.14.jar;
add jar s3://gam.test/hive-jars/elephant-bird-hive-4.14.jar;
add jar s3://gam.test/hive-jars/elephant-bird-hadoop-compat-4.14.jar;
add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
add jar s3://gam.test/hive-jars/GamProtoBufHiveDeserializer-1.0-jarjar.jar;
drop table GamRelationRodb;
CREATE EXTERNAL TABLE GamRelationRodb
row format serde "com.amazon.hive.serde.GamProtobufDeserializer"
with serdeproperties("serialization.class"=
"com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper")
STORED BY 'com.amazon.rodb.hadoop.hive.RodbHiveStorageHandler' TBLPROPERTIES
("file.name" = 'GAM_Relationship',"file.path" ='s3://pathtofile/');
select * from GamRelationRodb limit 10;
Below is the format of the Protobuf file:
message RepeatedRelationshipWrapper {
repeated relationship.Relationship relationships = 1;
}
message Relationship {
required RelationshipType type = 1;
repeated string ids = 2;
}
enum RelationshipType {
UKNOWN_RELATIONSHIP_TYPE = 0;
PARENT = 1;
CHILD = 2;
}
Below is the runtime exception thrown when the query runs:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:215)
at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:137)
at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:239)
at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:234)
at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:126)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:72)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:162)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:157)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:495)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:355)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:337)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:170)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:882)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.fromBytes(ProtobufConverter.java:66)
at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:59)
at com.amazon.hive.serde.GamProtobufDeserializer.deserialize(GamProtobufDeserializer.java:63)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:502)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Best answer
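Protobuf is a brittle library. It may be wire-format compatible across 2.x versions, but the classes generated by protoc will only link against a protobuf JAR of exactly the same version as the protoc compiler that produced them.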
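One way to check which protobuf JAR actually wins on the classpath is a small reflection probe. The following is a minimal sketch, not part of the original answer: the class name `ProtobufClasspathCheck` and its helper methods are hypothetical, and it assumes you can run it with the same classpath that the Hive CLI uses. It looks up `com.google.protobuf.LazyStringList`, reports where it was loaded from, and checks for the `getUnmodifiableView()` method named in the stack trace above.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; the class and method names are illustrative only.
public class ProtobufClasspathCheck {

    /** Returns the location a class was loaded from, or a note if it is unknown. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK typically have no code source.
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** True if the class exposes a public no-argument method with this name. */
    public static boolean hasNoArgMethod(Class<?> cls, String methodName) {
        try {
            cls.getMethod(methodName);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Builds a one-line report for the given class and method. */
    public static String describe(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            return className + " loaded from " + locationOf(cls)
                    + "; " + methodName + "() present: " + hasNoArgMethod(cls, methodName);
        } catch (ClassNotFoundException e) {
            return className + " is not on the classpath";
        }
    }

    public static void main(String[] args) {
        // The class and method below are taken from the NoSuchMethodError in the question.
        System.out.println(describe("com.google.protobuf.LazyStringList", "getUnmodifiableView"));
    }
}
```

If the report points at the protobuf JAR bundled with Hive and says the method is absent, that would be consistent with the `NoSuchMethodError` above: the generated classes expect a newer runtime than the one that was loaded first.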
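Fundamentally, this means you cannot update protobuf unless the update is choreographed across all dependencies. The Great Protobuf upgrade in 2013 was when Hadoop, HBase, Hive, etc. all upgraded together; since then everyone has been frozen at v2.5, probably for the life of the entire Hadoop 2.x code line, unless it all gets shaded away or Java 9 hides the problem.
We are more scared of protobuf updates than of upgrades to Guava and Jackson, because the latter only break every individual library, not the wire format.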
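Watch HADOOP-13363 on the topic of a 2.x upgrade, and HDFS-11010 on the question of moving to protobuf 3 in Hadoop trunk. That one is messy because it does change the wire format, protobuf-json marshalling breaks, and so on.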
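It is best to conclude that "binary compatibility of protobuf code has been found lacking" and stick with protobuf 2.5. Sorry.
You could take the entire stack of libraries you want to use and rebuild them with an updated protoc compiler, a matching protobuf.jar, and any other patches that need to be applied. I would only recommend that to the bold, but I am curious about the outcome. If you do attempt this, let us know how it worked out.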
Further reading: fear of dependencies
Regarding amazon-web-services - unable to deserialize Protobuf (2.6.1) data using Elephant-Bird and Hive in AWS, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43140288/