apache-spark - How to add a jar in a Spark job using HiveContext

Tags: apache-spark spark-streaming apache-spark-sql

I am trying to add a JSONSerDe jar file so that I can access JSON data and load it from a Spark job into a Hive table. My code looks like this:

SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(10));
final SQLContext sqlContext = new SQLContext(sc);
final HiveContext hiveContext = new HiveContext(sc);

hiveContext.sql("ADD JAR hdfs://localhost:8020/tmp/hive-serdes-1.0-SNAPSHOT.jar");
hiveContext.sql("LOAD DATA INPATH '/tmp/mar08/part-00000' OVERWRITE INTO TABLE testjson");

But I end up with the following error:

java.net.MalformedURLException: unknown protocol: hdfs
        at java.net.URL.<init>(URL.java:592)
        at java.net.URL.<init>(URL.java:482)
        at java.net.URL.<init>(URL.java:431)
        at java.net.URI.toURL(URI.java:1096)
        at org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
        at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
        at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
        at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
        at com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase$2.call(KafkaStreamToHbase.java:148)
        at com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase$2.call(KafkaStreamToHbase.java:141)
        at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:327)
        at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:327)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
        at scala.util.Try$.apply(Try.scala:161)

I am able to add the jar through the hive shell. But when I try to add it with hiveContext.sql() in the Spark job (Java code), it throws the error above. Quick help would be much appreciated.

Thanks.

Best Answer

One workaround is to pass the UDF/SerDe jar at launch time via the --jars option of the spark-submit command, or to copy the required jars into Spark's lib directory.
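For illustration only, here is a minimal sketch of the programmatic counterpart of that suggestion: setting spark.jars on the SparkConf (the configuration that --jars populates) before the contexts are created, reusing the app name and HDFS path from the question. This is a sketch under those assumptions, not the answerer's exact code.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class AddSerdeJarSketch {
    public static void main(String[] args) {
        // "spark.jars" behaves like --jars: the listed jars are shipped to the
        // executors and added to their classpaths when the SparkContext starts.
        SparkConf conf = new SparkConf()
                .setAppName("KafkaStreamToHbase")
                .set("spark.jars", "hdfs://localhost:8020/tmp/hive-serdes-1.0-SNAPSHOT.jar");

        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc);

        // With the SerDe jar already distributed, the ADD JAR statement that
        // failed with the hdfs:// URL is not needed here.
        hiveContext.sql("LOAD DATA INPATH '/tmp/mar08/part-00000' OVERWRITE INTO TABLE testjson");

        sc.stop();
    }
}

Passing --jars directly to spark-submit is still the more robust route, since spark-submit can also make the jar visible to the driver JVM before it starts; the sketch above mainly takes care of the executors.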

Basically, ADD JAR supports the file, hdfs and ivy schemes.
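Purely to illustrate those schemes, and reusing the hiveContext from the question's snippet (the paths and the Ivy coordinates below are placeholders, and the hdfs form only works on versions where it is supported):

// Local file on the driver machine
hiveContext.sql("ADD JAR file:///tmp/hive-serdes-1.0-SNAPSHOT.jar");
// HDFS path -- the form that triggers the MalformedURLException on the asker's version
hiveContext.sql("ADD JAR hdfs://localhost:8020/tmp/hive-serdes-1.0-SNAPSHOT.jar");
// Ivy coordinates (newer Spark releases); the coordinates here are made up
hiveContext.sql("ADD JAR ivy://com.example:json-serde:1.0");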

Which version of Spark are you using? I can't see an addJar method in the latest version of ClientWrapper.scala.

Regarding "apache-spark - How to add a jar in a Spark job using HiveContext", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37814749/
