hadoop - Apache Toree with an Anaconda Jupyter Notebook

Tags: hadoop jupyter-notebook apache-toree

I would like to ask for help with an Anaconda Jupyter notebook. I want to write PySpark and SparkR in Jupyter Notebook, and I followed an online tutorial that explains how to install Apache Toree alongside Jupyter Notebook.

I am using Cloudera Manager Parcels to manage my Kerberized Hadoop cluster.

However, I cannot start the Apache Toree PySpark kernel, and the following error appears in the server log.

[I 15:24:50.529 NotebookApp] Creating new notebook in
[I 15:24:52.079 NotebookApp] Kernel started: 8cb4838c-2171-4672-96a4-b21ef191ffc6
Starting Spark Kernel with SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p2024.2115/lib/spark) overrides detected (/opt/cloudera/parcels/CDH/lib/spark).
WARNING: Running spark-class from user-defined location.
Exception in thread "main" java.lang.NoSuchMethodError: joptsimple.OptionParser.acceptsAll(Ljava/util/Collection;Ljava/lang/String;)Ljoptsimple/OptionSpecBuilder;
    at org.apache.toree.boot.CommandLineOptions.<init>(CommandLineOptions.scala:37)
    at org.apache.toree.Main$delayedInit$body.apply(Main.scala:25)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at org.apache.toree.Main$.main(Main.scala:24)
    at org.apache.toree.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have placed jopt-simple-4.5.jar into both the Toree lib directory and the Spark home. Do I have to put the jar somewhere else so that it can be found when I try to create a new notebook? Thanks.

Best regards, Luka

Best Answer

The simplest solution I found is to add the following options to spark-submit:

--conf "spark.driver.extraClassPath=/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.1.0-incubating.jar" --conf "spark.executor.extraClassPath=/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.1.0-incubating.jar"

These options can be added to the __TOREE_SPARK_OPTS__ variable in the /usr/local/share/jupyter/kernels/apache_toree_scala/kernel.json file, or appended directly to the spark-submit command in the bash script /usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh.
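For the kernel.json route, a minimal sketch of what the file could look like; the surrounding fields (display_name, argv, and so on) are assumptions based on a typical Toree install, and only the __TOREE_SPARK_OPTS__ value comes from the flags above. SPARK_HOME matches the path from the log:

    {
      "display_name": "Apache Toree - Scala",
      "language": "scala",
      "argv": [
        "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
        "--profile",
        "{connection_file}"
      ],
      "env": {
        "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark",
        "__TOREE_SPARK_OPTS__": "--conf spark.driver.extraClassPath=/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.1.0-incubating.jar --conf spark.executor.extraClassPath=/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.1.0-incubating.jar"
      }
    }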

Adding this forces the classloader to load joptsimple.OptionParser from the Toree JAR instead of from the default CDH libraries, whose older jopt-simple lacks the acceptsAll(Collection, String) method that the NoSuchMethodError complains about.
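To confirm the conflict on your own cluster, something like the following can be run (a sketch; the CDH jar locations are an assumption and vary by parcel version):

    # List jopt-simple jars shipped inside the CDH parcel (exact paths vary by release)
    find /opt/cloudera/parcels/CDH/ -name 'jopt-simple*.jar' 2>/dev/null

    # Verify the Toree assembly bundles its own copy of joptsimple.OptionParser
    unzip -l /usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.1.0-incubating.jar | grep joptsimple/OptionParser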

P.S. Here is a Toree build compatible with CDH 5.10.0: https://github.com/Myllyenko/incubator-toree/releases

For hadoop - Apache Toree with an Anaconda Jupyter Notebook, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42898590/
