python - PySpark fails in Jupyter after setting PYSPARK_SUBMIT_ARGS

I am trying to load a Spark (2.2.1) package in a Jupyter notebook; apart from that, Spark runs fine. As soon as I add

%env PYSPARK_SUBMIT_ARGS='--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'

I get this error when I try to create the context:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-b25d0ed9494e> in <module>()
----> 1 sc = SparkContext.getOrCreate()
      2 sql_context = SQLContext(sc)

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in getOrCreate(cls, conf)
    332         with SparkContext._lock:
    333             if SparkContext._active_spark_context is None:
--> 334                 SparkContext(conf=conf or SparkConf())
    335             return SparkContext._active_spark_context
    336 

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    113         """
    114         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    281         with SparkContext._lock:
    282             if not SparkContext._gateway:
--> 283                 SparkContext._gateway = gateway or launch_gateway(conf)
    284                 SparkContext._jvm = SparkContext._gateway.jvm
    285 

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/java_gateway.py in launch_gateway(conf)
     93                 callback_socket.close()
     94         if gateway_port is None:
---> 95             raise Exception("Java gateway process exited before sending the driver its port number")
     96 
     97         # In Windows, ensure the Java child processes do not linger after Python has exited.

Exception: Java gateway process exited before sending the driver its port number

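Again, everything works as long as PYSPARK_SUBMIT_ARGS is unset (or set to just pyspark-shell). As soon as I add anything else (for example, if I set it to --master local pyspark-shell), I get this error. After searching on Google, most people suggest simply removing PYSPARK_SUBMIT_ARGS, which I cannot do for obvious reasons.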

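I have also tried setting JAVA_HOME, although I don't see why that would make a difference, since Spark works without that environment variable. The arguments I am passing work with spark-submit and pyspark outside of Jupyter.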

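I guess my first question is: is there any way to get a more detailed error message? Is there a log file somewhere? The current message tells me essentially nothing.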

Best answer

Set PYSPARK_SUBMIT_ARGS as follows, before the SparkContext is initialized:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'
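
A plausible reason this assignment behaves differently from the %env line in the question: IPython's %env magic appears to store everything after the = sign literally, surrounding quotes included, and PySpark's launch_gateway is assumed here to tokenize the value with shlex.split before passing it to spark-submit. Under those two assumptions, the quoted value collapses into a single argument that spark-submit cannot parse, so the JVM exits before reporting its port. The sketch below only demonstrates that tokenization; diagnose_submit_args is a hypothetical helper, not part of PySpark.

```python
import shlex


def diagnose_submit_args(value):
    """Hypothetical helper: list likely problems in a PYSPARK_SUBMIT_ARGS value."""
    problems = []
    tokens = shlex.split(value)
    # A token that contains a space means quotes were stored inside the value.
    if any(" " in tok for tok in tokens):
        problems.append("quoted run of arguments collapsed into one token")
    # Both the question and the answer end the value with pyspark-shell.
    if not tokens or tokens[-1] != "pyspark-shell":
        problems.append("last token is not pyspark-shell")
    return problems


# Value as a plain Python assignment to os.environ stores it.
plain = "--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell"

# Assumption: what ends up in the environment when the quotes are typed after %env.
quoted = "'" + plain + "'"

print(shlex.split(plain))    # three separate arguments
print(shlex.split(quoted))   # a single argument containing spaces
print(diagnose_submit_args(plain))
print(diagnose_submit_args(quoted))
```

If this is indeed the cause, printing os.environ['PYSPARK_SUBMIT_ARGS'] in the notebook should show the stray quotes, and dropping them from the %env line should have the same effect as the os.environ assignment. This has not been verified against Spark 2.2.1.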

A similar question about python - PySpark fails in Jupyter after setting PYSPARK_SUBMIT_ARGS can be found on Stack Overflow: https://stackoverflow.com/questions/49020056/
