python-3.x - Error when running a PySpark Dataproc job due to the Python version

Tags: python-3.x apache-spark google-cloud-dataproc

I created a Dataproc cluster with the following command:

gcloud dataproc clusters create datascience \
--initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh \

However, when I submit a PySpark job, I get the following error:

Exception: Python in worker has different version 3.4 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Any ideas?

Best answer

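This is caused by the master and the workers running different Python versions. By default, the jupyter image installs the latest version of Miniconda, which uses Python 3.7. The workers, however, still use the default Python 3.6.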
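To make the failure mode concrete, here is a minimal sketch of the kind of comparison behind the error. It is a simplified illustration, not PySpark's actual source: the helper names `minor_version` and `check_worker_version` are made up for this example. The point is that only the `major.minor` part of the version is compared, so 3.7.0 and 3.7.3 would be compatible while 3.7 and 3.4 are not.

```python
import sys


def minor_version(version_info=None):
    """Return 'major.minor' for a version tuple (default: the running interpreter)."""
    info = sys.version_info if version_info is None else version_info
    return "%d.%d" % (info[0], info[1])


def check_worker_version(driver_version, worker_version):
    """Raise if the two 'major.minor' strings differ (simplified illustration)."""
    if driver_version != worker_version:
        raise RuntimeError(
            "Python in worker has different version %s than that in driver %s"
            % (worker_version, driver_version)
        )


if __name__ == "__main__":
    # Version of the interpreter running this script, e.g. on the master node.
    print("this interpreter:", minor_version())
    try:
        # The two versions reported in the error message from the question.
        check_worker_version("3.7", "3.4")
    except RuntimeError as exc:
        print("mismatch:", exc)
```

Running the same script with the default `python` on the master and on a worker is a quick way to see which interpreter each node picks up.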

Solution:
- Specify the Miniconda version when creating the master node, i.e. install Python 3.6 on the master node:

gcloud dataproc clusters create example-cluster --metadata=MINICONDA_VERSION=4.3.30

Note:

This may need to be revisited in favor of a more sustainable way of managing the environment.
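The error message also points at the `PYSPARK_PYTHON` environment variable. As a rough sketch of that alternative, and assuming an interpreter with the same minor version as the driver exists at the same path on every node, the job script could pin the worker interpreter before the SparkContext is created. The helper name `pin_worker_python` and the path `/usr/bin/python3` are placeholders for illustration, not values taken from the cluster above.

```python
import os


def pin_worker_python(interpreter_path, env=None):
    """Set PYSPARK_PYTHON in the given mapping (default: os.environ) and return it."""
    target = os.environ if env is None else env
    target["PYSPARK_PYTHON"] = interpreter_path
    return target


if __name__ == "__main__":
    # Placeholder path: replace with an interpreter that exists on every node
    # and matches the driver's minor version.
    pin_worker_python("/usr/bin/python3")
    print("PYSPARK_PYTHON =", os.environ["PYSPARK_PYTHON"])
    # Create the SparkSession / SparkContext only after this point, so that the
    # setting is visible when the context is initialized.
```

`PYSPARK_DRIVER_PYTHON` selects the driver's interpreter at launch, so it would have to be set in the environment before the job starts rather than inside the script.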

Regarding "python-3.x - Error when running a PySpark Dataproc job due to the Python version", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/51427175/
