pyspark - pickle.PicklingError: Cannot pickle files that are not opened for reading

Tags: pyspark pickle google-cloud-dataproc

I am getting this error when running a pyspark job on Dataproc. What could be the cause?

Here is the stack trace of the error:

  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 553, in save_reduce
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 582, in save_file
  pickle.PicklingError: Cannot pickle files that are not opened for reading
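As a minimal, hypothetical reproduction outside Spark: the standard pickle module also refuses open file handles, which is essentially what Spark's cloudpickle is complaining about in the traceback above (the file path and dict contents below are made up for illustration).

```python
import os
import pickle
import tempfile

# A dict of plain values serializes without trouble.
assert pickle.loads(pickle.dumps({"a": 1})) == {"a": 1}

# A dict holding an open file handle does not: plain pickle raises an
# error, and cloudpickle raises the PicklingError seen in the traceback.
fd, path = tempfile.mkstemp()
os.close(fd)
handle = open(path, "w")
try:
    pickle.dumps({"handle": handle})
except Exception as exc:
    print("cannot pickle:", type(exc).__name__)
finally:
    handle.close()
    os.remove(path)
```

Anything a map function closes over is pickled and shipped to the workers, so a file handle hiding inside such a dictionary triggers exactly this failure.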

Best Answer

I found the problem: I was using a dictionary inside the map function.
It failed because the worker nodes could not access the dictionary I was passing into the map function.

Solution:

I broadcast the dictionary and then used it inside the map function:

sc = SparkContext()
lookup_bc = sc.broadcast(lookup_dict)

Then, inside the function, I fetched values with:

data = lookup_bc.value.get(key)

Hope this helps!

Regarding "pyspark - pickle.PicklingError: Cannot pickle files that are not opened for reading", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43977279/
