class ProdsTransformer:
    def __init__(self):
        self.products_lookup_hmap = {}
        self.broadcast_products_lookup_map = None

    def create_broadcast_variables(self):
        self.broadcast_products_lookup_map = sc.broadcast(self.products_lookup_hmap)

    def create_lookup_maps(self):
        # The code here builds the hashmap that maps Prod_ID to another space.
        ...

pt = ProdsTransformer()
pt.create_broadcast_variables()

pairs = distinct_users_projected.map(lambda x: (x.user_id,
    pt.broadcast_products_lookup_map.value[x.Prod_ID]))
I get the following error:

"Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."

Any help on how to handle broadcast variables would be great!
Best Answer
By referencing the object that contains the broadcast variable inside the map lambda, you make Spark try to serialize the entire object and ship it to the workers. Because that object holds a reference to the SparkContext, you get the error. Instead of this:
pairs = distinct_users_projected.map(lambda x: (x.user_id, pt.broadcast_products_lookup_map.value[x.Prod_ID]))
Try this:
bcast = pt.broadcast_products_lookup_map
pairs = distinct_users_projected.map(lambda x: (x.user_id, bcast.value[x.Prod_ID]))
The latter avoids any reference to the object (pt), so Spark only needs to ship the broadcast variable itself.
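The failure mode can be sketched without a Spark cluster, using plain pickle (a minimal analogy: here a threading.Lock stands in for the unpicklable SparkContext, and serializing the whole object fails the same way Spark's closure serializer does when a lambda captures pt):

```python
import pickle
import threading

class ProdsTransformer:
    def __init__(self):
        self.products_lookup_hmap = {"P1": 101}
        # Stand-in for the SparkContext reference: locks cannot be pickled.
        self.sc = threading.Lock()

pt = ProdsTransformer()

# Serializing the whole object fails, analogous to a map lambda
# that references `pt.broadcast_products_lookup_map`:
try:
    pickle.dumps(pt)
    whole_object_picklable = True
except TypeError:
    whole_object_picklable = False

# Pulling just the lookup map into a local variable first works,
# analogous to assigning the broadcast variable to `bcast`:
lookup = pt.products_lookup_hmap
restored = pickle.loads(pickle.dumps(lookup))
```

The same reasoning explains why the one-line fix above works: the lambda's closure then contains only the broadcast variable handle, which Spark is designed to serialize, and not the driver-side object graph.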
About python - Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31508689/