python - Pyspark UDF unable to use a large dictionary

Tags: python dictionary pyspark user-defined-functions amazon-emr

I have a dictionary where each key is a word and each value is an array of 300 floats. I am unable to use this dictionary in my pyspark UDF: it fails when the dictionary has 2 million keys, but works when I reduce it to 200K keys.

Here is the code for the function that I convert into a UDF:

import numpy as np

def get_sentence_vector(sentence, dictionary_containing_word_vectors):
    cleanedSentence = list(clean_text(sentence))
    words_vector_list = np.zeros(300)  # 300-dimensional vector
    for x in cleanedSentence:
        try:
            words_vector_list = np.add(words_vector_list, dictionary_containing_word_vectors[str(x)])
        except Exception as e:
            print("Exception caught while finding word vector from FastText pretrained model dictionary: ", e)
    return words_vector_list.tolist()
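
For reference, a quick sanity check of the function with a toy dictionary; clean_text here is a hypothetical stand-in, since the real helper is not shown in the question:

import numpy as np

def clean_text(sentence):
    # hypothetical stand-in for the question's tokenizer
    return sentence.lower().split()

toy_dict = {"hello": np.ones(300), "world": np.full(300, 2.0)}
vec = get_sentence_vector("Hello world", toy_dict)
print(len(vec), vec[0])  # 300 3.0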

Here is my UDF:

get_sentence_vector_udf = F.udf(lambda val: get_sentence_vector(val, fast_text_dictionary), ArrayType(FloatType()))

And here is how I call the udf to add the result as a column to my dataframe:

dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))

Here is the stack trace of the error:

Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 83, in dump
    pickle.dump(value, f, 2)
SystemError: error return without exception set
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1957, in wrapper
    return udf_obj(*args)
  File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1916, in __call__
    judf = self._judf
  File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1900, in _judf
    self._judf_placeholder = self._create_judf()
  File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1909, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1866, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2377, in _prepare_for_python_RDD
    broadcast = sc.broadcast(pickled_command)
  File "/usr/lib/spark/python/pyspark/context.py", line 799, in broadcast
    return Broadcast(self, value, self._pickled_broadcast_vars)
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 74, in __init__
    self._path = self.dump(value, f)
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 90, in dump
    raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set

Best answer

How large is your fast_text_dictionary in the 2M case? With 2 million keys and 300 floats per value, the raw float data alone is on the order of 2,000,000 × 300 × 8 bytes ≈ 4.8 GB, which is most likely too large to pickle into the UDF's command (that is what fails in your stack trace). Try broadcasting it before running the udf, e.g.

broadcastVar = sc.broadcast(fast_text_dictionary)

Then use broadcastVar.value inside the udf instead.
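
Put together, a minimal sketch using the names from the question (the lambda now closes over the lightweight Broadcast handle instead of the dictionary itself, and executors read the dictionary through broadcastVar.value):

broadcastVar = sc.broadcast(fast_text_dictionary)

# The closure now captures only the Broadcast handle; the dictionary is
# shipped to each executor once rather than pickled into the UDF command.
get_sentence_vector_udf = F.udf(
    lambda val: get_sentence_vector(val, broadcastVar.value),
    ArrayType(FloatType()),
)

dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn(
    "sentence_vector", get_sentence_vector_udf(df.item_name)
)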

See the Spark documentation for broadcast variables.

Regarding "python - Pyspark UDF unable to use a large dictionary", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57560189/
