python - How to create a udf in PySpark which returns an array of strings?

Tags: python apache-spark pyspark apache-spark-sql user-defined-functions

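I have a udf which returns a list of strings. This should not be too hard. I pass in the return data type when creating the udf, since it returns an array of strings: ArrayType(StringType).

Somehow, this is not working.

The dataframe I'm operating on is df_subsets_concat, and it looks like this: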

df_subsets_concat.show(3,False)

+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

The code is:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType, StringType

my_udf = lambda domain: ['s','n']
label_udf = udf(my_udf, ArrayType(StringType))
df_subsets_concat_with_md = df_subsets_concat.withColumn('subset', label_udf(df_subsets_concat.col1))

The result is:

/usr/lib/spark/python/pyspark/sql/types.py in __init__(self, elementType, containsNull)
    288         False
    289         """
--> 290         assert isinstance(elementType, DataType), "elementType should be DataType"
    291         self.elementType = elementType
    292         self.containsNull = containsNull

AssertionError: elementType should be DataType

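As far as I understand, this is the correct way to do it. Here are some resources I looked at:

pySpark Data Frames "assert isinstance(dataType, DataType), "dataType should be DataType"
How to return a "Tuple type" in a UDF in PySpark?

But neither of these helped me work out why this is failing. I am using pyspark 1.6.1.

How do I create a udf in pyspark which returns an array of strings?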

Best answer

You need to pass an instance of StringType, not the class itself:

label_udf = udf(my_udf, ArrayType(StringType()))
#                                           ^^ 
df.withColumn('subset', label_udf(df.col1)).show()
+------------+------+
|        col1|subset|
+------------+------+
|     oculunt|[s, n]|
|predistposed|[s, n]|
| incredulous|[s, n]|
+------------+------+
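To see why the parentheses matter, the check from the traceback can be reproduced without Spark. The sketch below uses simplified stand-in classes, not the real ones from pyspark.sql.types; only the assertion and the two attribute assignments visible in the traceback are copied, and the helper try_array_type is a hypothetical name introduced for illustration. The default containsNull=True is also an assumption.

```python
# Simplified stand-ins for the classes in pyspark.sql.types (NOT the real ones).
# Only the assertion shown in the traceback is reproduced here.
class DataType:
    pass


class StringType(DataType):
    pass


class ArrayType(DataType):
    def __init__(self, elementType, containsNull=True):
        # Same check as in the traceback: the argument must be an *instance*.
        assert isinstance(elementType, DataType), "elementType should be DataType"
        self.elementType = elementType
        self.containsNull = containsNull


def try_array_type(element_type):
    """Hypothetical helper: return 'ok' or the assertion message."""
    try:
        ArrayType(element_type)
        return "ok"
    except AssertionError as exc:
        return str(exc)


# StringType is a class object, so isinstance(StringType, DataType) is False.
class_result = try_array_type(StringType)
# StringType() is an instance of a DataType subclass, so the check passes.
instance_result = try_array_type(StringType())

print(class_result)
print(instance_result)
```

Passing the bare class fails the isinstance check, while passing StringType() passes it, which matches the error in the question and the fix above.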

Source: the Stack Overflow question "python - How to create a udf in PySpark which returns an array of strings?": https://stackoverflow.com/questions/47682927/