apache-spark - Converting a list of strings to a binary list in PySpark

I have a DataFrame like this:

data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])), 
    (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)

+---+----------------------------+
|ID |MonthList                   |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May]         |
|ID3|[October, June]             |
+---+----------------------------+

I want to compare each row against a default list, assigning 1 for every default value that is present in the row and 0 otherwise:

default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

So my expected output is:

+---+----------------------------+------------------+
|ID |MonthList                   |Binary_MonthList  |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May]         |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June]             |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+

I can do this in plain Python, but I don't know how to do it in PySpark.
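
For reference, the comparison described above could look roughly like this in plain Python. This is only a sketch of the intended logic: the helper name `to_binary_list` and the `rows` dictionary are hypothetical and do not come from the original post.

```python
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def to_binary_list(months, reference):
    """Return one 0/1 flag per reference month, in reference order."""
    present = set(months)  # set membership avoids rescanning the list each time
    return [1 if month in present else 0 for month in reference]

# Hypothetical stand-in for the DataFrame rows shown above.
rows = {
    "ID1": ['October', 'September', 'August'],
    "ID2": ['August', 'June', 'May'],
    "ID3": ['October', 'June'],
}

binary = {row_id: to_binary_list(months, default_month_list)
          for row_id, months in rows.items()}
print(binary["ID2"])  # prints [0, 0, 1, 0, 1, 1]
```

The position of each flag follows `default_month_list`, not the order of the months inside the row.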

Best answer

You can try a udf like this:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def_month_list_func = udf(lambda x: [1 if i in x else 0 for i in default_month_list], ArrayType(IntegerType()))

df = df.withColumn("Binary_MonthList", def_month_list_func(col("MonthList")))

df.show()
# output
+---+--------------------+------------------+
| ID|           MonthList|  Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3|     [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+

Regarding apache-spark - converting a list of strings to a binary list in PySpark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58303468/
