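python-3.x - Handling null values in a PySpark DataFrame

I have a PySpark DataFrame and want to take a substring of the values in one column. That column also contains some null values. Here is my DataFrame: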

+--------------+
|          Name|
+--------------+
| Asia201909284|
|    US20190928|
|Europ201909287|
|          null|
|     something|
|       nothing|
+--------------+

I want to remove the Asia, US, and Europ prefixes from the Name column.

Here is the code I have tried so far.

from pyspark.sql.functions import col, udf, when

fun_asia = udf(lambda x: x[4:len(x)])
fun_us = udf(lambda x: x[2:len(x)])
fun_europ = udf(lambda x: x[5:len(x)])
df1.withColumn("replace", \
               when(df1.Name.isNull(),df1.Name)\
               .when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .otherwise(df1.Name)
              ).show()

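This works fine if the column has no null values. If it does contain null values, it fails with an error saying that len() cannot be computed on a null value.

Error message

TypeError: object of type 'NoneType' has no len()

I am confused about why the function is also called for null values. How can I work around this and get the result I want? Any help is appreciated.

The actual result I want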

+--------------+---------+
|          Name|  replace|
+--------------+---------+
| Asia201909284|201909284|
|    US20190928| 20190928|
|Europ201909287|201909287|
|          null|     null|
|     something|something|
|       nothing|  nothing|
+--------------+---------+

Best answer

One approach is to use when with an isNull() condition to handle the case where the column is null:

df1.withColumn("replace", \
               when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .when(df1.Name.isNull(), df1.Name)
               .otherwise(df1.Name)
              ).show()

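Edit 2:

You can change the UDFs to handle null values themselves. Spark does not guarantee short-circuiting around UDF calls inside conditional expressions, so the UDFs may still be invoked on null rows even when an isNull() branch is present: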

fun_asia = udf(lambda x: x[4:len(x)] if x else None)
fun_us = udf(lambda x: x[2:len(x)] if x else None)
fun_europ = udf(lambda x: x[5:len(x)] if x else None)
df1.withColumn("replace", \
               when(df1.Name.isNull(),df1.Name)\
               .when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .otherwise(df1.Name)
              ).show()
+--------------+---------+
|          Name|  replace|
+--------------+---------+
| Asia201909284|201909284|
|    US20190928| 20190928|
|Europ201909287|201909287|
|          null|     null|
|     something|something|
|       nothing|  nothing|
+--------------+---------+
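
As a possible alternative to the UDFs above, here is a minimal sketch that strips the prefixes with Spark's built-in regexp_replace, which returns null for null input and so needs no explicit null branch. The names PREFIX_PATTERN, strip_prefix and with_replace_column are hypothetical helpers introduced for illustration, and the sketch assumes Asia, US and Europ are the only prefixes to remove. The plain-Python strip_prefix only mirrors the pattern so it can be checked without a Spark session.

```python
import re

# Prefixes to strip, anchored at the start of the string.
# Assumption: these three prefixes (from the question's data) are the only ones.
PREFIX_PATTERN = r"^(Asia|US|Europ)"


def strip_prefix(name):
    """Plain-Python mirror of the Spark expression, for quick local checks."""
    if name is None:
        return None
    return re.sub(PREFIX_PATTERN, "", name)


def with_replace_column(df):
    """Return df with a 'replace' column computed by regexp_replace (no UDF)."""
    # Imported lazily so strip_prefix can be tried without PySpark installed.
    from pyspark.sql import functions as F

    return df.withColumn(
        "replace", F.regexp_replace(F.col("Name"), PREFIX_PATTERN, "")
    )
```

With the DataFrame from the question, with_replace_column(df1).show() should print a table like the one above. Because no Python function runs per row, rows with a null Name simply stay null in the new column.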

This question about python-3.x - handling null values in a PySpark DataFrame comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/58424076/
