I have the following DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Sample rows; the last row has a non-numeric id ("Rose")
val simpleData = Seq(
  Row("James ", "", "Smith", "36636", "M", 3000),
  Row("Michael ", "Rose", "", "40288", "M", 4000),
  Row("Robert ", "", "Williams", "42114", "M", 4000),
  Row("Maria ", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "Rose", "F", -1)
)
val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData), simpleSchema)
df.show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown| Rose|     F|    -1|
+---------+----------+--------+-----+------+------+
I am running the sample code below, where I want to cast a string column to an integer.
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")
df2.show()
+-----+
|   id|
+-----+
|36636|
|40288|
|42114|
|39192|
| null|
+-----+
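Here all the integer values are converted correctly, but "Rose" is converted to null.
Can you help me figure out how to throw an exception when there are bad records? Is there a Spark configuration setting for this?
Also, if the query contains multiple casts, how can I get the exact name of the column where the problem occurred?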
Best answer
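Unless ANSI mode is enabled, Spark does not throw an exception when a cast fails; the invalid value simply becomes null.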
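On the configuration question, one option worth trying is the `spark.sql.ansi.enabled` setting, available since Spark 3.0. The sketch below is not part of the original answer: it builds its own local session and a small `EMP` view (both illustrative assumptions) and runs the same cast with the setting off and then on. Exact error messages vary between Spark versions, so none are shown.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ansi-cast-sketch")
  .getOrCreate()
import spark.implicits._

// Illustrative data: two valid ids and one bad record.
Seq("36636", "40288", "Rose").toDF("id").createOrReplaceTempView("EMP")

// ANSI mode off: the invalid cast silently yields null.
spark.conf.set("spark.sql.ansi.enabled", "false")
val lenient = spark.sql("select cast(id as INT) as id from EMP").collect()

// ANSI mode on: the same cast is expected to fail when the query executes.
spark.conf.set("spark.sql.ansi.enabled", "true")
val strict = Try(spark.sql("select cast(id as INT) as id from EMP").collect())

println(s"null results with ANSI off: ${lenient.count(_.isNullAt(0))}")
println(s"query failed with ANSI on: ${strict.isFailure}")
```

ANSI mode also changes other behaviour, such as arithmetic overflow handling, so it is probably safest to try it on a test job before switching it on globally.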
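As a custom way to catch these errors, you could write a UDF that throws an exception whenever the cast would yield null. However, this will degrade the performance of your job, because Spark cannot optimize UDF execution.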
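A minimal sketch of such a UDF follows. The name `strictInt`, the local session, and the sample data are hypothetical, and parsing with `toInt` is an assumption that does not match Spark's own cast rules in every case. Passing the column name as a second argument puts it in the error message, which helps when a query contains several casts.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}
import scala.util.Try

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("strict-cast-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: parses a string as Int and throws, naming the column, on failure.
// A null input stays null (None) instead of raising an error.
val strictInt = udf { (value: String, column: String) =>
  Option(value).map { v =>
    try v.trim.toInt
    catch {
      case _: NumberFormatException =>
        throw new IllegalArgumentException(s"Cannot cast '$v' to INT in column '$column'")
    }
  }
}

// Clean data passes through unchanged.
val good = Seq("36636", "40288").toDF("id")
  .withColumn("id", strictInt(col("id"), lit("id")))
  .collect()

// A bad record makes the action fail instead of producing null.
val bad = Seq(("James", "36636"), ("Jen", "Rose")).toDF("firstname", "id")
  .withColumn("id", strictInt(col("id"), lit("id")))
val result = Try(bad.collect())
result.failed.foreach(e => println(s"Cast failed: ${e.getMessage}"))
```

If the UDF overhead is a concern, a UDF-free check is to filter each column with `col(c).isNotNull && col(c).cast("int").isNull` and fail the job when any rows match; that also tells you which column contains the bad values.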
A similar question, "scala - Spark SQL converts bad records to null during data type casts", can be found on Stack Overflow: https://stackoverflow.com/questions/70111987/