scala - Spark : Replace Null value in a Nested column

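Tags: scala apache-spark apache-spark-sql

I want to replace every n/a value in the DataFrame below with unknown. The value may sit in a scalar column or in a complex nested column. For a plain StructField column I can loop over the columns and replace n/a with withColumn, but I would like a generic approach that works regardless of the column's type. I don't want to name the columns explicitly, because in my case there are about 100 of them.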

case class Bar(x: Int, y: String, z: String)
case class Foo(id: Int, name: String, status: String, bar: Seq[Bar])

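import spark.implicits._  // required for .toDF outside spark-shell

val df = spark.sparkContext.parallelize(
  Seq(
    Foo(123, "Amy", "Active", Seq(Bar(1, "first", "n/a"))),
    Foo(234, "Rick", "n/a", Seq(Bar(2, "second", "fifth"), Bar(22, "second", "n/a"))),
    // Note: the status here is the literal string "null", not a null reference.
    Foo(567, "Tom", "null", Seq(Bar(3, "second", "sixth")))
  )).toDF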

df.printSchema
df.show(20, false)

Result:

+---+----+------+---------------------------------------+
|id |name|status|bar                                    |
+---+----+------+---------------------------------------+
|123|Amy |Active|[[1, first, n/a]]                      |
|234|Rick|n/a   |[[2, second, fifth], [22, second, n/a]]|
|567|Tom |null  |[[3, second, sixth]]                   |
+---+----+------+---------------------------------------+   

Expected output:

+---+----+----------+---------------------------------------------------+
|id |name|status    |bar                                                |
+---+----+----------+---------------------------------------------------+
|123|Amy |Active    |[[1, first, unknown]]                              |
|234|Rick|unknown   |[[2, second, fifth], [22, second, unknown]]        |
|567|Tom |null      |[[3, second, sixth]]                               |
+---+----+----------+---------------------------------------------------+

Any suggestions?

Best answer

If you don't mind dropping down to the RDD API, here is a simple, generic, and extensible solution:

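  import org.apache.spark.sql.Row

  // Rebuilds each Row, replacing the string "n/a" at any nesting depth.
  val naToUnknown = { r: Row =>
    def rec(value: Any): Any = value match {
      case row: Row  => Row.fromSeq(row.toSeq.map(rec)) // nested struct
      case seq: Seq[_] => seq.map(rec)                  // array column
      case "n/a"     => "unknown"
      case other     => other                           // everything else, including null
    }
    Row.fromSeq(r.toSeq.map(rec))
  }

  // The schema is unchanged, so the original one can be reused.
  val newDF = spark.createDataFrame(df.rdd.map(naToUnknown), df.schema)
  newDF.show(false)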

Output:

+---+----+-------+-------------------------------------------+
|id |name|status |bar                                        |
+---+----+-------+-------------------------------------------+
|123|Amy |Active |[[1, first, unknown]]                      |
|234|Rick|unknown|[[2, second, fifth], [22, second, unknown]]|
|567|Tom |null   |[[3, second, sixth]]                       |
+---+----+-------+-------------------------------------------+
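
For readers who would rather stay in the DataFrame API, the same recursive idea can be driven by the schema instead of by the row values. The following is only a sketch, not part of the original answer: it assumes Spark 3.0 or later (for the `transform` higher-order function), the names `NaReplacer`, `replaceNa`, and `replaceNaEverywhere` are hypothetical, map columns are passed through untouched, and top-level column names are assumed to contain no dots or other characters that `col` would interpret.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, struct, transform, when}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType, StructType}

// Hypothetical helper object; assumes Spark 3.0+ for functions.transform.
object NaReplacer {

  // Rebuilds a column expression, rewriting "n/a" strings at any depth.
  def replaceNa(c: Column, dt: DataType): Column = dt match {
    case StringType =>
      // A null string fails the comparison and falls through to otherwise(c).
      when(c === "n/a", lit("unknown")).otherwise(c)
    case st: StructType =>
      val fields = st.fields.map(f => replaceNa(c.getField(f.name), f.dataType).as(f.name))
      // Keep null structs null instead of turning them into structs of nulls.
      when(c.isNotNull, struct(fields: _*))
    case ArrayType(elementType, _) =>
      transform(c, (x: Column) => replaceNa(x, elementType))
    case _ =>
      c // numeric, map, and other types are left as they are
  }

  // Applies replaceNa to every top-level column, keeping the column names.
  def replaceNaEverywhere(df: DataFrame): DataFrame =
    df.select(df.schema.fields.map(f => replaceNa(col(f.name), f.dataType).as(f.name)): _*)
}
```

Calling `NaReplacer.replaceNaEverywhere(df).show(false)` should then give the same values as the RDD version, although the nullability flags in the resulting schema may differ from the original because the structs are rebuilt inside `when`.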

For "scala - Spark : Replace Null value in a Nested column", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59536407/
