apache-spark - Filtering a column with two different schemas in Spark Scala

Tags: apache-spark apache-spark-sql

I have a dataframe with three columns: ID, CO_ID, and DATA, where the DATA column comes in two different schemas, as shown below:

|ID  |CO_ID |Data
|130 |NA    | [{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]
|536 |NA    | [{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]   
|518 |NA    | null
|938 |611   | {"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}                                                                                                                           
|742 |NA    | {"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}

Now I want to create a dataframe with the columns ID, CO_ID, NUMBER, ADDRESS, and NAME. Where no value is present, NUMBER, ADDRESS, and NAME should be filled with null.

First I need to filter the dataframe above by the two different schemas. How can I do that?

Best Answer

Here is one approach:

import spark.implicits._  // needed for toDF and the $"col" syntax

val df = Seq(
  (130, "NA", """[{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 231, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]"""),
  (536, "NA", """[{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 232, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]"""),
  (518, "NA", null),
  (938, "611", """{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 233, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}"""),
  (742, "NA", """{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 234, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}"""))
  .toDF("ID", "CO_ID", "Data")
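As an aside, if you want to literally split the dataframe into the two shapes first, as the question asks, a minimal sketch (assuming array-shaped rows always start with "[{"; arrayRows/objectRows are illustrative names):

// Rows whose Data string is array-shaped
val arrayRows = df.filter($"Data".startsWith("[{"))

// Everything else; the isNull check keeps the null rows, since
// startsWith on a null column yields null and filter drops nulls
val objectRows = df.filter($"Data".isNull || !$"Data".startsWith("[{"))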


import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{from_json, when, length, lit}

val schema = (new StructType)
   .add("NUMBER", "string", true)
   .add("ADDRESS", "string", true)
   .add("PHONE", "string", true)
   .add("NAME", "string", true)

val df_ar = df.withColumn("json", 
                       when($"data"
                         .startsWith("[{") && $"data".endsWith("}]"), $"data".substr(lit(2), length($"data") - 2))
                         .otherwise($"data")) //checks whether data start with '[{' and ends with '}]' if it does removes []
              .withColumn("json", from_json($"json", schema)) //covert to JSON based on given schema
              .withColumn("number", $"json.NUMBER")
              .withColumn("address", $"json.ADDRESS")
              .withColumn("name", $"json.NAME")

df_ar.select("ID", "CO_ID", "number", "address", "name").show(false)

This solution first strips the surrounding [] from the array-shaped JSON strings, then applies the given schema to convert the JSON string into a StructType column.

Output:

+---+-----+------+-----------------------+------+
|ID |CO_ID|number|address                |name  |
+---+-----+------+-----------------------+------+
|130|NA   |AW9F  |PLOT NO. 231, JAIPUR RJ|SACHIN|
|536|NA   |AW9F  |PLOT NO. 232, JAIPUR RJ|SACHIN|
|518|NA   |null  |null                   |null  |
|938|611  |AW9F  |PLOT NO. 233, JAIPUR RJ|SACHIN|
|742|NA   |AW9F  |PLOT NO. 234, JAIPUR RJ|SACHIN|
+---+-----+------+-----------------------+------+
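An alternative that avoids the string surgery is to parse each shape with its own schema and coalesce the results, since from_json yields null when a string does not match the schema it is given. A minimal sketch, assuming Spark 2.4+ for element_at (arr/obj/parsed are illustrative names):

import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions.{coalesce, element_at}

val parsed = df
  .withColumn("arr", element_at(from_json($"Data", ArrayType(schema)), 1)) // first element when Data parses as an array
  .withColumn("obj", from_json($"Data", schema))                           // direct parse for object-shaped rows
  .withColumn("json", coalesce($"arr", $"obj"))                            // whichever parse produced a struct
  .select($"ID", $"CO_ID",
          $"json.NUMBER".as("number"),
          $"json.ADDRESS".as("address"),
          $"json.NAME".as("name"))

parsed.show(false)

This should produce the same output as above, with the null Data row yielding null in all three extracted columns.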

Regarding apache-spark - filtering a column with two different schemas in Spark Scala, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55947431/
