I am reading a Hive table that has two columns: id and jsonString. I can easily convert jsonString into a Spark data structure by calling the spark.read.json function, but I have to add the id column as well.
val jsonStr1 = """{"fruits":[{"fruit":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr2 = """{"fruits":[{"dt":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
val jsonStr3 = """{"fruits":[{"a":"banana"},{"fruid":"apple"},{"fruit":"pera"}],"bar":{"foo":"[\"daniel\",\"pedro\",\"thing\"]"},"daniel":"daniel data random","cars":["montana","bagulho"]}"""
case class Foo(id: Integer, json: String)
val ds = Seq(Foo(1, jsonStr1), Foo(2, jsonStr2), Foo(3, jsonStr3)).toDS
val jsonDF = spark.read.json(ds.select($"json").rdd.map(r => r.getAs[String](0)).toDS)
jsonDF.show()
+--------------------+------------------+------------------+--------------------+
| bar| cars| daniel| fruits|
+--------------------+------------------+------------------+--------------------+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|
+--------------------+------------------+------------------+--------------------+
I want to add the id column from the Hive table, like this:
+--------------------+------------------+------------------+--------------------+---+
|                 bar|              cars|            daniel|              fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...|  1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...|  2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...|  3|
+--------------------+------------------+------------------+--------------------+---+
I don't want to use regular expressions.
I created a UDF that takes these two fields as parameters, uses a proper JSON library to include the desired field (id), and returns a new JSON string. It works like a charm, but I was hoping the Spark API offered a better way to do this. I am using Apache Spark 2.3.0.
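The core of such a UDF can be sketched as a simple string splice. This is a minimal illustration, not the asker's actual implementation: it assumes the input is a non-empty JSON object and does no escaping, whereas a real UDF should build the new document with a JSON library (e.g. json4s or play-json) as the asker did.

```scala
// Hypothetical core of the UDF: splice an "id" field into a JSON object string.
// Naive sketch: assumes a well-formed, non-empty JSON object as input.
def addIdField(json: String, id: Int): String = {
  val insertAt = json.lastIndexOf('}')       // position of the final closing brace
  json.patch(insertAt, s""","id":$id""", 0)  // insert ,"id":<id> just before it
}
```

In Spark this would then be wrapped with `udf(addIdField _)` and applied to both columns, after which `spark.read.json` infers a schema that already contains `id`.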
Best answer
I already knew about the from_json function, but in my case manually writing out the schema of every JSON would be "impossible". I thought Spark would have a more "idiomatic" interface.
Here is my final solution:
ds.select($"id", from_json($"json", jsonDF.schema).alias("_json_path")).select($"_json_path.*", $"id").show
+--------------------+------------------+------------------+--------------------+---+
| bar| cars| daniel| fruits| id|
+--------------------+------------------+------------------+--------------------+---+
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[,,, banana], [,...| 1|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[, banana,,], [,...| 2|
|[["daniel","pedro...|[montana, bagulho]|daniel data random|[[banana,,,], [,,...| 3|
+--------------------+------------------+------------------+--------------------+---+
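As a side note (an observation about the Spark 2.x API, not part of the original answer): since Spark 2.2, spark.read.json also accepts a Dataset[String] directly, so the schema-inference step does not need the `.rdd.map(...)` detour used in the question. A sketch of the full flow, assuming `ds: Dataset[Foo]` as defined above:

```scala
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Infer the schema once from the json column (Dataset[String] overload, Spark 2.2+),
// then reuse it with from_json so the id column survives.
val inferredSchema = spark.read.json(ds.select($"json").as[String]).schema

val result = ds
  .select($"id", from_json($"json", inferredSchema).alias("parsed"))
  .select($"parsed.*", $"id")
```

This parses each JSON string only once at query time, while the inference pass over the data happens a single time up front.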
Regarding "json - Apache Spark read JSON with extra columns", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55074331/