apache-spark - Spark : create a nested schema

Tags: apache-spark dataframe apache-spark-sql schema

With Spark,

import spark.implicits._

val data = Seq(
  (1, ("value11", "value12")),
  (2, ("value21", "value22")),
  (3, ("value31", "value32"))
)

val df = data.toDF("id", "v1")
df.printSchema()

the result is:

root
|-- id: integer (nullable = false)
|-- v1: struct (nullable = true)
|    |-- _1: string (nullable = true)
|    |-- _2: string (nullable = true)

Now, if I want to create the schema myself, how should I go about it?

val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", ???)
))

Thanks.

Accepted answer

Following the example here: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/types/StructType.html

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val innerStruct =
  StructType(
    StructField("f1", IntegerType, true) ::
    StructField("f2", LongType, false) ::
    StructField("f3", BooleanType, false) :: Nil)

val struct = StructType(
  StructField("a", innerStruct, true) :: Nil)

// Create a Row matching the schema defined by struct.
// Note: f2 is a LongType, so the literal must be a Long (2L), not an Int.
val row = Row(Row(1, 2L, true))
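To see the schema in action, the row can be turned into a one-row DataFrame. This is a sketch that assumes an active `SparkSession` named `spark`:

```scala
// Sketch (assumes an active SparkSession `spark`): build a DataFrame
// from the single Row above, using `struct` as the explicit schema.
// f2 is declared LongType, so the value must be a Scala Long (2L).
val nestedRow = Row(Row(1, 2L, true))
val nestedDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(nestedRow)),
  struct)
nestedDf.printSchema()
```

`printSchema()` should show a single struct column `a` with fields `f1`, `f2`, and `f3`, mirroring `innerStruct`.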

In your case, it would be:

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", StructType(Array(
      StructField("value1", StringType),
      StructField("value2", StringType)
  )))
))

Output:

StructType(
  StructField(id,IntegerType,true), 
  StructField(nested,StructType(
    StructField(value1,StringType,true), 
    StructField(value2,StringType,true)
  ),true)
)
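With the schema in hand, one way to attach it to the original data is to build `Row` objects whose nesting matches it and pass them to `spark.createDataFrame`. A sketch, assuming an active `SparkSession` named `spark`:

```scala
// Sketch (assumes an active SparkSession `spark` and the `schema` defined above):
// each outer Row holds an id and an inner Row matching the "nested" struct.
import org.apache.spark.sql.Row

val rows = Seq(
  Row(1, Row("value11", "value12")),
  Row(2, Row("value21", "value22")),
  Row(3, Row("value31", "value32"))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows),
  schema)
df.printSchema()
```

Unlike `toDF`, which infers tuple field names like `_1` and `_2`, this route gives the nested fields the explicit names `value1` and `value2` from the schema.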

Regarding "apache-spark - Spark: create a nested schema", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57079343/
