dataframe - Changing a DataFrame's schema to a different schema

Tags: dataframe apache-spark apache-spark-sql pyspark

I have a DataFrame whose schema looks like this:

df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- foo01: string (nullable = true)
 |    |-- bar01: string (nullable = true)
 |    |-- foo02: string (nullable = true)
 |    |-- bar02: string (nullable = true)

I want to change it to:

root
 |-- id: integer (nullable = true)
 |-- foo: struct (nullable = true)
 |    |-- foo01: string (nullable = true)
 |    |-- foo02: string (nullable = true)
 |-- bar: struct (nullable = true)
 |    |-- bar01: string (nullable = true)
 |    |-- bar02: string (nullable = true)

What is the best way to do this?

Best answer

You can simply use PySpark's struct function to rebuild the nested columns:

from pyspark.sql.functions import struct

new_df = df.select(
  'id',
  struct('data.foo01', 'data.foo02').alias('foo'),
  struct('data.bar01', 'data.bar02').alias('bar'),
)
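If the data struct has many fields, listing each one by hand gets tedious. The following is only a sketch of one way to derive the groups from the schema, assuming every nested field name is an alphabetic prefix followed by digits (as in foo01 and bar02). The helper names group_fields_by_prefix and regroup_data_struct are hypothetical and not part of PySpark.

```python
import re
from collections import defaultdict


def group_fields_by_prefix(field_names):
    """Group names such as 'foo01' under their alphabetic prefix ('foo')."""
    groups = defaultdict(list)
    for name in field_names:
        # Assumption: names look like <letters><digits>, e.g. 'foo01'.
        match = re.match(r"([A-Za-z_]+)\d+$", name)
        if match is None:
            raise ValueError(f"unexpected field name: {name!r}")
        groups[match.group(1)].append(name)
    return dict(groups)


def regroup_data_struct(df, source="data"):
    """Return df with one struct column per prefix found inside `source`."""
    # Imported here so the pure-Python helper above can be used without Spark.
    from pyspark.sql.functions import struct

    # Field names of the nested struct, read from the DataFrame's schema.
    nested_names = df.schema[source].dataType.fieldNames()
    groups = group_fields_by_prefix(nested_names)

    # Keep every top-level column except the struct being split up.
    other_columns = [c for c in df.columns if c != source]
    return df.select(
        *other_columns,
        *[
            struct(*[f"{source}.{name}" for name in names]).alias(prefix)
            for prefix, names in groups.items()
        ],
    )
```

With the schema from the question, new_df = regroup_data_struct(df) should build the same foo and bar structs as the explicit select above.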

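An additional note on the struct function: it accepts either a list of column names as strings, which simply moves those columns into a struct, or a list of Column expressions if you need to transform or rename the fields.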
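As a rough illustration of the expression form, the sketch below renames one nested field and upper-cases another while building the struct. The function name regroup_with_expressions and the field names first and second_upper are made up for this example; df is assumed to have the schema shown in the question.

```python
def regroup_with_expressions(df):
    """Build the foo/bar structs from Column expressions instead of names."""
    # Imported inside the function so the module loads without a Spark install.
    from pyspark.sql.functions import col, struct, upper

    return df.select(
        "id",
        struct(
            # Rename a nested field while moving it into the new struct.
            col("data.foo01").alias("first"),
            # Transform a value and give the result an explicit field name.
            upper(col("data.foo02")).alias("second_upper"),
        ).alias("foo"),
        # Plain string column names still work alongside the expression form.
        struct("data.bar01", "data.bar02").alias("bar"),
    )
```

Calling regroup_with_expressions(df) would return a DataFrame whose foo struct contains the fields first and second_upper, while bar keeps bar01 and bar02 unchanged.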

This question and answer about changing a DataFrame's schema to a different schema comes from Stack Overflow: https://stackoverflow.com/questions/64335890/
