apache-spark - Why does selectExpr change the schema (to include the id column)?

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.

Closed 4 years ago.

Update (which makes this a false alarm and the question invalid)

I rebuilt 2.2.0-SNAPSHOT with the latest changes from master, this time without my local changes to def schema in Dataset. It works. Sorry for the noise :(

$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
Branch master
Compiled by user jacek on 2017-03-27T19:00:06Z
Revision 3fada2f502107bd5572fb895471943de7b2c38e4
Url https://github.com/apache/spark.git
Type --help for more information.

scala> spark.range(1).printSchema
root
 |-- id: long (nullable = false)


scala> spark.range(1).selectExpr("*").printSchema
root
 |-- id: long (nullable = false)

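While playing with selectExpr (in 2.2.0-SNAPSHOT built from today's master) I noticed that the schema changes to include the id column. I can't seem to explain it. Anyone?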

I can reproduce it every time I start spark-shell by executing the following:

scala> spark.version
res0: String = 2.2.0-SNAPSHOT

scala> spark.range(1).printSchema
root
 |-- value: long (nullable = true)

scala> spark.range(1).explain(true)
== Parsed Logical Plan ==
Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint
Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*Range (0, 1, step=1, splits=Some(8))

scala> spark.range(1).printSchema
root
 |-- value: long (nullable = true)

scala> spark.range(1).selectExpr("*").printSchema
root
 |-- id: long (nullable = false)

scala> val rangeDS = spark.range(1)
rangeDS: org.apache.spark.sql.Dataset[Long] = [value: bigint]

scala> rangeDS.selectExpr("*").printSchema
root
 |-- id: long (nullable = false)

P.S. It looks like I cannot reproduce it in 2.1.0.

$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
Branch master
Compiled by user jacek on 2017-03-27T03:43:09Z
Revision 3fbf0a5f9297f438bc92db11f106d4a0ae568613
Url https://github.com/apache/spark.git
Type --help for more information.

Best answer

I'd say the answer lies in the source code: for every "expression" you pass to selectExpr, the function creates a new column and then adds the original columns:

def selectExpr(exprs: String*): DataFrame = {
    select(exprs.map { expr =>
      Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
    }: _*)
}

If you look at the select function:

def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)

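You'll see that it concatenates the new columns obtained from the SQL expressions and creates a new DataFrame containing them, along with the columns from the original DataFrame.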

Edit: I tried it with 2.2.0 and got:

res7: String = 2.2.0
root
 |-- id: long (nullable = false)
root
 |-- id: long (nullable = false)
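
To check the same thing on your own build, here is a minimal sketch. It assumes Spark 2.x or later is on the classpath and uses a local master; the object name SelectExprSchemaCheck is made up for illustration. It compares the schema of spark.range(1) with the schemas produced by selectExpr("*") and by select(expr("*")), which is the closest public equivalent of what selectExpr does internally. On an unmodified build all three should agree.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Hypothetical helper object, not part of Spark.
object SelectExprSchemaCheck {
  def main(args: Array[String]): Unit = {
    // Local session purely for the check (assumption: no cluster needed).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("selectExpr-schema-check")
      .getOrCreate()

    val ds = spark.range(1)

    // selectExpr parses each string into a Column and passes it to select.
    val viaSelectExpr = ds.selectExpr("*")
    // expr(...) parses the string the same way, so this should be equivalent.
    val viaSelect = ds.select(expr("*"))

    ds.printSchema()
    viaSelectExpr.printSchema()
    viaSelect.printSchema()

    // Fails loudly if the build reports different schemas, as in the question.
    assert(viaSelectExpr.schema == viaSelect.schema)
    assert(ds.schema == viaSelectExpr.schema)

    spark.stop()
  }
}
```

If the last assertion fails, the difference comes from Dataset.schema itself rather than from selectExpr, which matches what the update to the question found.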

Regarding "apache-spark - Why does selectExpr change the schema (to include the id column)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43041975/
