I often use window functions in Apache Spark, for example to compute running sums. So far I have never specified a frame, because the output was correct anyway. But recently I read the following in a blog post (https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html):
In addition to the ordering and partitioning, users need to define the start boundary of the frame, the end boundary of the frame, and the type of the frame, which are three components of a frame specification.
So I am wondering whether it is safe to use an unspecified frame, for example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df = (1 to 10000).toDF("i")

df
  .select(
    $"i",
    sum($"i").over(Window.orderBy($"i")).as("running_sum1"), // unspecified frame
    sum($"i").over(Window.orderBy($"i").rowsBetween(Window.unboundedPreceding, Window.currentRow)).as("running_sum2") // specified frame
  )
  .show()
+---+------------+------------+
| i|running_sum1|running_sum2|
+---+------------+------------+
| 1| 1| 1|
| 2| 3| 3|
| 3| 6| 6|
| 4| 10| 10|
| 5| 15| 15|
| 6| 21| 21|
| 7| 28| 28|
| 8| 36| 36|
| 9| 45| 45|
| 10| 55| 55|
| 11| 66| 66|
| 12| 78| 78|
| 13| 91| 91|
| 14| 105| 105|
| 15| 120| 120|
| 16| 136| 136|
| 17| 153| 153|
| 18| 171| 171|
| 19| 190| 190|
| 20| 210| 210|
+---+------------+------------+
Apparently they give the same output, but are there cases where using an unspecified frame is dangerous? I am using Spark 2.x, by the way.
Best answer
Yes, it is safe.
Looking at the source of the Window object on the master branch on GitHub, there is the following comment (not present in the 2.3.0 branch):
When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
In other words, when there is an ordering on the window, i.e. when orderBy is used, leaving the frame unspecified is equivalent to:

rangeBetween(Window.unboundedPreceding, Window.currentRow)

which behaves the same as rowsBetween(Window.unboundedPreceding, Window.currentRow) as long as the ordering key contains no duplicates (for tied ordering values, a range frame includes all peer rows).

Without orderBy, the default is an entirely unbounded window:

rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
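One caveat worth keeping in mind: because the default ordered frame is a range frame, it treats rows with equal ordering values as peers and includes all of them, while an explicit ROWS frame does not. A minimal sketch of the difference, assuming a local session (the column name v and the sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("frame-demo").getOrCreate()
import spark.implicits._

// Duplicate ordering keys: the value 2 appears twice.
val df = Seq(1, 2, 2, 3).toDF("v")

df.select(
  $"v",
  // Default frame (RANGE ... CURRENT ROW): both rows with v = 2 are peers,
  // so each of them sees the full sum of the ties: 1 + 2 + 2 = 5.
  sum($"v").over(Window.orderBy($"v")).as("range_sum"),
  // Explicit ROWS frame: each row sees only the physical rows up to itself,
  // so the running sum is 1, 3, 5, 8.
  sum($"v").over(Window.orderBy($"v").rowsBetween(Window.unboundedPreceding, Window.currentRow)).as("rows_sum")
).show()
```

So the two spellings agree on the question's data only because the column i has no duplicates; with ties, the unspecified (range) frame and the explicit rows frame diverge.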
Further investigation shows that these defaults have been in place ever since window functions were introduced in Spark 1.4.0 (see the relevant GitHub branch):
def defaultWindowFrame(
    hasOrderSpecification: Boolean,
    acceptWindowFrame: Boolean): SpecifiedWindowFrame = {
  if (hasOrderSpecification && acceptWindowFrame) {
    // If order spec is defined and the window function supports user specified window frames,
    // the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    SpecifiedWindowFrame(RangeFrame, UnboundedPreceding, CurrentRow)
  } else {
    // Otherwise, the default frame is
    // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
    SpecifiedWindowFrame(RowFrame, UnboundedPreceding, UnboundedFollowing)
  }
}
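As a sanity check, the frame Spark substitutes for an unspecified one can be inspected in the analyzed logical plan of a query; with an orderBy present, the plan text should mention a range frame (the exact rendering varies between Spark versions). A sketch, assuming a local session:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("frame-inspect").getOrCreate()
import spark.implicits._

val df = (1 to 10).toDF("i")

// Render the analyzed logical plan as text; the window spec Spark fills in
// for an unspecified frame on an ordered window should show up as a range
// frame (e.g. "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW").
val plan = df
  .select(sum($"i").over(Window.orderBy($"i")).as("s"))
  .queryExecution.analyzed.toString

println(plan)
```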
The original question, "scala - Is it safe to use WindowSpec with an undefined frame in Spark?", can be found on Stack Overflow: https://stackoverflow.com/questions/50637144/