apache-spark - 在 Spark 中使用窗口函数

我正在尝试在 Spark 数据帧中使用 rowNumber。我的查询在 Spark shell 中按预期工作。但是当我在 eclipse 中写出它们并编译一个 jar 时，我遇到了一个错误

 16/03/23 05:52:43 ERROR ApplicationMaster: User class threw exception:org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;

我的疑问

import org.apache.spark.sql.functions.{rowNumber, max, broadcast}
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"value".desc)

val dfTop = df.withColumn("rn", rowNumber.over(w)).where($"rn" <= 3).drop("rn")

在 Spark shell 中运行查询时，我没有使用 HiveContext。不知道为什么当我像 jar 文件一样运行时它会返回错误。如果有帮助，我还在 Spark 1.6.0 上运行脚本。有没有人遇到过类似的问题？

最佳答案

我已经回答了 similar question前。错误信息说明了一切。与 spark < 版本 2.x ，你需要一个 HiveContext在您的应用程序 jar 中，别无他法。

您可以进一步了解 SQLContext 和 HiveContext 之间的区别 here .
SparkSQL有一个 SQLContext和一个 HiveContext . HiveContext是SQLContext的超集. Spark 社区建议使用 HiveContext .您可以看到，当您运行交互式驱动程序应用程序 spark-shell 时，它会自动创建一个 SparkContext定义为 sc 和 HiveContext定义为 sqlContext . HiveContext允许您执行 SQL 查询以及 Hive 命令。

您可以尝试检查 spark-shell 的内部:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)

scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
res0: Boolean = true

scala> sqlContext.isInstanceOf[org.apache.spark.sql.SQLContext]
res1: Boolean = true

scala> sqlContext.getClass.getName
res2: String = org.apache.spark.sql.hive.HiveContext

通过继承，HiveContext实际上是一个 SQLContext ，但反过来就不是这样了。您可以查看 source code如果您更想知道如何做 HiveContext继承自 SQLContext .

自 Spark 2.0 ，您只需要创建一个 SparkSession (作为单个入口点)删除 HiveContext/SQLContext混淆问题。

关于apache-spark - 在 Spark 中使用窗口函数，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/36171349/

apache-spark - 在 Spark 中使用窗口函数

上一篇：web-services - 编写安全的 REST Web 服务

下一篇：eclipse - 通过Maven为Eclipse设置vm默认参数