scala - Sharing HDInsight SPARK SQL table: saveAsTable does not work

Tags: scala apache-spark apache-spark-sql tableau-api azure-hdinsight

I want to use Tableau to visualize data coming from HDInsight Spark. I was following this video, in which they describe how to connect the two systems and expose the data.

At the moment my script itself is very simple, as shown below:

// csvLines is an RDD of strings, one per line of the CSV file
val csvLines = sc.textFile("wasb://mycontainer@mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")

// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)

// Map the values in the .csv file to the schema, skipping the header row
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
    s => MyData(s(0),
            s(1),
            s(2),
            s(3),
            s(4).toDouble,
            s(5)
    )
).toDF()
// Register as a temporary table called "test_table"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")

Unfortunately I run into the following error:

warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)

I also tried to overwrite the table, if it exists, using the code below:

import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)

But it still fails, this time with a different error:

warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

Can someone please help me fix this issue?

Best Answer

I know this was my own mistake, but I'll leave it here as an answer because it isn't easy to find in any blog post or forum answer. Hopefully it helps someone who is just getting started with Spark, like me.

I figured out that .toDF() actually creates the DataFrame using the sqlContext rather than the hiveContext, and persistent (non-temporary) tables require a HiveContext. So I have now updated my code as follows:
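For completeness (this setup isn't shown in the original answer, so treat it as a sketch assuming Spark 1.x, where sc is the predefined SparkContext): if your shell or notebook does not already provide a hiveContext, it can be created like this:

import org.apache.spark.sql.hive.HiveContext

// Unlike a plain SQLContext, a HiveContext can create persistent (non-temporary) tables
val hiveContext = new HiveContext(sc)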

// Map the values in the .csv file to the schema, skipping the header row
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
    s => MyData(s(0),
            s(1),
            s(2),
            s(3),
            s(4).toDouble,
            s(5)
    )
)
// Create the DataFrame from the HiveContext instead of calling .toDF()
val myDataFrame = hiveContext.createDataFrame(myData)
// Register as a temporary table called "mydata_stored"
myDataFrame.registerTempTable("mydata_stored")
// Persist it as a Hive table, overwriting it if it already exists
import org.apache.spark.sql.SaveMode
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
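As a quick sanity check (a minimal sketch reusing the table name from above), the persisted table can be queried back through the same HiveContext; this is also the table that a client such as Tableau, connecting through the Spark Thrift server, would see:

// Read the persisted table back and show a few rows
hiveContext.sql("SELECT * FROM mydata_stored LIMIT 10").show()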

Also make sure that s(4) holds a valid double value; otherwise add a try/catch to handle it. I did something like this:

// Fall back to 0.0 when the field cannot be parsed as a number
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: NumberFormatException => 0.0 }
parseDouble(s(4))
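An equivalent and arguably more idiomatic variant (my suggestion, not part of the original answer) uses scala.util.Try:

import scala.util.Try

// Returns 0.0 for any value that cannot be parsed as a Double
def parseDouble(s: String): Double = Try(s.toDouble).getOrElse(0.0)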

Regards, Kiran

Regarding "scala - Sharing HDInsight SPARK SQL table: saveAsTable does not work", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36082831/
