scala - Difference between df.saveAsTable and spark.sql(create table ..)

Referring to the discussion here on the difference between saveAsTable and insertInto:

What is the difference between the following two approaches?

df.write.saveAsTable("mytable");

and

df.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists mytable");
spark.sql("create table mytable as select * from my_temp_table");

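In which case is the table stored in memory, and in which case is it physically stored on disk?

Also, as I understand it, createOrReplaceTempView only registers the DataFrame (already in memory) so that it can be accessed through Hive queries, without actually persisting it. Is that correct?

I have to join hundreds of tables and I am hitting OutOfMemory issues.
In terms of efficiency, what would be the best way?

df.persist() and df.join(..).join(..).join(..)....

createOrReplaceTempView, then join with spark.sql(),

saveAsTable (? not sure about the next step)

Write to disk with Create Table, then join with spark.sql()?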

Best answer

Let's go step by step.

In the case of df.write.saveAsTable("mytable"), the table is actually written to storage (HDFS/S3). It is a Spark action.

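On the other hand, df.createOrReplaceTempView("my_temp_table") is lazy, like a transformation. It only registers a name for the DAG of df. Nothing is actually stored in memory or on disk.
spark.sql("drop table if exists mytable") drops the table if it already exists.
spark.sql("create table mytable as select * from my_temp_table") creates mytable on storage. Note that createOrReplaceTempView registers a session-scoped view that is referenced by its bare name; only createGlobalTempView and createOrReplaceGlobalTempView place the view in the global_temp database.

If the view is registered as a global temp view instead, the query has to be qualified:
create table mytable as select * from global_temp.my_temp_table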
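As a rough illustration of the two approaches from the question, here is a minimal sketch. It assumes Spark 2.1 or later on the classpath and a local session; the object name, the table names mytable_a and mytable_b, and the view name my_global_view are hypothetical, and the USING parquet clause is an assumption added so the statement runs without Hive support.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TableVsTempView {
  // Approach 1: an action that writes df to storage as a managed table.
  def viaSaveAsTable(df: DataFrame, table: String): Unit =
    df.write.mode("overwrite").saveAsTable(table)

  // Approach 2: register a session-scoped view (lazy, nothing is written),
  // then materialize it with CREATE TABLE ... AS SELECT.
  def viaTempViewAndCtas(spark: SparkSession, df: DataFrame, table: String): Unit = {
    df.createOrReplaceTempView("my_temp_table")
    spark.sql(s"DROP TABLE IF EXISTS $table")
    // USING parquet is an assumption so the sketch does not need Hive support.
    spark.sql(s"CREATE TABLE $table USING parquet AS SELECT * FROM my_temp_table")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("table-vs-view")
      .getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    viaSaveAsTable(df, "mytable_a")
    viaTempViewAndCtas(spark, df, "mytable_b")

    // A global temp view lives in the global_temp database and must be qualified.
    df.createOrReplaceGlobalTempView("my_global_view")
    spark.sql("SELECT count(*) FROM global_temp.my_global_view").show()

    spark.stop()
  }
}
```

Both approaches end with the data on storage; the temp view itself is only a name for the query plan and disappears with the session.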

createOrReplaceTempView only register the dataframe (already in memory) to be accessible through Hive query, without actually persisting it, is it correct?

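Yes: the view only names the DataFrame's plan and nothing is persisted. How much memory Spark can use for cached and execution data is governed by the spark.memory.fraction setting. Check this page.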

I have to Join hundreds of tables and hit OutOfMemory issue. In terms of efficiency, what would be the best way ?
df.persist() and df.join(..).join(..).join(..).... #hundred joins

createOrReplaceTempView then join with spark.sql(),

SaveAsTable (? not sure the next step)

Write to disk with Create Table then join with spark.sql()?

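persist will store some of the data in cached format depending on available memory. For a final table produced by joining hundreds of tables, this is probably not the best approach.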
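For reference, a minimal sketch of persisting a reused DataFrame with an explicit storage level, assuming Spark 2.1 or later on the classpath. The object and function names are hypothetical, and MEMORY_AND_DISK is one possible choice rather than a recommendation from the answer.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Mark df for caching; partitions that do not fit in memory spill to disk.
  def cacheForReuse(df: DataFrame): DataFrame = {
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count() // persist is lazy; an action materializes the cache
    cached
  }

  // Release the cached blocks once the DataFrame is no longer reused.
  def release(df: DataFrame): Unit = {
    df.unpersist()
    ()
  }
}
```

Caching only pays off when the same DataFrame is read more than once; for a single long chain of joins it mostly adds memory pressure.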

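It is not possible to recommend the one approach that will work for you, but here are some general patterns:

If the write fails with OOM and the default spark.sql.shuffle.partitions is in use, the starting point is to increase the number of shuffle partitions so that each executor's partitions are sized to fit its available memory.

The spark.sql.shuffle.partitions setting can be changed between different joins; it does not need to be constant across the Spark job.

Calculating partition sizes becomes difficult when many tables are involved. In that case, writing intermediate results to disk and reading them back before joining the large tables is a good idea.

Small tables, under 2 GB, can be broadcast. The default limit is 10 MB (I think), but it can be changed.

It is best to store the final table on disk rather than serve Thrift clients through temp tables.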
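The patterns above could be sketched roughly as follows, assuming Spark 2.x or later on the classpath. The object, function, and parameter names are hypothetical, Parquet as the intermediate format is an assumption, and the partition count and threshold passed in are placeholders to be tuned for the actual data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinTuningSketch {
  // One join step: set the shuffle partition count, join, write the result
  // to disk, and read it back. The setting is read when the write action
  // runs, so materializing per step is what lets it differ between joins.
  def joinStep(spark: SparkSession, left: DataFrame, right: DataFrame,
               key: String, partitions: Int, path: String): DataFrame = {
    spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)
    val joined = left.join(right, Seq(key))
    joined.write.mode("overwrite").parquet(path)
    spark.read.parquet(path) // short lineage for the next join
  }

  // Hint that the small side should be broadcast instead of shuffled.
  def joinSmall(large: DataFrame, small: DataFrame, key: String): DataFrame =
    large.join(broadcast(small), Seq(key))

  // Raise the automatic broadcast limit (default 10 MB) to the given size.
  def setBroadcastThreshold(spark: SparkSession, bytes: Long): Unit =
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", bytes.toString)
}
```

With hundreds of tables, joinStep can be applied to batches of joins, so that each batch starts from a table read from disk instead of an ever-growing plan.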

Good luck!

Regarding scala - Difference between df.saveAsTable and spark.sql(create table ..), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55692990/
