apache-spark - How does Spark handle data larger than cluster memory

Tags: apache-spark

If I have only one executor with 25 GB of memory, and it can run only one task at a time, is it possible to process (transform and act on) 1 TB of data? If so, how will the data be read, and where will the intermediate data be stored?

Also, for the same scenario: if the Hadoop file has 300 input splits, there will be 300 partitions in the RDD, so where do those partitions live in this case?
Will they simply stay on the Hadoop disks while my single task runs 300 times?

Best Answer

I found a good answer on the Hortonworks site.

Contrary to popular belief, Spark is not in-memory only.

a) Simple read, no shuffle (no joins, ...)

For the initial reads Spark, like MapReduce, reads the data in a stream and processes it as it comes along. I.e., unless there is a reason to, Spark will NOT materialize the full RDDs in memory (you can tell it to do so, however, if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (re-read a block from HDFS, for example), not because it is stored in memory in different locations (though that can be done too).

So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.
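To make point (a) concrete, here is a minimal Scala sketch; the HDFS path and the comma-separated record layout are made up for illustration. textFile is lazy, and filter plus reduceByKey keep only per-partition running aggregates in memory, so the full 1 TB table is never materialized unless you explicitly cache it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-read"))

    // textFile is lazy: reading a 1 TB file does not load it into memory;
    // each task streams one HDFS block at a time.
    val lines = sc.textFile("hdfs:///data/big-table")   // hypothetical path

    // filter + reduceByKey aggregate on the map side, so only the per-key
    // running counts are kept in memory, never the full table.
    val errorCounts = lines
      .filter(_.contains("ERROR"))
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)

    // The streamed read only happens when an action runs.
    errorCounts.take(10).foreach(println)

    // Only cache if the result is small enough to be worth keeping:
    // errorCounts.cache()

    sc.stop()
  }
}
```

reduceByKey combines values on the map side before any shuffle, which is the kind of map-side aggregation the quoted answer refers to.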

b) Shuffle

This is done very similarly to MapReduce: the map outputs are written to disk and the reducers read them over HTTP. However, Spark relies on an aggressive filesystem-buffering strategy on the Linux filesystem, so if the OS has memory available the data will not actually be written to physical disk.
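As an illustration of (b), the sketch below (the scratch directory and HDFS path are placeholders) sets spark.local.dir, the directory where Spark writes map-side shuffle files, and then runs a wide transformation that triggers a shuffle. Whether those files ever reach physical disk is up to the OS page cache, as the answer notes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    // spark.local.dir is where Spark writes map output (shuffle) files;
    // the directory below is illustrative.
    val conf = new SparkConf()
      .setAppName("shuffle-example")
      .set("spark.local.dir", "/mnt/spark-scratch")
    val sc = new SparkContext(conf)

    val pairs = sc.textFile("hdfs:///data/big-table")   // hypothetical path
      .map(line => (line.split(",")(0), line))

    // groupByKey is a wide dependency: map-side output is written to the
    // local filesystem (possibly only into the OS page cache) and the
    // reduce-side tasks fetch it over the network.
    val grouped = pairs.groupByKey()
    println(grouped.count())

    sc.stop()
  }
}
```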

c) After Shuffle

RDDs after a shuffle are normally cached by the engine (otherwise a failed node or RDD would require a complete re-run of the job); however, as abdelkrim mentions, Spark can spill these to disk unless you overrule that.
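One way to see the spill-to-disk behaviour mentioned in (c) is to persist a post-shuffle RDD explicitly. The sketch below continues inside the main method of the previous example and reuses its pairs RDD; StorageLevel.MEMORY_AND_DISK keeps what fits in memory and writes the remaining partitions to local disk:

```scala
import org.apache.spark.storage.StorageLevel

// Continuing inside the main method of the previous sketch:
// MEMORY_AND_DISK keeps partitions in memory while they fit and spills the
// rest to local disk, so a lost in-memory partition does not force a full
// recompute of the shuffle output.
val counts = pairs
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_AND_DISK)

counts.count()   // the first action materializes it; overflow partitions spill to disk
```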

d) Spark Streaming

This is a bit different. Spark Streaming expects all data to fit in memory unless you override the settings.
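For (d), one of the settings that can be adjusted is the storage level of receiver-based input streams. A minimal sketch follows; the host, port and batch interval are placeholders. Passing a StorageLevel that includes disk lets received blocks spill rather than requiring everything to stay in executor memory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingStorageExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-storage")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver-based streams accept an explicit StorageLevel; a level that
    // includes DISK lets received blocks spill to disk instead of requiring
    // every batch to fit entirely in executor memory.
    // Host and port are placeholders.
    val lines = ssc.socketTextStream(
      "stream-host", 9999, StorageLevel.MEMORY_AND_DISK_SER)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```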


Here is the original page.
Matei Zaharia's original Spark design paper is also helpful (Section 2.6.4, "Behavior with Insufficient Memory").
Hope something in there is useful.

Regarding "apache-spark - How does Spark handle data larger than cluster memory", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44961602/
