apache-spark - What happens if the data cannot fit in memory with cache() in Spark?

Tags: apache-spark cluster-computing distributed-computing

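I am new to Spark. I have read in multiple places that calling cache() on an RDD causes it to be stored in memory, but so far I have not found clear guidelines or rules of thumb on how to determine the maximum amount of data that can be fit into memory. What happens if the amount of data I call cache() on exceeds the available memory? Will it cause my job to fail, or will the job still complete with a noticeable impact on cluster performance?

Thanks!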

Best answer

As the official documentation states explicitly for the MEMORY_ONLY persistence level (which is what cache() uses on an RDD):

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

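Even if the data fits in memory, it can be evicted when new data comes in. In practice, caching is more of a hint than a contract. You cannot depend on caching taking place, but you do not have to either: the job produces the same results whether or not the partitions are cached, and only the running time changes.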
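To make the "hint, not contract" behaviour concrete, here is a minimal toy model in plain Python. It is not Spark's implementation: the class `MemoryOnlyCache`, its slot-based `capacity`, and the `evict` method are hypothetical names invented for illustration, and memory is counted in whole partitions rather than bytes.

```python
class MemoryOnlyCache:
    """Toy model of MEMORY_ONLY caching (illustrative only, not Spark code)."""

    def __init__(self, capacity, compute):
        self.capacity = capacity    # assumed: memory measured in partition slots
        self.compute = compute      # rebuilds a partition, standing in for lineage
        self.cached = {}            # partition id -> partition data
        self.recomputations = 0     # how many times a partition had to be built

    def get(self, pid):
        if pid in self.cached:
            return self.cached[pid]          # served from memory
        self.recomputations += 1
        data = self.compute(pid)             # not cached: recompute on the fly
        if len(self.cached) < self.capacity:
            self.cached[pid] = data          # cache only if there is room
        return data

    def evict(self, pid):
        # Models new data pushing a cached partition out of memory.
        self.cached.pop(pid, None)


# Three partitions, but room for only two of them.
cache = MemoryOnlyCache(capacity=2, compute=lambda pid: [pid] * 3)

for _ in range(2):                  # two full passes over the data
    for pid in range(3):
        cache.get(pid)

after_two_passes = cache.recomputations   # 3 on the first pass + 1 on the second

cache.evict(0)                      # partition 0 is dropped from memory
result = cache.get(0)               # still returns the right data, just slower
after_eviction = cache.recomputations
```

In this sketch the partition that never fit is rebuilt on every pass, and an evicted partition is silently rebuilt on its next access. Nothing fails; the only observable difference is the extra computation.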

Note:

Keep in mind that the default StorageLevel for a Dataset is MEMORY_AND_DISK, which will:

If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
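The difference from MEMORY_ONLY can be sketched with another small toy model in plain Python. Again, this is not Spark code: `MemoryAndDiskCache`, `memory_slots`, and the dictionary standing in for local disk are hypothetical, and real Spark sizes blocks in bytes, not slots.

```python
class MemoryAndDiskCache:
    """Toy model of MEMORY_AND_DISK caching (illustrative only, not Spark code)."""

    def __init__(self, memory_slots, compute):
        self.memory_slots = memory_slots  # assumed: memory measured in partition slots
        self.compute = compute            # builds a partition the first time
        self.memory = {}                  # partitions held in memory
        self.disk = {}                    # stand-in for partitions spilled to disk
        self.computations = 0
        self.disk_reads = 0

    def get(self, pid):
        if pid in self.memory:
            return self.memory[pid]
        if pid in self.disk:
            self.disk_reads += 1          # slower than memory, but no recomputation
            return self.disk[pid]
        self.computations += 1
        data = self.compute(pid)
        # Partitions that do not fit in memory are stored on disk instead.
        target = self.memory if len(self.memory) < self.memory_slots else self.disk
        target[pid] = data
        return data


# Three partitions, memory for two: the third one goes to disk.
cache = MemoryAndDiskCache(memory_slots=2, compute=lambda pid: [pid] * 3)

for _ in range(2):                        # two full passes over the data
    for pid in range(3):
        cache.get(pid)

computations = cache.computations         # each partition is computed once
disk_reads = cache.disk_reads             # the spilled partition is read back once
```

Under these assumptions every partition is computed exactly once, and the partition that did not fit costs a disk read on later passes instead of a recomputation.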

See also:

(Why) do we need to call cache or persist on a RDD

Why do I have to explicitly tell Spark what to cache?

Regarding "apache-spark - What happens if the data cannot fit in memory with cache() in Spark?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35708833/
