scala - Spark: coalesce very slow even though the output data is very small

Tags: scala apache-spark coalesce

I have the following code in Spark:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .saveAsTextFile("myOutput")

There are 2000+ files in the myOutput folder, but only a few records satisfy t.getMyEnum() == null, so there are very few output records. Since I don't want to search through 2000+ output files for just a handful of results, I tried combining the output with coalesce, as below:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1, false)
      .saveAsTextFile("myOutput")

Then the job became very slow! I'd like to know why it is so slow. There are only a few output records, scattered across 2000+ partitions. Is there a better way to handle this?

Best Answer

If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
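To make that note concrete, here is a minimal sketch (assuming a local SparkContext; the RDD and the partition counts are made up for illustration) showing that coalesce can only grow the partition count when shuffle = true:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))

// A small RDD in 4 partitions, standing in for a skewed dataset.
val rdd = sc.parallelize(1 to 1000000, numSlices = 4)

// Without a shuffle, coalesce can merge partitions but never split them,
// so asking for more partitions than currently exist is a no-op:
rdd.coalesce(8).getNumPartitions                  // still 4

// With shuffle = true, the data is redistributed with a hash partitioner,
// so the partition count can actually grow:
rdd.coalesce(8, shuffle = true).getNumPartitions  // 8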


So, try passing true to the coalesce function, i.e.

myData.filter(_.getMyEnum == null)
      .map(_.toString)
      .coalesce(1, shuffle = true)
      .saveAsTextFile("myOutput")

Regarding scala - Spark: coalesce very slow even though the output data is very small, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31056476/
