apache-spark - What is the relationship between buckets and partitions?

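Is there a relationship between the partitions of an RDD and the buckets that the contents of the RDD get mapped to before a shuffle operation?

Secondly, will all key-value pairs with the same key be shuffled to the same bucket, or is the assignment of key-value pairs to buckets random? Does specifying a partitioner (hash/range) have any effect on this distribution?

Best answer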

Is there a relationship between the Partitions of an RDD and the Buckets which the contents of the RDD get mapped to before a shuffle operation?

If you are asking about bucketed tables (that is, after bucketBy and spark.table("bucketed_table")), I think the answer is yes.

Let me show you what I mean by answering "yes".

scala> val large = spark.range(1000000)
scala> println(large.queryExecution.toRdd.getNumPartitions)
8

scala> large.write.bucketBy(4, "id").saveAsTable("bucketed_4_id")
18/04/18 22:00:58 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`bucketed_4_id` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

scala> println(spark.table("bucketed_4_id").queryExecution.toRdd.getNumPartitions)
4

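In other words, the number of partitions (after loading the bucketed table) is exactly the number of buckets (the one you defined when saving).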

Secondly, will all key value pairs with same key be shuffled to the same bucket or is the distribution of key value pairs to buckets random?

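Spark 2.3 (and I believe earlier versions work similarly) does the bucketing per partition (per write task), i.e. every partition writes out the number of buckets you defined.

In the case above you end up with 8 (partitions) x 4 (buckets) = 32 bucket files. The listing below reports 34 because it has two extra lines: one for the _SUCCESS file and one for the header line of the ls -l output.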

$ ls -ltr spark-warehouse/bucketed_4_id | wc -l
      34
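
The 8 x 4 = 32 count can be reproduced without Spark. The following is a minimal sketch, not Spark's implementation: `BucketFileCountSketch`, `bucketOf` and `bucketFiles` are hypothetical names, the modulo hash is only a stand-in for Spark's real bucket hash, and it assumes that `spark.range` hands each partition a contiguous slice of ids and that each write task emits one file per bucket it encounters.

```scala
object BucketFileCountSketch {
  // Stand-in bucket function (assumption): Spark uses its own hash of the
  // bucket column; a plain modulo is enough to illustrate the counting.
  def bucketOf(id: Long, numBuckets: Int): Int = (id % numBuckets).toInt

  // Each partition (write task) produces one file per distinct bucket it sees,
  // so the total is the sum of distinct buckets over all partitions.
  def bucketFiles(numRows: Long, numPartitions: Int, numBuckets: Int): Int =
    (0 until numPartitions).map { p =>
      // Assumption: partition p holds the contiguous id slice [start, end).
      val start = p * numRows / numPartitions
      val end = (p + 1) * numRows / numPartitions
      (start until end).iterator.map(bucketOf(_, numBuckets)).toSet.size
    }.sum

  def main(args: Array[String]): Unit =
    println(bucketFiles(numRows = 1000000L, numPartitions = 8, numBuckets = 4))
}
```

Because every slice of 125,000 consecutive ids contains rows for all four buckets, each of the 8 partitions contributes 4 files, which matches the 32 bucket files in the listing above.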

Does specifying a partitioner (hash/range) have any effect on this distribution?

I think so, since a partitioner is what is used to distribute data across partitions.
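
To illustrate how a hash-based partitioner distributes keys, here is a minimal sketch in plain Scala. `HashPartitionSketch` and `partitionFor` are hypothetical names, and the function only mirrors the general idea of hash partitioning (partition index = non-negative `hashCode` modulo the partition count); it is not taken from the answer above. Under that scheme the assignment is deterministic, so pairs with equal keys always land in the same partition.

```scala
object HashPartitionSketch {
  // Hypothetical helper: map a key to a partition index in [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0 // treat a null key as partition 0
    case k =>
      val raw = k.hashCode % numPartitions
      // hashCode may be negative, so shift the remainder into range.
      if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5)
    // Group the pairs by their target partition; equal keys share a group.
    val byPartition = pairs.groupBy { case (k, _) => partitionFor(k, 4) }
    byPartition.toSeq.sortBy(_._1).foreach { case (p, kvs) =>
      println(s"partition $p: ${kvs.mkString(", ")}")
    }
  }
}
```

A range partitioner would instead assign keys by comparing them against sorted boundary values, so the choice of partitioner changes which partition a key goes to, but not the fact that equal keys stay together.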

On this topic, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31971150/
