hadoop - When should we not use bucketing in Hive?

When should we not use bucketing in Hive? What are the bottlenecks of this technique?

Best answer

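I guess you do not have to use bucketing when you cannot benefit from it. As far as I know, the main benefits of bucketing are more efficient sampling and map-side joins (see below). So if your tables are small, or you do not need fast sampling and map-side joins, do not use it, because you have to bucket your data before inserting it, either manually or with set hive.enforce.bucketing = true; There is no bottleneck as such. Bucketing is just one possible data layout that lets you gain an advantage in some cases.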

Hive map-side join example:

If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join

SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key

can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter

set hive.optimize.bucketmapjoin = true
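
As a rough illustration of why only the matching bucket is needed, here is a minimal Python sketch of the idea. It is not Hive's implementation. The function names and sample rows are hypothetical, keys are assumed to be integers that hash to themselves, and the sketch only handles the case where the bucket count of a is a multiple of the bucket count of b.

```python
from collections import defaultdict


def bucketize(rows, num_buckets):
    """Group (key, value) rows by key mod num_buckets (int keys assumed)."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucket_map_join(a_rows, b_rows, a_buckets, b_buckets):
    """Return (a.key, a.value) for each match, reading one bucket of b per bucket of a."""
    if a_buckets % b_buckets != 0:
        raise ValueError("sketch assumes a's bucket count is a multiple of b's")
    a = bucketize(a_rows, a_buckets)
    b = bucketize(b_rows, b_buckets)
    result = []
    for a_id, a_part in a.items():
        # A key in bucket a_id of a can only live in bucket (a_id mod b_buckets) of b.
        b_part = b.get(a_id % b_buckets, [])
        b_keys = defaultdict(int)
        for key, _ in b_part:
            b_keys[key] += 1
        for key, value in a_part:
            # Emit one output row per matching row in b.
            result.extend([(key, value)] * b_keys.get(key, 0))
    return sorted(result)


# Hypothetical sample data.
a_rows = [(1, "a1"), (2, "a2"), (5, "a5"), (6, "a6")]
b_rows = [(1, "b1"), (5, "b5"), (7, "b7")]
print(bucket_map_join(a_rows, b_rows, 4, 4))
```

With 4 buckets on each side, the task handling bucket 1 of a (keys 1 and 5) looks only at bucket 1 of b. The same result comes out if b has 2 buckets instead, since every key in bucket i of a falls into bucket i mod 2 of b.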

Update: consider data skew when bucketing.

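The bucket a row goes into is computed as hash_function(bucketing_column) mod num_buckets. If your bucketing column is of type int, then hash_int(i) == i. So if that column has skewed values, for example one value that appears much more often than the others, many more rows will land in the corresponding bucket. You end up with disproportionately sized buckets, which hurts query speed. Hive has built-in tools for dealing with data skew (see Skewed Tables), but I think you should not bucket on a column with skewed data in the first place.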
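
To make the effect of skew concrete, here is a small Python sketch of the hash-mod-buckets rule. It assumes, as stated above, that an int value hashes to itself. The function name and the sample columns are made up for illustration and do not come from Hive.

```python
from collections import Counter


def bucket_sizes(values, num_buckets):
    """Count rows per bucket, assuming hash(i) == i for an int column."""
    counts = Counter(v % num_buckets for v in values)
    return [counts.get(b, 0) for b in range(num_buckets)]


# Hypothetical columns: one evenly spread, one where the value 7 dominates.
uniform = list(range(1000))
skewed = [7] * 900 + list(range(100))

print(bucket_sizes(uniform, 4))  # [250, 250, 250, 250]
print(bucket_sizes(skewed, 4))   # [25, 25, 25, 925]
```

In the skewed case, bucket 3 holds 925 of the 1000 rows, so whichever task processes that bucket does most of the work.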

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/39815797/
