sql - Hive query performance for a high-cardinality field

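I have a single but huge table in Hive that will almost always be queried by its primary key column (for example, employee_id). The table will be very large, with millions of rows inserted each day, and I want fast queries using a partition on this field. I followed this post, and I understand that partitioning only works well for low-cardinality fields, so how can I achieve fast queries on the employee_id column?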

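I understand that an id column with very high cardinality should be used for bucketing, but that does not help query performance on a single table, does it?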

I think that if I could partition on something like hash(employee_id), it would help me a lot. Is that possible? I could not find anything like it in the Hive documentation.

To summarize, what I want is a fast result for this query:

    select * from employee where employee_id=XXX

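Assume the employee table has billions of records with primary key column employee_id, and that classic partitioning by year, month, day, etc. does not help.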

Thanks in advance.

Best answer

Use ORC with bloom filters:

    CREATE TABLE employee (
      employee_id bigint,
      name STRING
    ) STORED AS ORC 
    TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id")
    ;
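
Bloom filters and the ORC min/max indexes tend to skip the most data when rows with nearby employee_id values end up in the same files and stripes. A minimal sketch of one way to arrange that at load time, assuming a hypothetical staging table named employee_staging with the same two columns (the staging table is not part of the original answer):

```sql
-- employee_staging is a hypothetical source table used only for illustration.
-- DISTRIBUTE BY sends all rows for a given employee_id to the same reducer,
-- and SORT BY orders the rows each reducer writes, so every ORC file
-- covers narrow, ordered ranges of employee_id.
INSERT OVERWRITE TABLE employee
SELECT employee_id, name
FROM employee_staging
DISTRIBUTE BY employee_id
SORT BY employee_id;
```

Whether the extra sort is worth its cost depends on how often the table is reloaded compared with how often it is queried.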

Enable predicate pushdown (PPD) together with vectorization, and use CBO and Tez:

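    -- predicate pushdown, including down to the storage layer
    SET hive.optimize.ppd=true;
    SET hive.optimize.ppd.storage=true;
    -- vectorized execution
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
    -- cost-based optimizer and the statistics it relies on
    SET hive.cbo.enable=true;
    SET hive.stats.autogather=true;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET hive.stats.fetch.column.stats=true;
    -- Tez execution engine
    SET hive.execution.engine=tez;
    -- map-side aggregation
    SET hive.map.aggr=true;
    -- let Tez adjust the number of reducers at runtime
    SET hive.tez.auto.reducer.parallelism=true;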
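
One way to check that the filter actually reaches the table scan is to look at the query plan. A short sketch: the literal 12345 is a placeholder, and hive.optimize.index.filter is not in the list above. As far as I know, it is the setting that lets the ORC reader use its row-group indexes and bloom filters, and its default differs between Hive versions, so treat enabling it as an assumption to verify on your cluster:

```sql
-- Assumption: not part of the original answer; check the default for your Hive version.
SET hive.optimize.index.filter=true;

-- 12345 is a placeholder id.
EXPLAIN
SELECT * FROM employee WHERE employee_id = 12345;
-- In the plan, the TableScan operator should carry a filterExpr entry
-- for the employee_id predicate when it has been pushed down to the scan.
```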

Reference: https://community.cloudera.com/t5/Community-Articles/Optimizing-Hive-queries-for-ORC-formatted-tables/ta-p/248164

Tune the parallelism of mappers and reducers appropriately:

    -- example settings for mappers:
    set tez.grouping.max-size=67108864;
    set tez.grouping.min-size=32000000;

    -- example settings for reducers:
    set hive.exec.reducers.bytes.per.reducer=67108864; -- decrease this to increase the number of reducers

Adjust these numbers to get the best performance.
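
As a rough rule of thumb, Hive estimates the number of reducers as the data size divided by hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max. A worked example, assuming 10 GiB of data reaching the reduce stage (the size is made up for illustration):

```sql
-- 67108864 bytes = 64 MiB per reducer
SET hive.exec.reducers.bytes.per.reducer=67108864;

-- Assumed data size: 10 GiB = 10737418240 bytes
-- 10737418240 / 67108864 = 160 reducers
--   (before the hive.exec.reducers.max cap and any Tez auto-parallelism adjustment)
-- Halving the setting to 33554432 (32 MiB) would double the estimate to 320.
```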

This Q&A comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/48295667/
