hadoop - Hive query execution plan

Here is my Hive query:

Insert into schemaB.employee partition(year) 
select * from schemaA.employee;

Below is the execution plan produced for this query.

hive> explain <query>;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: employee
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: Col1 (type: binary), col2 (type: binary), col3 (type: array<string>), year (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col3 (type: int)
                sort order: +
                Map-reduce partition columns: _col3 (type: int)
                Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: binary), _col1 (type: binary), _col2 (type: array<string>), _col3 (type: int)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: true
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: schemaB.employee

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            year 
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: schemaB.employee

  Stage: Stage-2
    Stats-Aggr Operator

I have two questions about this execution plan:

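1. Why is there a reduce step in the query plan? As I understand it, all the query needs to do is copy data from one HDFS location to another, which mappers alone could do. Does the reduce step have something to do with the partitions in the table?
2. What is the Stats-Aggr Operator step in Stage-2? I couldn't find any documentation that explains it.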

Best answer

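The documentation below answers both questions.
Statistics are gathered automatically by default, and a reduce step is needed for that. The Stats-Aggr Operator in Stage-2 is where those statistics are aggregated and stored in the metastore.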

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics

hive.stats.autogather

Default Value: true

Added In: Hive 0.7 with HIVE-1361

A flag to gather statistics automatically during the INSERT OVERWRITE command.
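
One way to check this explanation is to turn the flag off and compare the plans. The following is a minimal HiveQL sketch, not something taken from the answer. It assumes the same tables as in the question and a session where dynamic partitioning is allowed. The last property, hive.optimize.sort.dynamic.partition, is not mentioned in the answer; it is included as an assumption because the plan shuffles on the partition column (_col3, i.e. year), and its default varies by Hive version.

```sql
-- Sketch only: run in the same Hive session as the original query.
-- A fully dynamic partition insert (no static partition value) needs nonstrict mode.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Turn off automatic statistics gathering (the property quoted above).
SET hive.stats.autogather=false;

-- Re-run EXPLAIN and check whether Stage-2 (Stats-Aggr Operator) and the
-- Reduce Operator Tree are still present in the plan.
EXPLAIN
INSERT INTO TABLE schemaB.employee PARTITION (year)
SELECT * FROM schemaA.employee;

-- Assumption, not from the answer: if a reduce phase keyed on the partition
-- column remains, this optimization may be what adds it. Compare the plan again.
SET hive.optimize.sort.dynamic.partition=false;

EXPLAIN
INSERT INTO TABLE schemaB.employee PARTITION (year)
SELECT * FROM schemaA.employee;

-- With autogather off, statistics can still be computed explicitly afterwards.
ANALYZE TABLE schemaB.employee PARTITION (year) COMPUTE STATISTICS;
```

Comparing the three plans (the original, autogather off, and sorted dynamic partitioning off) should show which setting is responsible for each stage on a given Hive version.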

Regarding hadoop - Hive query execution plan, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43448218/
