hadoop - Hive query control flow?

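What is the control flow of a Hive query?

Say I want to join Emp_Table and Dept_Table.

How does the flow proceed?

Which Metastore tables does it fetch all the relevant information from?

For example:

1) Where are the files for Emp_Table (the HDFS location)?

2) What are the field names of Emp_Table?

3) What is the delimiter in the files that hold Emp_Table's data?

4) If the data is bucketed or partitioned, from where (which Metastore table) and how (with what query) is the HDFS folder location obtained?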

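Best answer

The flow goes like this:

Step 1: A Hive client fires a query (the CLI, or an external client using JDBC, ODBC, Thrift, or the web UI).

Step 2: The compiler receives the query and connects to the Metastore.

Step 3: The compilation phase begins.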

Parser

Converts the query into a parse tree representation. ANTLR is used to generate the abstract syntax tree (AST).
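To make the parser stage concrete, here is a toy sketch in Python. It is not Hive's ANTLR grammar: the `parse_join_query` function, the regular expression, the column names, and the node labels (loosely styled after Hive's `TOK_` tokens) are all assumptions made for illustration, and it accepts only one fixed query shape.

```python
import re

# Toy grammar: SELECT <cols> FROM <left> JOIN <right> ON <key> = <key>
_QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<left>\w+)\s+JOIN\s+(?P<right>\w+)"
    r"\s+ON\s+(?P<lkey>[\w.]+)\s*=\s*(?P<rkey>[\w.]+)\s*;?\s*$",
    re.IGNORECASE,
)


def parse_join_query(sql):
    """Return a nested-tuple AST for the toy grammar, or raise ValueError."""
    match = _QUERY_RE.match(sql)
    if match is None:
        raise ValueError("syntax error: unsupported query shape")
    columns = tuple(c.strip() for c in match.group("cols").split(","))
    return (
        "TOK_QUERY",
        ("TOK_FROM",
         ("TOK_JOIN",
          ("TOK_TABREF", match.group("left")),
          ("TOK_TABREF", match.group("right")),
          ("=", match.group("lkey"), match.group("rkey")))),
        ("TOK_SELECT", columns),
    )


# Column names here are hypothetical; the question only names the two tables.
ast = parse_join_query(
    "SELECT Emp_Table.name, Dept_Table.dept_name FROM Emp_Table JOIN Dept_Table "
    "ON Emp_Table.dept_id = Dept_Table.dept_id"
)
```

The point is only that the output of this stage is a tree of syntax nodes; nothing has been checked against the Metastore yet.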

Semantic analyzer

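The compiler builds a logical plan based on the information the Metastore provides about the input and output tables. The compiler also checks type compatibility and reports compile-time semantic errors at this stage.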
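A minimal sketch of the kind of check described here, assuming the table metadata has already been fetched. The `METASTORE` dictionary, the column names, and the `SemanticError` class are hypothetical stand-ins, and requiring identical types is a simplification (Hive itself can convert between some types implicitly).

```python
# Hypothetical schema metadata, standing in for what the Metastore would return.
METASTORE = {
    "Emp_Table": {"emp_id": "int", "name": "string", "dept_id": "int"},
    "Dept_Table": {"dept_id": "int", "dept_name": "string"},
}


class SemanticError(Exception):
    """Raised for compile-time errors such as unknown tables or columns."""


def resolve_column(qualified_name):
    """Look up 'Table.column' and return its type, or raise SemanticError."""
    table, _, column = qualified_name.partition(".")
    if table not in METASTORE:
        raise SemanticError(f"table not found: {table}")
    if column not in METASTORE[table]:
        raise SemanticError(f"invalid column reference: {qualified_name}")
    return METASTORE[table][column]


def check_join_condition(left_key, right_key):
    """Verify that both join keys exist and have the same type."""
    left_type = resolve_column(left_key)
    right_type = resolve_column(right_key)
    if left_type != right_type:
        raise SemanticError(
            f"type mismatch in join: {left_key} is {left_type}, "
            f"{right_key} is {right_type}"
        )
    return left_type


join_key_type = check_join_condition("Emp_Table.dept_id", "Dept_Table.dept_id")
```

A reference to a column that does not exist, such as `Emp_Table.salary` in this made-up schema, would raise `SemanticError` before any job is launched.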

QBT creation

In this step, the AST is converted into an intermediate representation called the query block (QB) tree.

Logical plan generator

In this step, the compiler writes the logical plan from the semantic analyzer into a logical tree of operators.

Optimization

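This is the most important part of the compilation phase, because the whole series of DAG optimizations takes place here. It involves the following tasks: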

Logical optimization

Column pruning

Predicate pushdown

Partition pruning

Join optimization

Grouping (and regrouping)

Repartitioning

Conversion of logical plan into physical plan by physical plan generator

Creation of final DAG workflow of MapReduce by physical plan generator
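Two of the logical optimizations listed above, predicate pushdown and column pruning, can be sketched on a toy operator tree. The `Scan`, `Join`, and `Filter` classes, the column names, and the predicate format are assumptions invented for this example; Hive's real operator classes and optimizer rules are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Scan:
    table: str
    columns: list                                  # columns read from the table
    filters: list = field(default_factory=list)    # predicates applied while scanning


@dataclass
class Join:
    left: object
    right: object
    on: tuple                                      # (left_key, right_key)


@dataclass
class Filter:
    child: object
    predicate: tuple                               # (table, column, op, value)


def push_down_filter(plan):
    """Move a Filter sitting above a Join into the Scan of the table it references."""
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        table = plan.predicate[0]
        for side in (join.left, join.right):
            if isinstance(side, Scan) and side.table == table:
                side.filters.append(plan.predicate)
                return join                        # the Filter node is no longer needed
    return plan


def prune_columns(plan, needed):
    """Drop columns no operator needs (needed maps table name -> set of columns)."""
    if isinstance(plan, Join):
        prune_columns(plan.left, needed)
        prune_columns(plan.right, needed)
    elif isinstance(plan, Filter):
        prune_columns(plan.child, needed)
    elif isinstance(plan, Scan):
        keep = needed.get(plan.table, set())
        plan.columns = [c for c in plan.columns if c in keep]
    return plan


plan = Filter(
    child=Join(
        left=Scan("Emp_Table", ["emp_id", "name", "dept_id", "salary"]),
        right=Scan("Dept_Table", ["dept_id", "dept_name", "location"]),
        on=("Emp_Table.dept_id", "Dept_Table.dept_id"),
    ),
    predicate=("Dept_Table", "dept_name", "=", "Sales"),
)
needed = {"Emp_Table": {"name", "dept_id"}, "Dept_Table": {"dept_id", "dept_name"}}
optimized = prune_columns(push_down_filter(plan), needed)
```

After both passes the filter is applied while scanning Dept_Table, and each scan reads only the columns the query uses, so less data reaches the join. Note that this sketch mutates the plan in place for brevity.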

Step 4: The execution engine takes the compiler's output and executes it on the Hadoop platform. It involves the following tasks:

A MapReduce task first serializes its part of the plan into a plan.xml file.

The plan.xml file is then added to the job cache for the task, and instances of ExecMapper and ExecReducer are spawned using Hadoop.

Each of these classes deserializes the plan.xml file and executes the relevant part of the task.

The final results are stored in a temporary location. When the entire query completes, they are moved to the target table or partition if the query was an insert; otherwise they are returned to the calling program from the temporary location.

Note: All tasks are executed in the order of their dependencies. Each task is executed only after all of its prerequisites have been executed.
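The dependency rule in the note amounts to running the tasks in topological order. A minimal sketch using Python's standard `graphlib` module follows; the task names and the `run_in_dependency_order` helper are hypothetical, and this illustrates the ordering idea rather than Hive's actual scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
task_dependencies = {
    "scan_emp": set(),
    "scan_dept": set(),
    "join": {"scan_emp", "scan_dept"},
    "move_results": {"join"},
}


def run_in_dependency_order(dependencies, run_task):
    """Run every task once, only after all of its prerequisites have finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    executed = []
    while sorter.is_active():
        for task in sorter.get_ready():  # tasks whose prerequisites are all done
            run_task(task)
            executed.append(task)
            sorter.done(task)
    return executed


order = run_in_dependency_order(task_dependencies, lambda name: None)
```

The two scans have no prerequisites and may run in either order, but `join` always comes after both of them and `move_results` always comes last.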

To understand the Metastore tables and their fields, you can have a look at the ER diagram of the Metastore.

[Figure: ER diagram of the Hive Metastore schema]
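To connect this back to the four questions, here is a sketch that models a few Metastore tables in an in-memory SQLite database. The table and column names (`TBLS`, `SDS`, `COLUMNS_V2`, `SERDE_PARAMS`, `PARTITIONS`) are modeled on the Metastore schema but heavily simplified, and the exact layout can differ between Hive versions, so check them against your own installation. The sample rows, the HDFS paths, and the helper functions are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER, TBL_NAME TEXT, SD_ID INTEGER);
CREATE TABLE SDS (SD_ID INTEGER, CD_ID INTEGER, SERDE_ID INTEGER, LOCATION TEXT);
CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT, INTEGER_IDX INTEGER);
CREATE TABLE SERDE_PARAMS (SERDE_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
CREATE TABLE PARTITIONS (PART_ID INTEGER, TBL_ID INTEGER, SD_ID INTEGER, PART_NAME TEXT);

-- Invented sample rows for a hypothetical Emp_Table partitioned by year.
INSERT INTO TBLS VALUES (1, 'Emp_Table', 10);
INSERT INTO SDS VALUES (10, 100, 1000, 'hdfs://namenode/warehouse/emp_table');
INSERT INTO SDS VALUES (11, 100, 1000, 'hdfs://namenode/warehouse/emp_table/year=2013');
INSERT INTO COLUMNS_V2 VALUES (100, 'emp_id', 'int', 0);
INSERT INTO COLUMNS_V2 VALUES (100, 'name', 'string', 1);
INSERT INTO SERDE_PARAMS VALUES (1000, 'field.delim', ',');
INSERT INTO PARTITIONS VALUES (1, 1, 11, 'year=2013');
""")

_TABLE_SD = "FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID "


def table_location(name):
    """Question 1: the HDFS location comes from the table's storage descriptor."""
    row = conn.execute(
        "SELECT s.LOCATION " + _TABLE_SD + "WHERE t.TBL_NAME = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def column_names(name):
    """Question 2: field names, in declaration order."""
    rows = conn.execute(
        "SELECT c.COLUMN_NAME " + _TABLE_SD +
        "JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID "
        "WHERE t.TBL_NAME = ? ORDER BY c.INTEGER_IDX", (name,)
    ).fetchall()
    return [r[0] for r in rows]


def field_delimiter(name):
    """Question 3: the delimiter is stored as a SerDe parameter."""
    row = conn.execute(
        "SELECT p.PARAM_VALUE " + _TABLE_SD +
        "JOIN SERDE_PARAMS p ON s.SERDE_ID = p.SERDE_ID "
        "WHERE t.TBL_NAME = ? AND p.PARAM_KEY = 'field.delim'", (name,)
    ).fetchone()
    return row[0] if row else None


def partition_locations(name):
    """Question 4: each partition has its own storage descriptor and location."""
    rows = conn.execute(
        "SELECT pt.PART_NAME, s.LOCATION FROM TBLS t "
        "JOIN PARTITIONS pt ON pt.TBL_ID = t.TBL_ID "
        "JOIN SDS s ON pt.SD_ID = s.SD_ID WHERE t.TBL_NAME = ?", (name,)
    ).fetchall()
    return dict(rows)


location = table_location("Emp_Table")
columns = column_names("Emp_Table")
delimiter = field_delimiter("Emp_Table")
partitions = partition_locations("Emp_Table")
```

In everyday use you would normally not query these tables directly; running `DESCRIBE FORMATTED Emp_Table` from a Hive client reports the same location, column, and SerDe details.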

HTH

Regarding "hadoop - Hive query control flow?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17090022/
