hadoop - What does "the container format for fields in a row" mean for a file format?

From Hadoop: The Definitive Guide:

There are two dimensions that govern table storage in Hive: the row format and the file format.

The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word for a Serializer-Deserializer. When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When used as a serializer, which is the case when performing an INSERT or CTAS (see “Importing Data” on page 500), the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.

The file format dictates the container format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats available, too.

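What does "the container format for fields in a row" mean for a file format?

How does the file format differ from the row format?

Best answer

See also the guide on SerDe.

Hive uses a SerDe (together with a FileFormat) to read and write table rows.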

HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
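
The read path above can be modeled in a few lines of Python. This is only a conceptual sketch, not Hive code: `text_container_read`, `delimited_deserialize` and `read_rows` are hypothetical names, and the `\x01` separator is assumed here as the field delimiter. The point it illustrates is that the container layer only finds record boundaries, while the deserializer alone knows what the fields inside a record mean.

```python
import io

def text_container_read(raw: bytes):
    """Stand-in for a text input format: one record per line, key = byte offset."""
    offset = 0
    for line in io.BytesIO(raw):
        yield offset, line.rstrip(b"\n")
        offset += len(line)

def delimited_deserialize(value: bytes, columns, sep=b"\x01"):
    """Stand-in for a delimited deserializer: split one record into named fields."""
    fields = value.split(sep)
    return dict(zip(columns, (f.decode("utf-8") for f in fields)))

def read_rows(raw: bytes, columns):
    # file bytes -> container -> <key, value> -> deserializer -> row object
    return [
        delimited_deserialize(value, columns)
        for _key, value in text_container_read(raw)
    ]

# Two records, each holding two fields separated by the assumed delimiter.
raw_file = b"1\x01alice\n2\x01bob\n"
rows = read_rows(raw_file, ["id", "name"])
```

Swapping `delimited_deserialize` for another function changes the row format without touching how records are located in the file, which is the separation the two arrows describe.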

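You can create a table with a custom SerDe or with the native SerDe. The native SerDe is used if ROW FORMAT is not specified or if ROW FORMAT DELIMITED is specified.

The file format denotes the file container. It can be text or a binary format such as ORC or Parquet.

The row format can be simple delimited text, or something considerably more complex, such as a regexp/template-based format or JSON.

Consider JSON-formatted records in a text file: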

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE

Or JSON records in a sequence file:

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
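
To make the two DDL fragments concrete, here is a rough Python model of one row format (JSON) stored in two different containers. Everything in it is an assumption for illustration: the function names are hypothetical, and the length-prefixed layout is a toy binary container, not the real SequenceFile layout.

```python
import json
import struct

# Row format: how a single row becomes bytes (the SerDe's job).
def json_serialize(row: dict) -> bytes:
    return json.dumps(row, sort_keys=True).encode("utf-8")

def json_deserialize(value: bytes) -> dict:
    return json.loads(value.decode("utf-8"))

# Container 1: text, one record per line.
def write_text(records):
    return b"".join(r + b"\n" for r in records)

def read_text(raw: bytes):
    return raw.splitlines()

# Container 2: toy binary container, 4-byte big-endian length before each record.
def write_length_prefixed(records):
    return b"".join(struct.pack(">I", len(r)) + r for r in records)

def read_length_prefixed(raw: bytes):
    records, pos = [], 0
    while pos < len(raw):
        (size,) = struct.unpack_from(">I", raw, pos)
        pos += 4
        records.append(raw[pos:pos + size])
        pos += size
    return records

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
encoded = [json_serialize(r) for r in rows]

text_file = write_text(encoded)
binary_file = write_length_prefixed(encoded)

from_text = [json_deserialize(v) for v in read_text(text_file)]
from_binary = [json_deserialize(v) for v in read_length_prefixed(binary_file)]
```

The bytes on disk differ between the two files, yet the same pair of serialize/deserialize functions recovers identical rows from both, because the container and the row encoding are independent choices.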

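In fact, everything is a Java class. What confuses beginners is that the DDL offers shortcuts, which let you write the DDL without spelling out long, complicated class names for every format. Some classes have no corresponding shortcut built into the DDL language.

STORED AS SEQUENCEFILE is a shortcut for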

STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'

These two classes determine how the file container is read and written.

And this class determines how rows (JSON) are stored and read:

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'

Here is the DDL with both the row format and the file format, without shortcuts:

ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'
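
The shortcut expansion can be pictured as a simple lookup. The sketch below is hypothetical and is not how Hive implements it: `SHORTCUTS` and `storage_clause` are made-up names, and the table holds only the one mapping quoted in this answer.

```python
# Hypothetical lookup table: shortcut keyword -> (input format, output format).
# Only the mapping quoted in the answer is listed here.
SHORTCUTS = {
    "SEQUENCEFILE": (
        "org.apache.hadoop.mapred.SequenceFileInputFormat",
        "org.apache.hadoop.mapred.SequenceFileOutputFormat",
    ),
}

def storage_clause(serde_class: str, stored_as: str) -> str:
    """Build the long-form clauses from a SerDe class and a STORED AS keyword."""
    input_format, output_format = SHORTCUTS[stored_as.upper()]
    return "\n".join([
        "ROW FORMAT SERDE",
        f"  '{serde_class}'",
        "STORED AS INPUTFORMAT",
        f"  '{input_format}'",
        "OUTPUTFORMAT",
        f"  '{output_format}'",
    ])

clause = storage_clause("org.apache.hive.hcatalog.data.JsonSerDe", "sequencefile")
print(clause)
```

The SerDe class is passed in separately from the keyword, which mirrors the fact that the row format and the file format are named by different clauses.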

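To better understand the difference, look at the SequenceFileOutputFormat class (extends FileOutputFormat) and JsonSerDe (implements SerDe). You can dig deeper and try to understand the implemented methods and the base classes/interfaces, read the source code, and look at the serialize and deserialize methods in the JsonSerDe class.

And "the container format for fields in a row" is the FileInputFormat plus FileOutputFormat mentioned in the DDL above. With an ORC file, for example, you cannot specify a row format (delimited or any other SerDe). The ORC file format dictates that OrcSerDe is the only SerDe used for this type of file container, which has its own internal format for storing rows and columns. You can in fact write ROW FORMAT DELIMITED STORED AS ORC in Hive, but the ROW FORMAT DELIMITED clause is ignored in that case.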

Regarding hadoop - What does "the container format for fields in a row" mean for a file format?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56137811/
