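I know this must be a stupid question, but after several hours of googling I could not find an answer.
In plain-text formats such as CSV, it is easy to understand how delimiters work. ORC, however, is stored as binary in HDFS, so what is the field delimiter? I have been told that ORC has no delimiter, but I seriously doubt that.
Even if the data is stored in row groups, each column within a row group can hold many field values. How is each value distinguished from the next? How is each row separated from the next? Is there a delimiter that does this?
Thanks for any input!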
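Best answer
There is no delimiter. ORC organizes the file into stripes (with row-index strides inside each stripe) and locates data through the lengths and offsets recorded in its metadata rather than through separator bytes: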
The body of the file is divided into stripes. Each stripe is self contained and may be read using only its own bytes combined with the file’s Footer and Postscript. Each stripe contains only entire rows so that rows never straddle stripe boundaries. Stripes have three sections: a set of indexes for the rows within the stripe, the data itself, and a stripe footer. Both the indexes and the data sections are divided by columns so that only the data for the required columns needs to be read.
Source: ORC specification
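To make the "no delimiter" idea concrete, here is a minimal sketch in Python of how a string column can be stored without any separator bytes: the values are concatenated into one data stream, and a second stream records each value's length. This loosely mirrors the DATA and LENGTH streams that ORC uses for directly encoded string columns, but it is a toy illustration only. Real ORC additionally run-length encodes and compresses these streams, and the function names below are hypothetical, not part of any ORC library.

```python
def encode_string_column(values):
    """Encode a list of strings as (data, lengths) with no delimiter bytes."""
    encoded = [v.encode("utf-8") for v in values]
    data = b"".join(encoded)             # all values stored back to back
    lengths = [len(b) for b in encoded]  # one byte length per value
    return data, lengths


def decode_string_column(data, lengths):
    """Recover the values by slicing the data stream at the recorded lengths."""
    values = []
    offset = 0
    for n in lengths:
        values.append(data[offset:offset + n].decode("utf-8"))
        offset += n
    return values


# Values may contain commas, newlines, or be empty: nothing needs escaping.
names = ["alice", "bob,with,commas", "", "carol\nnewline"]
data, lengths = encode_string_column(names)
decoded_names = decode_string_column(data, lengths)

# Rows are reassembled by position: the i-th value of every column
# belongs to the i-th row, so no row separator is needed either.
ages = [30, 41, 25, 57]
rows = list(zip(decoded_names, ages))
```

Because values are found by length and position, a value can contain any byte sequence, which is one reason a binary columnar format has no need for the escaping rules that CSV requires.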
Regarding "hadoop - How does ORC separate fields?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40029538/