java - Spark: strange behavior of partitionBy, fields become unreadable

Tags: java scala apache-spark

I have a CSV record imported as a dataframe:

-------------------------------
| name | age  | entranceDate  |
-------------------------------
| Tom  | 12   | 2019-10-01    |
-------------------------------
| Mary | 15   | 2019-10-01    |
-------------------------------

When I use:

String[] partitions = new String[] {
    "name",
    "entranceDate"
};

df.write()
    .partitionBy(partitions)
    .mode(SaveMode.Append)
    .parquet(parquetPath);

it writes my Parquet files (.parquet). But strangely, when I try to read them back from Parquet:

public static StructType createSchema() {
    final StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("age", DataTypes.StringType, false),
            DataTypes.createStructField("entranceDate", DataTypes.StringType, false)
    ));
    return schema;
}


sqlContext.read()
    .schema(createSchema())
    .parquet(pathToParquet)
    .show();

the field name becomes unreadable:

+-----+-----+-------------+
| name|  age| entranceDate|
+-----+-----+-------------+
|?F...|  Tom|  2019-10-01 |
|?F...| Mary|  2019-10-01 |
+-----+-----+-------------+

How is this possible? I have tried it, and if I leave out the .partitionBy(partitions) line, I can read the data back without any problem.

Can someone explain the root cause? I have searched for a long time without finding the reason.

Edit: I tried retrieving the "name" field (row.getString(0)), and I get a value like the following, which is unreadable:

?F??m???9??A?Aorg/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter??:??A?Aorg.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter?? … (binary garbage continues, with repeated fragments of class names such as UnsafeRowWriter, BaseGenericInternalRow$class, and TreeNode$$anonfun$transformDown$2; truncated)

Best Answer

The columns get mixed up because of the way partitionBy saves the files. All columns specified in the partitionBy clause are stored as a directory structure. In your case it looks like:

<<root-path>>/name=???/entranceDate=???/???.parquet

This forces the partition columns to the end of the schema, in the left-to-right order of the directories.

So when reading the Parquet files back, specifying the schema as [age, name, entranceDate] should produce the correct values.
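The reordering rule described above can be sketched in plain Java. This is an illustrative sketch of the column-ordering behavior, not Spark API code; the class and method names (PartitionSchemaOrder, readOrder) are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSchemaOrder {
    // Sketch of the column order Spark ends up with for partitioned
    // Parquet output: non-partition columns keep their original order,
    // and the partitionBy columns are appended at the end, in the order
    // they were listed in partitionBy.
    static List<String> readOrder(List<String> original, List<String> partitions) {
        List<String> result = new ArrayList<>();
        for (String col : original) {
            if (!partitions.contains(col)) {
                result.add(col); // data columns first, original order preserved
            }
        }
        result.addAll(partitions); // partition columns appended last
        return result;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("name", "age", "entranceDate");
        List<String> partitions = Arrays.asList("name", "entranceDate");
        // prints [age, name, entranceDate]
        System.out.println(readOrder(original, partitions));
    }
}
```

With the OP's columns this yields [age, name, entranceDate], which matches why a schema declared as [name, age, entranceDate] maps the wrong bytes into the name column.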

Regarding "java - Spark: strange behavior of partitionBy, fields become unreadable", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58060308/
