java - 星火 java : Creating a new Dataset with a given schema

标签 java scala apache-spark apache-spark-dataset

我有这段代码在 scala 中运行良好:

val schema = StructType(Array(
        StructField("field1", StringType, true),
        StructField("field2", TimestampType, true),
        StructField("field3", DoubleType, true),
        StructField("field4", StringType, true),
        StructField("field5", StringType, true)
    ))

val df = spark.read
    // some options
    .schema(schema)
    .load(myEndpoint)

我想用 Java 做一些类似的事情。所以我的代码如下:

final StructType schema = new StructType(new StructField[] {
     new StructField("field1",  new StringType(), true,new Metadata()),
     new StructField("field2", new TimestampType(), true,new Metadata()),
     new StructField("field3", new StringType(), true,new Metadata()),
     new StructField("field4", new StringType(), true,new Metadata()),
     new StructField("field5", new StringType(), true,new Metadata())
});

Dataset<Row> df = spark.read()
    // some options
    .schema(schema)
    .load(myEndpoint);

但这给了我以下错误:

Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@37c5b8e8 (of class org.apache.spark.sql.types.StringType)

我的模式似乎没有任何问题,所以我真的不知道这里的问题是什么。

spark.read().load(myEndpoint).printSchema();
root
 |-- field5: string (nullable = true)
 |-- field2: timestamp (nullable = true)
 |-- field1: string (nullable = true)
 |-- field4: string (nullable = true)
 |-- field3: string (nullable = true)

schema.printTreeString();
root
 |-- field1: string (nullable = true)
 |-- field2: timestamp (nullable = true)
 |-- field3: string (nullable = true)
 |-- field4: string (nullable = true)
 |-- field5: string (nullable = true)

编辑:

这是一个数据示例:

spark.read().load(myEndpoint).show(false);
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|field5                                                         |field2             |field1       |field4        |field3   |
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 16:54:50|SOME_VALUE   |SOME_VALUE    |0.0      |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 16:58:50|SOME_VALUE   |SOME_VALUE    |50.0     |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 17:00:50|SOME_VALUE   |SOME_VALUE    |20.0     |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"}  |2018-01-20 18:04:50|SOME_VALUE   |SOME_VALUE    |10.0     |
 ...
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+

最佳答案

使用 Datatypes 类中的静态方法和字段代替构造函数在 Spark 2.3.1 中为我工作:

    StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("field1",  DataTypes.StringType, true),
            DataTypes.createStructField("field2", DataTypes.TimestampType, true),
            DataTypes.createStructField("field3", DataTypes.StringType, true),
            DataTypes.createStructField("field4", DataTypes.StringType, true),
            DataTypes.createStructField("field5", DataTypes.StringType, true)
    });

关于java - 星火 java : Creating a new Dataset with a given schema,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/51635553/

相关文章:

scala - 将 .csv 文件读入 scala,同时保留部分结构

scala - 动态混合特征

java - 如何连接两个 Parquet 数据集?

hadoop - 为什么 ./bin/spark-shell 给出 WARN NativeCodeLoader : Unable to load native-hadoop library for your platform?

scala - 如何使用CrossValidator在不同模型之间进行选择

java - java.lang.NoSuchMethodError:com.google.common.collect.ImmutableSet.toImmutableSet(),即使使用 'com.google.guava:guava:24.0-android'

Java - 模式匹配奇怪的行为

java - 多个 Applet - stop() 和 destroy()

java - 单次迭代 => 从 Java 到 Scala 的多个输出集合

java - 检测内存不足错误