The spark-csv README has the following sample Java code:
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.*;
SQLContext sqlContext = new SQLContext(sc);
StructType customSchema = new StructType(
new StructField("year", IntegerType, true),
new StructField("make", StringType, true),
new StructField("model", StringType, true),
new StructField("comment", StringType, true),
new StructField("blank", StringType, true));
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.load("cars.csv");
df.select("year", "model").write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv");
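It didn't work out of the box, so after some wrangling I got it to compile by changing the incorrect FooType syntax to DataTypes.FooType and passing the StructFields as a new StructField[]. The compiler also asked for a fourth argument, metadata, in the StructField constructor, but I had trouble finding documentation on what it means (the javadocs describe its use cases, but don't really settle how to decide what to pass in during StructField construction). With the following code, it now runs until it reaches any side-effecting method such as collect():
JavaSparkContext sc = new JavaSparkContext(conf);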
SQLContext sqlContext = new SQLContext(sc);
// Read features.
System.out.println("Reading features from " + args[0]);
StructType featuresSchema = new StructType(new StructField[] {
new StructField("case_id", DataTypes.StringType, false, null),
new StructField("foo", DataTypes.DoubleType, false, null)
});
DataFrame features = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(featuresSchema)
.load(args[0]);
for (Row r : features.collect()) {
System.out.println("Row: " + r);
}
I get the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:202)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
at scala.collection.immutable.HashSet.elemHashCode(HashSet.scala:65)
at scala.collection.immutable.HashSet.computeHash(HashSet.scala:74)
at scala.collection.immutable.HashSet.$plus(HashSet.scala:56)
at scala.collection.immutable.HashSet.$plus(HashSet.scala:59)
at scala.collection.immutable.Set$Set4.$plus(Set.scala:127)
at scala.collection.immutable.Set$Set4.$plus(Set.scala:121)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:24)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
at scala.collection.SetLike$class.map(SetLike.scala:93)
at scala.collection.AbstractSet.map(Set.scala:47)
at org.apache.spark.sql.catalyst.expressions.AttributeSet.foreach(AttributeSet.scala:114)
at scala.collection.TraversableOnce$class.size(TraversableOnce.scala:105)
at org.apache.spark.sql.catalyst.expressions.AttributeSet.size(AttributeSet.scala:56)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:307)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:282)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:56)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
...
Any idea what's wrong?
Best answer
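It seems the README is quite outdated and the Java example needs some significant editing. I tracked down the actual JIRA which added the metadata field, and it points to the use of a default value of Map.empty for the Scala case. Whoever wrote the documentation must have translated the Scala directly to Java, even though the Java constructor lacks that default for the input parameter.
In the 1.5 branch of SparkSQL's code we can see that it references metadata.hashCode() without a null check, which is what causes the NullPointerException. The existence of the Metadata.empty() method, combined with the discussion about using empty maps as the default in Scala, seems to imply that the correct approach is to go ahead and pass Metadata.empty() if you don't care about the metadata. The revised example should be: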
SQLContext sqlContext = new SQLContext(sc);
StructType customSchema = new StructType(new StructField[] {
new StructField("year", DataTypes.IntegerType, true, Metadata.empty()),
new StructField("make", DataTypes.StringType, true, Metadata.empty()),
new StructField("model", DataTypes.StringType, true, Metadata.empty()),
new StructField("comment", DataTypes.StringType, true, Metadata.empty()),
new StructField("blank", DataTypes.StringType, true, Metadata.empty())
});
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(customSchema)
.option("header", "true")
.load("cars.csv");
df.select("year", "model").write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv");
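To make the failure mode concrete, here is a minimal, Spark-free sketch of the pattern described above: a hashCode() that dereferences a metadata field without a null check. The Field class and the hashingThrowsNpe helper are hypothetical stand-ins written for illustration, not Spark's StructField or AttributeReference, and an empty map is used here only as an analogue of Metadata.empty().

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;

public class NullMetadataDemo {
    // Hypothetical stand-in for a schema field; NOT Spark's StructField.
    static final class Field {
        final String name;
        final Object metadata;

        Field(String name, Object metadata) {
            this.name = name;
            this.metadata = metadata;
        }

        // Dereferences metadata without a null check, like the pattern in the answer.
        @Override
        public int hashCode() {
            return 31 * name.hashCode() + metadata.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Field)) {
                return false;
            }
            Field other = (Field) o;
            return name.equals(other.name) && Objects.equals(metadata, other.metadata);
        }
    }

    // Returns true if adding the field to a hash-based set throws a NullPointerException.
    static boolean hashingThrowsNpe(Field f) {
        try {
            new HashSet<Field>().add(f); // HashSet.add calls f.hashCode()
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // null metadata: construction succeeds, hashing fails later
        System.out.println(hashingThrowsNpe(new Field("case_id", null)));
        // empty (non-null) metadata: hashing succeeds
        System.out.println(hashingThrowsNpe(new Field("case_id", Collections.emptyMap())));
    }
}
```

The first call prints true and the second prints false. This also mirrors why the question's code only failed at collect(): constructing the field with null is accepted, and the error surfaces later, when something hashes it (in the stack trace, while building a set during query planning).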
Regarding apache-spark - NullPointerException when using spark-csv with DataFrames, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34388705/