java - Apache Spark MLlib with the DataFrame API throws java.net.URISyntaxException on createDataFrame() or read().csv(...)

Tags: java apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

In a standalone application (running on Java 8 on Windows 10, with Spark-xxx_2.11:2.0.0 as jar dependencies), the following code fails:

/* this: */
Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
    new LabeledPoint(1.0, Vectors.dense(4.9,3,1.4,0.2)),
    new LabeledPoint(1.0, Vectors.dense(4.7,3.2,1.3,0.2))
  ), LabeledPoint.class);

/* or this: */
/* logFile: "C:\files\project\file.csv", "C:\\files\\project\\file.csv",
            "C:/files/project/file.csv", "file:/C:/files/project/file.csv",
            "file:///C:/files/project/file.csv", "/file.csv" */
Dataset<Row> logData = spark_session.read().csv(logFile);

Exception:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/files/project/spark-warehouse
               at org.apache.hadoop.fs.Path.initialize(Path.java:206)
               at org.apache.hadoop.fs.Path.<init>(Path.java:172)
               at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
               at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
               at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
               at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
               at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
               at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
               at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
               at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
               at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
               at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
               at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
               at <call in my line of code>
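
The message at the top of this trace can be reproduced without Spark or Hadoop. The sketch below assumes, based on the frames above, that Path.initialize ends up building a java.net.URI from separate scheme and path components; the class and method names are hypothetical and only the JDK is used. It shows that a "file" scheme combined with a path that does not start with "/" is rejected, while the same path with a leading "/" is accepted.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; it does not use Spark or Hadoop.
class RelativePathUriDemo {

    // Builds a URI from a scheme and a path using the multi-argument
    // java.net.URI constructor, and reports whether the JDK accepted it.
    static String describe(String scheme, String path) {
        try {
            return new URI(scheme, null, path, null, null).toString();
        } catch (URISyntaxException e) {
            return "rejected: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        // Path without a leading slash, as in the trace: rejected.
        System.out.println(describe("file", "C:/files/project/spark-warehouse"));
        // Same path with a leading slash: accepted.
        System.out.println(describe("file", "/C:/files/project/spark-warehouse"));
    }
}
```

This suggests the failure concerns the spark-warehouse location rather than the CSV path passed to read().csv(...), which would be consistent with every variant of logFile failing in the same way.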

How can I load a CSV file into a Dataset<Row> from Java code?

Best answer

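This is a problem with how the file system path is constructed. See the JIRA issue https://issues.apache.org/jira/browse/SPARK-15899. As a workaround, you can set "spark.sql.warehouse.dir" explicitly when building the SparkSession, as follows.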

SparkSession spark = SparkSession
  .builder()
  .appName("JavaALSExample")
  .config("spark.sql.warehouse.dir", "/file:C:/temp")
  .getOrCreate();
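
Instead of hard-coding the value, one option is to derive a file: URI from a local directory with the JDK. The following is a minimal sketch, not part of the original answer: the class name WarehouseDirUri and the method toFileUri are hypothetical, and whether a given Spark version accepts the resulting string for "spark.sql.warehouse.dir" is an assumption that has not been verified here.

```java
import java.io.File;

// Hypothetical helper; the names are illustrative only.
class WarehouseDirUri {

    // Converts a local directory name into an absolute file: URI string.
    // File.toURI() yields a path component that starts with "/",
    // for example file:/C:/temp on Windows.
    static String toFileUri(String localDir) {
        return new File(localDir).getAbsoluteFile().toURI().toString();
    }

    public static void main(String[] args) {
        // Uses the first argument if given, otherwise the JVM temp directory.
        String dir = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        System.out.println(toFileUri(dir));
    }
}
```

The returned string could then be passed as the second argument of .config("spark.sql.warehouse.dir", ...) in the builder above. A hand-written URI such as "file:///C:/temp" is another value one might try.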

Original question on Stack Overflow: https://stackoverflow.com/questions/38746340/
