I'm having trouble reading an Apache Spark DataFrame from an HTTP source (e.g. a CSV file). HDFS and local files work fine.
I also managed to get AWS S3 working by starting spark-shell with:
spark-shell --packages org.apache.hadoop:hadoop-core:1.2.1
and then updating the Hadoop configuration like this:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "****")
hadoopConf.set("fs.s3.awsSecretAccessKey", "****")
IMHO there should be fs.http.impl and fs.https.impl parameters, together with corresponding implementations of org.apache.hadoop.fs.FileSystem. But I couldn't find anything. It's hard to believe there is no HTTP(S) support, since reading from a URL works out of the box in Pandas and R.
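For reference, this is how such a FileSystem would be wired in if one existed. A sketch only: org.example.HttpFileSystem is a made-up class name, and stock Hadoop ships no FileSystem implementation for the http/https schemes, which is exactly why the read fails with "No FileSystem for scheme: http". You would first have to write (or find) a class extending org.apache.hadoop.fs.FileSystem and put it on the classpath:

```scala
// Hypothetical: register a FileSystem implementation for the http/https
// schemes. NOTE: org.example.HttpFileSystem does NOT exist -- this only
// illustrates the fs.<scheme>.impl mechanism used for S3 above.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.http.impl", "org.example.HttpFileSystem")
hadoopConf.set("fs.https.impl", "org.example.HttpFileSystem")
```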
Any idea what I'm missing? By the way, here is the failing code block:
val df=spark.read.csv("http://raw.githubusercontent.com/romeokienzler/developerWorks/master/companies.csv")
which fails with the following error:
17/06/26 13:21:51 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: No FileSystem for scheme: http
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
  ... 48 elided
Best Answer
This is a duplicate of:
How to use Spark-Scala to download a CSV file from the web?
Just copying and pasting the answer here:
// Fetch the whole file on the driver over HTTP...
val content = scala.io.Source.fromURL("http://ichart.finance.yahoo.com/table.csv?s=FB").mkString
// ...split it into non-empty lines and distribute them as an RDD.
val list = content.split("\n").filter(_.nonEmpty)
val rdd = sc.parallelize(list)
// In spark-shell the implicits are already in scope; note that each row of
// the resulting DataFrame is a single string column (the raw CSV line).
val df = rdd.toDF
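A variant of the workaround that keeps the CSV semantics (header row, schema inference) instead of producing one-string-per-row: a sketch assuming a spark-shell session on Spark 2.2+, where spark.read.csv also accepts a Dataset[String], and assuming the URL from the question is reachable.

```scala
// Download the file on the driver (fine for small files only).
val url = "https://raw.githubusercontent.com/romeokienzler/developerWorks/master/companies.csv"
val lines = scala.io.Source.fromURL(url).mkString.split("\n").filter(_.nonEmpty)

// Spark 2.2+: feed the lines to the CSV reader so the usual options
// (header, inferSchema, ...) still apply.
import spark.implicits._
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(lines.toSeq.toDS())

df.show(5)
```

The design point is the same as the accepted answer: since Hadoop has no FileSystem for the http scheme, the transfer happens outside Spark and only the parsing is handed back to it.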
On "java - Apache Spark read DataFrame from HTTP source (e.g. csv, ...)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44758616/