apache-spark - Reading a file from AWS S3 with pyspark does not work

Tags: apache-spark hadoop amazon-s3 pyspark

I installed spark and hadoop using brew:

brew info hadoop #=> hadoop: stable 3.1.2
brew info apache-spark #=> apache-spark: stable 2.4.4

I am now trying to load a csv file hosted on s3. I have tried many different approaches without success (this is one of them):
import pyspark

conf = pyspark.SparkConf()

sc = pyspark.SparkContext('local[4]', conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "key here")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "key here")

sql = pyspark.SQLContext(sc)
df = sql.read.csv('s3a://pilo/fi/data_2014_1.csv')

I get this error:
19/09/17 14:34:52 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/ingestion/clients/aws_s3.py", line 51, in <module>
    df = sql.read.csv('s3a://pilo/fi/data_2014_1.csv')
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 476, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:618)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
        ... 30 more

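It looks like this has something to do with my AWS S3 credentials, but I am not sure how to set them up. (Currently my AWS credentials are in my bash_profile.) Please help.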

Best answer

Perhaps thisisprabin's solution here might help?

Add the following to the file "hadoop/etc/hadoop/core-site.xml":

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>***</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>***</value>
</property>


sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/

sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/
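
The ClassNotFoundException in the question means the jar that contains org.apache.hadoop.fs.s3a.S3AFileSystem (hadoop-aws) is not on Spark's classpath, which is what the two cp commands above address on the Hadoop side. A rough alternative sketch for a plain pyspark script follows. It assumes the jar can be pulled from Maven through spark.jars.packages, that the setting is applied before the JVM starts, and that hadoop-aws 2.7.3 matches the Hadoop build bundled with the installed pyspark; that version number is a guess and should be checked against the jars directory of the actual installation. The helper names s3a_spark_conf, build_session and load_csv are hypothetical. Note that the S3A connector reads its credentials from fs.s3a.access.key and fs.s3a.secret.key.

```python
import os

# Assumption: must match the Hadoop version bundled with the Spark build in
# use; 2.7.3 is an illustrative guess, not something stated in the question.
HADOOP_AWS_VERSION = "2.7.3"


def s3a_spark_conf(access_key, secret_key, hadoop_aws_version=HADOOP_AWS_VERSION):
    """Return Spark settings (hypothetical helper) for reading s3a:// paths."""
    return {
        # Ask Spark to fetch hadoop-aws and its dependencies from Maven.
        "spark.jars.packages": "org.apache.hadoop:hadoop-aws:" + hadoop_aws_version,
        # The "spark.hadoop." prefix forwards a setting to the Hadoop config.
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(conf_items):
    """Create a local SparkSession from the given settings (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily, only when called

    builder = SparkSession.builder.master("local[4]")
    for key, value in conf_items.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def load_csv(path):
    """Read a CSV from S3 using credentials exported in the environment."""
    conf_items = s3a_spark_conf(
        os.environ["AWS_ACCESS_KEY_ID"],
        os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    return build_session(conf_items).read.csv(path)
```

With AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY exported (for example from bash_profile), calling load_csv('s3a://pilo/fi/data_2014_1.csv') would attempt the same read as in the question. Reading the keys from the environment keeps them out of the source file.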

Regarding "apache-spark - Reading a file from AWS S3 with pyspark does not work", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57980317/
