java - How to add the Hadoop AWS jars to Spark 2.4.5 with JDK 1.8?

Tags: java amazon-web-services apache-spark hadoop pyspark

I ran into the error java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found and stumbled upon a working solution here.
However, in a comment posted below that answer, its author points out the following:

com.amazonaws:aws-java-sdk-pom:1.11.760 : depends on jdk version
hadoop:hadoop-aws:2.7.0: depends on your hadoop version
s3.us-west-2.amazonaws.com: depends on your s3 location


So when I run the following command:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.8.0_242,org.apache.hadoop:hadoop-aws:2.8.5

I get the following error:
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.amazonaws#aws-java-sdk-pom;1.8.0_242: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/pyspark/shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
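
The unresolved-dependency line already points at the root cause: 1.8.0_242 is a JDK build string, not a published version of the com.amazonaws:aws-java-sdk-pom artifact, so Ivy cannot find com.amazonaws#aws-java-sdk-pom;1.8.0_242 on Maven Central. The version in a --packages coordinate must be a Maven artifact version chosen to match the cluster's Hadoop version, never the JDK version. As a minimal sketch, assuming PySpark 2.4.5 against Hadoop 2.8.5 and a fresh python process (spark.jars.packages only takes effect if the JVM has not started yet); hadoop-aws 2.8.5 should pull in the AWS SDK artifact it was built against as a transitive dependency:

    # A minimal sketch, assuming PySpark 2.4.5 with Hadoop 2.8.5 jars.
    # spark.jars.packages behaves like --packages on the command line:
    # coordinates are resolved from Maven Central when the JVM starts.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-demo")  # hypothetical app name
        # Match hadoop-aws to the installed Hadoop version (2.8.5); its
        # matching AWS SDK jar is resolved transitively.
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.8.5")
        .getOrCreate()
    )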

The reasons I changed the command are as follows:
  • JDK version:
    (app-root) java -version
    openjdk version "1.8.0_242"
    OpenJDK Runtime Environment (build 1.8.0_242-b08)
    OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

  • PySpark version: 2.4.5
  • Hadoop version: 2.8.5

How do I resolve this error and launch a pyspark shell with the correct dependencies so that I can read files from S3?

    Best Answer

    This worked for me with spark:2.4.4-hadoop2.7:

        --conf spark.executor.extraClassPath=/hadoop-aws-2.7.3.jar:/aws-java-sdk-1.7.4.jar --driver-class-path /hadoop-aws-2.7.3.jar:/aws-java-sdk-1.7.4.jar
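
    With those jars on both the driver and executor classpaths, reading from S3 works through the s3a:// scheme. A short usage sketch, assuming credentials are passed explicitly (they can also come from environment variables or an instance profile); the bucket and file names below are hypothetical:

        # A usage sketch, assuming hadoop-aws and the AWS SDK are on the
        # classpath as above; bucket and key names are hypothetical.
        # The spark.hadoop.* prefix forwards options into the Hadoop config.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
            .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
            .getOrCreate()
        )

        df = spark.read.text("s3a://my-bucket/path/to/file.txt")
        df.show(5)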
    

    Regarding java - How to add the Hadoop AWS jars to Spark 2.4.5 with JDK 1.8?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62758620/
