I wrote a simple Scala application that reads a parquet file from a GCS bucket. The application uses:
- JDK 17
- Scala 2.12.17
- Spark SQL 3.3.1
- GCS connector hadoop3-2.2.7
The connector is taken from Maven and imported through sbt (the Scala build tool). I am not using the latest version, 2.2.9, because of an issue.
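For reference, the setup above might correspond to a build.sbt along these lines. This is only a sketch: the versions come from the list above, and the Maven coordinates are the usual ones for these artifacts rather than something copied from the original project.

```scala
// build.sbt (sketch; the project layout and any other settings are assumed)
ThisBuild / scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  // Spark SQL, resolved for the Scala 2.12 binary version via %%
  "org.apache.spark" %% "spark-sql" % "3.3.1",
  // GCS connector for Hadoop 3, a plain Java artifact, so a single %
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.7"
)
```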
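The application runs fine in local mode, so I tried switching to standalone mode.
These are the steps I followed:
I then ran the application again and got this error: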
[error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
[error] at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
[error] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
[error] at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error] at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error] at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
[error] ... 14 more
[error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
[error] ... 24 more
Somehow it cannot find the connector's file system class: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
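One way to narrow this down is to check whether the class can be loaded at all in a given JVM. The sketch below uses only the standard library. The object and method names are made up for illustration, and it only tests the JVM it runs in, so a positive result on the driver says nothing about the executors, which have their own classpath in standalone mode.

```scala
object ClasspathCheck {
  /** True if `className` can be loaded (without initializing it) in this JVM. */
  def isClassAvailable(className: String): Boolean = {
    // Prefer the context class loader, fall back to the loader of this object.
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      // Class itself is missing, or something it links against is missing.
      case _: ClassNotFoundException | _: LinkageError => false
    }
  }

  def main(args: Array[String]): Unit = {
    val gcsFs = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    println(s"$gcsFs loadable: ${isClassAvailable(gcsFs)}")
  }
}
```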
My Spark configuration is very basic:
spark.app.name = "Example app"
spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"
Best answer
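I found that the Maven version of the GCS Hadoop connector is missing dependencies internally.
I fixed it in either of the following ways:
- Downloading the connector from https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage and providing it in the Spark configuration at startup. (This is not recommended for production, as the site states explicitly.)
- Providing the connector with its missing dependencies.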
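For the first option, one plausible way to hand the downloaded jar to Spark is the `spark.jars` property, which lists jars to add to the driver and executor classpaths. The answer does not say which mechanism was actually used, so treat `spark.jars` here as an assumption. The helper names, jar path, key file path, and bucket are all hypothetical, and the remaining keys are the ones from the question's configuration.

```scala
object GcsSparkConf {
  /** Properties from the question, plus `spark.jars` pointing at a local connector jar. */
  def build(connectorJar: String, keyFile: String, bucket: String): Map[String, String] = Map(
    "spark.jars" -> connectorJar, // assumption: jar is shipped via spark.jars
    "spark.hadoop.fs.defaultFS" -> s"gs://$bucket",
    "spark.hadoop.fs.gs.impl" -> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.AbstractFileSystem.gs.impl" -> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "spark.hadoop.google.cloud.auth.service.account.enable" -> "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile" -> keyFile
  )

  /** Renders the properties as `--conf key=value` arguments, sorted by key. */
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit = {
    // Hypothetical paths and bucket name, for illustration only.
    val conf = build("/opt/jars/gcs-connector.jar", "/opt/keys/gcp_key.json", "YOUR_GCP_BUCKET")
    println(toSubmitArgs(conf).mkString(" "))
  }
}
```

The same map can be applied in code by passing each entry to `SparkSession.builder().config(key, value)` before calling `getOrCreate()`.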
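For the second option, I unpacked the GCS Hadoop connector jar, located its pom.xml, copied the dependencies into a new standalone XML file, and then downloaded them with the command mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/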
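The "unpack the jar and look for pom.xml" step can also be done without extracting anything to disk. The following is a minimal sketch using `java.util.jar.JarFile`. The object and method names are invented for illustration, and it simply returns every entry whose name ends in pom.xml, so a jar that embeds several POMs will yield several results.

```scala
import java.io.File
import java.util.jar.JarFile
import scala.io.Source

object EmbeddedPom {
  /** Returns (entry name, contents) for every pom.xml found inside the jar. */
  def readEmbeddedPoms(jarPath: String): List[(String, String)] = {
    val jar = new JarFile(new File(jarPath))
    try {
      val entries = jar.entries()
      Iterator
        .continually(entries)
        .takeWhile(_.hasMoreElements)
        .map(_.nextElement())
        .filter(e => !e.isDirectory && e.getName.endsWith("pom.xml"))
        .map { e =>
          val in = jar.getInputStream(e)
          try {
            e.getName -> Source.fromInputStream(in, "UTF-8").mkString
          } finally {
            in.close()
          }
        }
        .toList // force evaluation while the jar is still open
    } finally {
      jar.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // args(0): local path to the connector jar (hypothetical, supplied by the caller)
    readEmbeddedPoms(args(0)).foreach { case (name, xml) =>
      println(s"--- $name ---")
      println(xml)
    }
  }
}
```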
Here is the sample pom.xml I created. Note that I am using version 2.2.9 of the connector:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<name>TMP_PACKAGE_NAME</name>
<description>
jar dependencies of gcs hadoop connector
</description>
<!--'com.google.oauth-client:google-oauth-client:jar:1.34.1'
-->
<groupId>TMP_PACKAGE_GROUP</groupId>
<artifactId>TMP_PACKAGE_NAME</artifactId>
<version>0.0.1</version>
<dependencies>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcs-connector</artifactId>
<version>hadoop3-2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.api-client</groupId>
<artifactId>google-api-client-jackson2</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>31.1-jre</version>
</dependency>
<dependency>
<groupId>com.google.oauth-client</groupId>
<artifactId>google-oauth-client</artifactId>
<version>1.34.1</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>util</artifactId>
<version>2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>util-hadoop</artifactId>
<version>hadoop3-2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcsio</artifactId>
<version>2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.auto.value</groupId>
<artifactId>auto-value-annotations</artifactId>
<version>1.10.1</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.google.flogger</groupId>
<artifactId>flogger</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>com.google.flogger</groupId>
<artifactId>google-extensions</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>com.google.flogger</groupId>
<artifactId>flogger-system-backend</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.10</version>
</dependency>
</dependencies>
</project>
Hope this helps.
Regarding "scala - Hadoop 3 gcs-connector does not work properly with the latest version of Spark 3 in standalone mode", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74569023/