java - How do I connect to Google BigQuery using Spark in Java, locally?

Tags: java apache-spark google-bigquery

I am trying to connect to Google BigQuery using Spark in Java, but I cannot find accurate documentation for this.

I have tried: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

https://github.com/GoogleCloudPlatform/spark-bigquery-connector#compiling-against-the-connector

My code:

sparkSession.conf().set("credentialsFile", "/path/OfMyProjectJson.json");
Dataset<Row> dataset = sparkSession.read().format("bigquery")
        .option("table", "myProject.myBigQueryDb.myBigQuweryTable")
        .load();
dataset.printSchema();

But this throws an exception:

Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
    at java.util.ServiceLoader.fail(ServiceLoader.java:232)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
    at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:614)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at com.mySparkConnector.getDataset(BigQueryFetchClass.java:12)


Caused by: java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment.  Please set a project ID using the builder.
    at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:285)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:91)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:30)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions$Builder.build(BigQueryOptions.java:86)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.getDefaultInstance(BigQueryOptions.java:159)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.$lessinit$greater$default$2(BigQueryRelationProvider.scala:29)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:40)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
    ... 15 more

My JSON file does contain a project_id. I have looked for possible solutions but could not find any, so please help me resolve this exception, or point me to any documentation on how to connect to BigQuery from Spark.

Best answer

I hit exactly the same error when using the DataProcPySparkOperator operator in Airflow. The fix was to provide

dataproc_pyspark_jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'

instead of

dataproc_pyspark_jars='gs://spark-lib/bigquery/spark-bigquery-latest.jar'

I suppose in your case it should be passed as a command-line argument:

--jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
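For a purely local run in Java, the same idea applies: the Scala-2.12 build of the connector must be on the classpath, and the "A project ID is required" error can be avoided by telling the connector which GCP project to use explicitly. A minimal sketch (assuming `parentProject` is set to your actual GCP project ID; the paths and names below are the placeholders from the question, not real values):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BigQueryFetch {
    public static void main(String[] args) {
        // Local Spark session; the spark-bigquery connector jar for Scala 2.12
        // must be on the classpath (e.g. via --jars or a Maven dependency).
        SparkSession spark = SparkSession.builder()
                .appName("bigquery-local")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> dataset = spark.read().format("bigquery")
                // Pass credentials as a read option rather than a runtime conf.
                .option("credentialsFile", "/path/OfMyProjectJson.json")
                // Explicit project ID, so the connector does not have to
                // infer it from the environment.
                .option("parentProject", "myProject")
                .option("table", "myProject.myBigQueryDb.myBigQuweryTable")
                .load();

        dataset.printSchema();
    }
}
```

This is a sketch, not a verified build: exact option names and jar coordinates should be checked against the spark-bigquery-connector README for the connector version you use.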

Regarding "java - How do I connect to Google BigQuery using Spark in Java, locally?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59195716/
