scala - How do I set up a local development environment for Scala Spark ETL to run in AWS Glue?

Tags: scala pyspark sbt aws-glue

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I can't find the libraries needed to build the GlueApp skeleton that AWS generates.

aws-java-sdk-glue doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere, but perhaps they are only a Java/Scala port of this library: aws-glue-libs.

The template Scala code from AWS:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
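
For quick local iteration before the Glue classes are sorted out, the same transform can be sketched with plain Spark (the build.sbt below already pulls in spark-core; spark-sql would also be needed). This is only a rough stand-in under assumed inputs: the sample path, output path, and column handling are placeholders, and the real job reads from the Glue Data Catalog rather than a file.

import org.apache.spark.sql.SparkSession

object LocalEtlSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode session for development only.
    val spark = SparkSession.builder()
      .appName("local-etl-sketch")
      .master("local[*]")
      .getOrCreate()

    // Placeholder input: a local sample of the ticker data instead of the
    // catalog-backed S3 location used by the Glue job.
    val raw = spark.read.json("data/sample_tickers.json")

    // Rough equivalent of the ApplyMapping step: keep only the mapped columns.
    val mapped = raw.select("exchangeid", "data")

    // Rough equivalent of the DataSink step: gzip-compressed JSON output.
    mapped.write
      .option("compression", "gzip")
      .json("target/output")

    spark.stop()
  }
}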

The build.sbt I've started putting together for a local build:
name := "aws-glue-scala"

version := "0.1"

scalaVersion := "2.11.12"

updateOptions := updateOptions.value.withCachedResolution(true)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
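
If the Glue ETL classes are published to an AWS-hosted Maven repository under the com.amazonaws:AWSGlueETL coordinates (an assumption worth verifying against the current AWS Glue documentation, including the right version for the targeted Glue release), the build could be extended roughly like this:

// Sketch only: resolver URL, artifact coordinates and version are assumptions
// to check against the AWS Glue docs for the Glue version being targeted.
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"

libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "1.0.0" % "provided"

Marking the dependency as provided keeps the Glue runtime's own copy authoritative once the job is deployed.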

The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is needed is to download and build the PySpark AWS Glue library and add it to the classpath? That might work, since the Glue Python library uses Py4J.
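
If the jars behind aws-glue-libs can be downloaded or built locally, one way to put them on the compile classpath is sbt's unmanaged-jars mechanism; the directory below is a placeholder, not a real path from aws-glue-libs:

// Sketch only: point glueJarsDir at wherever the locally built or
// downloaded Glue jars actually live.
unmanagedJars in Compile ++= {
  val glueJarsDir = file("/path/to/aws-glue-libs/jars")
  (glueJarsDir ** "*.jar").classpath
}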

Best answer

Regarding "scala - How do I set up a local development environment for Scala Spark ETL to run in AWS Glue?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49254077/
