scala - Spark : save and load machine learning model on s3

Tags: scala apache-spark amazon-s3 kryo

I want to save and load a machine learning model on S3.

Here is what I did:

import com.amazonaws.auth.profile.ProfileCredentialsProvider
import org.apache.spark.ml.tuning.TrainValidationSplitModel

// Read credentials from the local AWS profile and pass them to Hadoop's S3 connector
val credentials = new ProfileCredentialsProvider()
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", credentials.getCredentials.getAWSAccessKeyId)
hadoopConf.set("fs.s3.awsSecretAccessKey", credentials.getCredentials.getAWSSecretKey)

TrainValidationSplitModel.load(s"s3://model_path")

It works when I run it locally.

However, when I run it on a cluster, I get the following error:

Serialization trace:
fields (org.apache.spark.sql.types.StructType)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:366)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:307)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.types.StructField[]
Note: To register this class use: kryo.register(org.apache.spark.sql.types.StructField[].class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:488)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:76)
... 10 more

You might say: "That's easy, just register the org.apache.spark.sql.types.StructField class with kryo.register(SomeClass.class);"

But after registering nearly fifteen classes, Kryo asked me to register a private class (one whose access is restricted to the Spark package).
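For reference, the registrations looked roughly like the sketch below. The class list is only illustrative of the kind of classes Kryo kept demanding (StructType, StructField and the array type from the stack trace), not the full fifteen:

import org.apache.spark.SparkConf

// Illustrative sketch of the registration attempts; the class list is not exhaustive
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[org.apache.spark.sql.types.StructType],
    classOf[org.apache.spark.sql.types.StructField],
    classOf[Array[org.apache.spark.sql.types.StructField]]
  ))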

How can I solve this problem?

Best Answer

The error is not related to saving and loading the model.

It is caused by spark.kryo.registrationRequired being set to true somewhere in your configuration. When it is, the setting behaves as follows:

Whether to require registration with Kryo. If set to 'true', Kryo will throw an exception if an unregistered class is serialized. If set to false (the default), Kryo will write unregistered class names along with each object. Writing class names can cause significant performance overhead, so enabling this option can enforce strictly that a user has not omitted classes from registration.

My personal recommendation is to use it only for diagnostics and to disable it when actually running the application.
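A minimal sketch of turning the flag off, assuming you build the session yourself (the app name is just a placeholder); the same can be achieved with --conf spark.kryo.registrationRequired=false on spark-submit:

import org.apache.spark.sql.SparkSession

// false is the default: unregistered classes are serialized with their class names written out
val spark = SparkSession.builder()
  .appName("model-loading") // placeholder name
  .config("spark.kryo.registrationRequired", "false")
  .getOrCreate()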

Regarding "scala - Spark : save and load machine learning model on s3", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51873203/
