java - newAPIHadoopRDD Task not serializable

Tags: java, apache-spark

I keep getting a Task not serializable error. I am using the mongo-hadoop connector in a Java Spark application.

The error looks like this:

16/10/12 17:43:34 INFO SparkContext: Created broadcast 0 from newAPIHadoopRDD at DataframeExample.java:47
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:313)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.map(RDD.scala:313)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
    at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:47)
    at com.hbfinance.DataframeExample.run(DataframeExample.java:54)
    at com.hbfinance.DataframeExample.main(DataframeExample.java:88)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: com.hbfinance.DataframeExample
Serialization stack:
    - object not serializable (class: com.hbfinance.DataframeExample, value: com.hbfinance.DataframeExample@1f3165e7)
    - field (class: com.hbfinance.DataframeExample$1, name: this$0, type: class com.hbfinance.DataframeExample)
    - object (class com.hbfinance.DataframeExample$1, com.hbfinance.DataframeExample$1@1866da85)
    - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
    - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 22 more

Here is my code:

package com.hbfinance;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;

import scala.Tuple2;

// AppLog is a plain bean in the same package (getters/setters for
// target and action; not shown here).
public class DataframeExample {

    public void run() {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf());
        // Set configuration options for the MongoDB Hadoop Connector.
        Configuration mongodbConfig = new Configuration();
        // MongoInputFormat allows us to read from a live MongoDB instance.
        // We could also use BSONFileInputFormat to read BSON snapshots.
        mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");

        // MongoDB connection string naming a collection to use.
        // If using BSON, use "mapred.input.dir" to configure the directory
        // where BSON files are located instead.
        mongodbConfig.set("mongo.input.uri",
          "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs");
        // mongodbConfig.set("mongo.input.uri",
        //   "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs");
        // mongodbConfig.set("mongo.input.uri",
        //   "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs");
        // mongodbConfig.set("mongo.input.uri",
        //   "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs");
        // mongodbConfig.set("mongo.input.uri",
        //   "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs");

        // Create an RDD backed by the MongoDB collection.
        JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
          mongodbConfig,            // Configuration
          MongoInputFormat.class,   // InputFormat: read from a live cluster.
          Object.class,             // Key class
          BSONObject.class          // Value class
        );

        JavaRDD<AppLog> logs = documents.map(

          new Function<Tuple2<Object, BSONObject>, AppLog>() {

              public AppLog call(final Tuple2<Object, BSONObject> tuple) {
                  AppLog log = new AppLog();
                  BSONObject header =
                    (BSONObject) tuple._2().get("headers");

                  log.setTarget((String) header.get("target"));
                  log.setAction((String) header.get("action"));

                  return log;
              }
          }
        );

        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        DataFrame logsSchema = sqlContext.createDataFrame(logs, AppLog.class);
        logsSchema.registerTempTable("logs");

        DataFrame groupedMessages = sqlContext.sql(
          "select target, action, Count(*) from logs group by target, action");
          // "SELECT to, body FROM messages WHERE to = \"eric.bass@enron.com\"");



        groupedMessages.show();

        logsSchema.printSchema();
    }

    public static void main(final String[] args) {
        new DataframeExample().run();
    }

}

Best answer

Your class DataframeExample must be serializable. Add implements Serializable to its declaration and it will work.
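A minimal sketch of that change (only the class declaration differs; run() and main() stay exactly as in your code):

package com.hbfinance;

import java.io.Serializable;

// Implementing Serializable lets Spark's Java serializer ship the
// anonymous Function, whose this$0 field points at this instance.
public class DataframeExample implements Serializable {
    // run() and main() unchanged ...
}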

Why?

The anonymous class created inside run() carries an implicit pointer (the this$0 field shown in the serialization stack above) to its enclosing class, DataframeExample. Spark has to serialize that anonymous function, so it also tries to serialize the outer instance, and fails because the class does not implement Serializable.
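If you would rather not make the whole outer class serializable, a common alternative is to move the mapping logic into a static nested (or top-level) class, so that no implicit this$0 reference is captured at all. A sketch under that assumption (the helper name ToAppLog is made up for illustration):

// A static nested class has no hidden pointer to the enclosing instance,
// so only this small, self-contained object is serialized to executors.
// org.apache.spark.api.java.function.Function already extends Serializable.
private static class ToAppLog
        implements Function<Tuple2<Object, BSONObject>, AppLog> {
    public AppLog call(final Tuple2<Object, BSONObject> tuple) {
        AppLog log = new AppLog();
        BSONObject header = (BSONObject) tuple._2().get("headers");
        log.setTarget((String) header.get("target"));
        log.setAction((String) header.get("action"));
        return log;
    }
}

// In run():
JavaRDD<AppLog> logs = documents.map(new ToAppLog());

On Java 8 a lambda behaves the same way, since a lambda only captures this if its body actually references instance members.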

Regarding java - newAPIHadoopRDD Task not serializable, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40009043/
