java - StackOverflowError when running an iterative program (Java) in Spark

Tags: java apache-spark cluster-analysis

I am trying to implement a hierarchical agglomerative clustering algorithm in Spark.

I run it in local mode, and once the input data gets large it always throws java.lang.StackOverflowError.

Reportedly (see, for example, Cut off the super long serialization chain), this happens because the RDD's serialization chain (its lineage) grows too long. So I added a checkpoint every few iterations, but it still fails. Has anyone run into this problem?
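For context, the failure mode can be reproduced with a pattern like this (a minimal sketch, not my actual code): re-assigning an RDD through transformations in a loop adds one level to its lineage per iteration, and serializing tasks eventually recurses through the whole chain.

    JavaRDD<Long> rdd = sc.parallelize(Arrays.asList(1L, 2L, 3L));
    for (int i = 0; i < 10000; i++)
    {
        // Each iteration appends another stage to the RDD's lineage;
        // cache() stores the data but does NOT truncate the lineage.
        rdd = rdd.map(x -> x + 1).cache();
    }
    // Serializing the tasks for this action walks the entire lineage
    // recursively and can overflow the stack.
    rdd.count();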

My code is:

JavaRDD<LocalCluster> run(JavaRDD<Document> documents)
{
    @SuppressWarnings("resource")
    JavaSparkContext sc = new JavaSparkContext(documents.context());
    Broadcast<Double> mergeCriterionBroadcast = sc.broadcast(mergeCriterion);
    Accumulable<Long, Long> clusterIdAccumulator = sc.accumulable(documents.map(doc -> doc.getId()).max(Comparator.naturalOrder()), new LongAccumulableParam());

    // Clusters.
    JavaRDD<LocalCluster> clusters = documents.map(document -> 
    {
        return LocalCluster.of(document.getId(), Arrays.asList(document));          
    }).cache();

    // Calculate clusters pair-wise similarity, initially, each document forms a cluster.
    JavaPairRDD<Tuple2<LocalCluster, LocalCluster>, Double> similarities = clusters.cartesian(clusters).filter(clusterPair -> (clusterPair._1().getId() < clusterPair._2().getId()))
    .mapToPair(clusterPair ->
    {
        return new Tuple2<Tuple2<LocalCluster, LocalCluster>, Double>(clusterPair, LocalCluster.getSimilarity(clusterPair._1(), clusterPair._2()));
    })
    .filter(tuple -> (tuple._2() >= mergeCriterionBroadcast.value())).cache();

    // Merge the most similar two clusters.
    long count = similarities.count();
    int loops = 0;
    while (count > 0)
    {
        System.out.println("Count: " + count);
        Tuple2<Tuple2<LocalCluster, LocalCluster>, Double> mostSimilar = similarities.max(SerializableComparator.serialize((a, b) -> Double.compare(a._2(), b._2())));

        Broadcast<Tuple2<Long, Long>> MOST_SIMILAR = sc.broadcast(new Tuple2<Long, Long>(mostSimilar._1()._1().getId(), mostSimilar._1()._2().getId()));

    clusterIdAccumulator.add(1L);
    LocalCluster newCluster = LocalCluster.merge(mostSimilar._1()._1(), mostSimilar._1()._2(), clusterIdAccumulator.value());

    JavaRDD<LocalCluster> newClusterRDD = sc.parallelize(Arrays.asList(newCluster));
        JavaRDD<LocalCluster> filteredClusters = clusters.filter(cluster -> (cluster.getId() != MOST_SIMILAR.value()._1() && cluster.getId() != MOST_SIMILAR.value()._2()));

        JavaPairRDD<Tuple2<LocalCluster, LocalCluster>, Double> newSimilarities = filteredClusters.cartesian(newClusterRDD)
        .mapToPair(clusterPair ->
        {
            return new Tuple2<Tuple2<LocalCluster, LocalCluster>, Double>(clusterPair, LocalCluster.getSimilarity(clusterPair._1(), clusterPair._2()));
        })
        .filter(tuple -> (tuple._2() >= mergeCriterionBroadcast.value()));

        clusters = filteredClusters.union(newClusterRDD).coalesce(2).cache();
        similarities = similarities.filter(tuple -> 
                (tuple._1()._1().getId() != MOST_SIMILAR.value()._1()) && 
                (tuple._1()._1().getId() != MOST_SIMILAR.value()._2()) && 
                (tuple._1()._2().getId() != MOST_SIMILAR.value()._1()) && 
                (tuple._1()._2().getId() != MOST_SIMILAR.value()._2()))
                .union(newSimilarities).coalesce(4).cache();
        if ((loops++) >= 2)
        {
            clusters.checkpoint();
            similarities.checkpoint();
            loops = 0;
        }

        count = similarities.count();
    }

    return clusters.filter(cluster -> cluster.getDocuments().size() > 1);
}

The first part of the stack trace:

15/08/05 10:11:19 ERROR TaskSetManager: Failed to serialize task 4078, not attempting to retry it.

java.io.IOException: java.lang.StackOverflowError
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1242)
at org.apache.spark.rdd.CoalescedRDDPartition.writeObject(CoalescedRDD.scala:45)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at org.apache.spark.rdd.UnionPartition$$anonfun$writeObject$1.apply$mcV$sp(UnionRDD.scala:55)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
at org.apache.spark.rdd.UnionPartition.writeObject(UnionRDD.scala:52)
at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at org.apache.spark.rdd.CoalescedRDDPartition$$anonfun$writeObject$1.apply$mcV$sp(CoalescedRDD.scala:48)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
at org.apache.spark.rdd.CoalescedRDDPartition.writeObject(CoalescedRDD.scala:45)
... (the CoalescedRDDPartition.writeObject / UnionPartition.writeObject cycle above repeats verbatim for the remainder of the trace)

Another part of the stack trace:

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

And before the error, there are many warnings like this:

15/08/05 11:19:56 WARN TaskSetManager: Stage 519 contains a task of very large size (226 KB). The maximum recommended task size is 100 KB.

Best Answer

Finally found the problem: the use of checkpoint.

In Spark, checkpoint lets you truncate an RDD's lineage, but the checkpoint is only actually written after an action (such as collect or count) has been invoked on that RDD.

In my program, no action was ever called on the clusters RDD, so no checkpoint was created; the serialization chain of clusters kept growing and finally caused the StackOverflowError. To fix it, call count right after checkpoint:

        clusters = filteredClusters.union(newClusterRDD).coalesce(2).cache();
        similarities = similarities.filter(tuple -> 
                (tuple._1()._1().getId() != MOST_SIMILAR.value()._1()) && 
                (tuple._1()._1().getId() != MOST_SIMILAR.value()._2()) && 
                (tuple._1()._2().getId() != MOST_SIMILAR.value()._1()) && 
                (tuple._1()._2().getId() != MOST_SIMILAR.value()._2()))
                .union(newSimilarities).coalesce(4).cache();
        if ((loops++) >= 50)
        {
            clusters.checkpoint();
            clusters.count(); // an action must run on the RDD for the checkpoint to actually be written
            similarities.checkpoint();
            loops = 0;
        }
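For reference, the general pattern looks like this (a minimal sketch with a hypothetical checkpoint directory; checkpoint() requires one to be set on the context beforehand):

    sc.setCheckpointDir("/tmp/spark-checkpoints"); // hypothetical path; must be set before calling checkpoint()

    JavaRDD<Long> rdd = sc.parallelize(Arrays.asList(1L, 2L, 3L));
    for (int i = 0; i < 200; i++)
    {
        rdd = rdd.map(x -> x + 1).cache();
        if (i % 50 == 49)
        {
            rdd.checkpoint(); // only MARKS the RDD for checkpointing
            rdd.count();      // the action writes the checkpoint, truncating the lineage
        }
    }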

Regarding java - StackOverflowError when running an iterative program (Java) in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31807377/
