java - Avoiding Group By in a JavaPairRDD in Apache Spark

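I have rewritten the code below in Spark using JavaRDD. I have read that groupByKey is an expensive operation.

Can it be rewritten without groupByKey?

After grouping by key, I update the values for each key where applicable.

Can anyone help?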

    List<Items> items = getItems();
    Map<String, List<ItemId>> itemsByName = items.stream()
            .collect(Collectors.groupingBy(ItemId::getName, Collectors.toList()));

    List<ItemId> newItems = itemsByName.entrySet().stream()
            .collect(Collectors.toMap(e -> e.getKey(), e -> {
                // update values if applicable
                List<ItemId> rps = e.getValue().stream().filter(s -> s.isApplicable()).collect(Collectors.toList());
                return rps.isEmpty() ? e.getValue() : rps;
            }))
            .values().stream()
            .flatMap(x -> x.stream()).collect(Collectors.toList());

JavaRDD

    JavaRDD<Items> items = getItemsRDD();
    JavaPairRDD<String, ItemId> itemsByName = 
            items.mapToPair(e -> new Tuple2<String, ItemId>(e.getName(), e));

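    JavaRDD<ItemId> newItems = itemsByName.groupByKey().mapValues(x -> {
        // update values if applicable: keep only the applicable items of a key,
        // or all of its items when none is applicable
        List<ItemId> e = new ArrayList<>();
        x.iterator().forEachRemaining(e::add);
        List<ItemId> rps = e.stream().filter(s -> s.isApplicable()).collect(Collectors.toList());
        return rps.isEmpty() ? e : rps;
    // On Spark 2.x and later, flatMap must return an Iterator: x -> x._2.iterator()
    }).flatMap(x -> x._2);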

I am trying to do something similar to "How to update column based on a condition (a value in a group)?", but in Java.

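Best answer

Avoid groupByKey where you can. Try reduceByKey instead: it first applies your function to the values of each key within every partition, and only then shuffles the partial results for the same key.

The less data is shuffled, the better.

Here is a good example: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html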
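The answer does not show what the function passed to reduceByKey could look like for this question, so here is one possible sketch rather than a definitive solution. It is plain Java without a Spark dependency so that it can be run on its own. `KeepApplicableSketch`, `Acc`, and the simplified `ItemId` stand-in are hypothetical names introduced for illustration; only `getName()` and `isApplicable()` come from the question. The idea is a small per-key accumulator that drops non-applicable items as soon as an applicable one has been seen, so partial results can be merged in any order.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class KeepApplicableSketch {

    /** Hypothetical, simplified stand-in for the question's ItemId class. */
    static final class ItemId implements Serializable {
        final String name;
        final boolean applicable;

        ItemId(String name, boolean applicable) {
            this.name = name;
            this.applicable = applicable;
        }

        String getName() { return name; }
        boolean isApplicable() { return applicable; }
    }

    /**
     * Partial result for one key. If anyApplicable is true, items holds only
     * applicable items; otherwise it holds every item seen so far.
     */
    static final class Acc implements Serializable {
        final boolean anyApplicable;
        final List<ItemId> items;

        Acc(boolean anyApplicable, List<ItemId> items) {
            this.anyApplicable = anyApplicable;
            this.items = items;
        }

        /** Wraps a single item as a one-element partial result. */
        static Acc of(ItemId item) {
            List<ItemId> one = new ArrayList<>();
            one.add(item);
            return new Acc(item.isApplicable(), one);
        }

        /** Merges two partial results; the order of items is not preserved. */
        static Acc merge(Acc a, Acc b) {
            // If only one side has applicable items, the other side is discarded.
            if (a.anyApplicable && !b.anyApplicable) return a;
            if (!a.anyApplicable && b.anyApplicable) return b;
            // Both sides are of the same kind, so their items are concatenated.
            List<ItemId> both = new ArrayList<>(a.items);
            both.addAll(b.items);
            return new Acc(a.anyApplicable, both);
        }
    }

    public static void main(String[] args) {
        // Simulates merging the values of one key that has an applicable item.
        Acc mixed = Acc.merge(
                Acc.merge(Acc.of(new ItemId("x", false)), Acc.of(new ItemId("x", true))),
                Acc.of(new ItemId("x", false)));
        System.out.println("kept for x: " + mixed.items.size());

        // Simulates a key without any applicable item: everything is kept.
        Acc none = Acc.merge(Acc.of(new ItemId("y", false)), Acc.of(new ItemId("y", false)));
        System.out.println("kept for y: " + none.items.size());
    }
}
```

Assuming the Spark 2.x Java API, this could be wired in roughly as `itemsByName.mapValues(Acc::of).reduceByKey(Acc::merge).values().flatMap(acc -> acc.items.iterator())`. Note the limits of this approach: keys without any applicable item still ship all of their items through the shuffle, so the saving depends on how often applicable items occur.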

Regarding "java - Avoiding Group By in a JavaPairRDD in Apache Spark", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46912651/
