java - Why does Collection.parallelStream() exist when .stream().parallel() does the same thing?

Tags: java java-8 java-stream

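In Java 8, the Collection interface was extended with two methods that return Stream<E>: stream(), which returns a sequential stream, and parallelStream(), which returns a possibly parallel stream. Stream itself also has a parallel() method that returns an equivalent parallel stream (either by making the current stream parallel or by creating a new stream).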
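To make the overlap concrete, here is a minimal sketch that queries the mode of each kind of stream with isParallel(). The class name, method names, and sample data are hypothetical, chosen only for illustration; the list comes from Arrays.asList, which (as far as I can tell) inherits the default Collection.parallelStream().

```java
import java.util.Arrays;
import java.util.List;

class StreamModesDemo {
    // Collection.stream(): documented to return a sequential stream.
    static boolean streamIsParallel(List<Integer> data) {
        return data.stream().isParallel();
    }

    // Collection.parallelStream(): documented as "possibly parallel".
    static boolean parallelStreamIsParallel(List<Integer> data) {
        return data.parallelStream().isParallel();
    }

    // Stream.parallel(): switches an existing stream to parallel mode.
    static boolean streamThenParallelIsParallel(List<Integer> data) {
        return data.stream().parallel().isParallel();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5); // hypothetical sample data
        System.out.println("stream():            " + streamIsParallel(data));
        System.out.println("parallelStream():    " + parallelStreamIsParallel(data));
        System.out.println("stream().parallel(): " + streamThenParallelIsParallel(data));
    }
}
```

For this particular list, both parallel variants report the same mode, which is exactly the duplication the question asks about.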

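The duplication has obvious disadvantages:

  • It's confusing. One question asks whether calling both parallelStream().parallel() is necessary to be sure the stream is parallel, given that parallelStream() may return a sequential stream. Why does parallelStream() exist if it can't make a guarantee? The other direction is confusing too: if parallelStream() returns a sequential stream, there is probably a reason (e.g., an inherently sequential data structure for which parallel streams are a performance trap); what should Stream.parallel() do for such a stream? (The specification of parallel() does not allow UnsupportedOperationException.)

  • Adding a method to an interface risks conflicts if an existing implementation has a similarly named method with an incompatible return type. Adding parallelStream() in addition to stream() doubles that risk for little gain. (Note that parallelStream() was at one point named just parallel(), though I don't know whether it was renamed to avoid name clashes or for some other reason.)

Why does Collection.parallelStream() exist when calling Collection.stream().parallel() does the same thing?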

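Best answer

The Javadocs for Collection.(parallelS|s)tream() and for Stream itself don't answer the question, so it's off to the mailing lists for the rationale. I went through the lambda-libs-spec-observers archives and found one thread specifically about Collection.parallelStream() and another thread that touched on whether java.util.Arrays should provide parallelStream() to match (or, actually, whether it should be removed). There was no once-and-for-all conclusion, so perhaps I've missed something from another list, or the matter was settled in private discussion. (Perhaps Brian Goetz, one of the principals of this discussion, can fill in anything missing.)

The participants made their points well, so this answer is mostly just an organization of the relevant quotes, with a few clarifications in [brackets], presented in order of importance (as I interpret it).

parallelStream() covers a very common case

Brian Goetz in the first thread, explaining why Collection.parallelStream() is valuable enough to keep even after other parallel stream factory methods have been removed: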

We do not have explicit parallel versions of each of these [stream factories]; we did originally, and to prune down the API surface area, we cut them on the theory that dropping 20+ methods from the API was worth the tradeoff of the surface yuckiness and performance cost of .intRange(...).parallel(). But we did not make that choice with Collection.

We could either remove the Collection.parallelStream(), or we could add the parallel versions of all the generators, or we could do nothing and leave it as is. I think all are justifiable on API design grounds.

I kind of like the status quo, despite its inconsistency. Instead of having 2N stream construction methods, we have N+1 -- but that extra 1 covers a huge number of cases, because it is inherited by every Collection. So I can justify to myself why having that extra 1 method is worth it, and why accepting the inconsistency of going no further is acceptable.

Do others disagree? Is N+1 [Collections.parallelStream() only] the practical choice here? Or should we go for the purity of N [rely on Stream.parallel()]? Or the convenience and consistency of 2N [parallel versions of all factories]? Or is there some even better N+3 [Collections.parallelStream() plus other special cases], for some other specially chosen cases we want to give special support to?

Brian Goetz stands by this position in the later discussion about Arrays.parallelStream():

I still really like Collection.parallelStream; it has huge discoverability advantages, and offers a pretty big return on API surface area -- one more method, but provides value in a lot of places, since Collection will be a really common case of a stream source.
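The "N+1" point can be sketched as follows: one method on Collection is inherited by every collection type, while a non-collection factory such as IntStream.range has no parallel twin and relies on parallel(). The class and helper names below are hypothetical and exist only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.stream.IntStream;

class InheritedFactoryDemo {
    // Works for any Collection, because parallelStream() is declared on the interface.
    static long parallelSum(Collection<Integer> source) {
        return source.parallelStream().mapToLong(Integer::longValue).sum();
    }

    // IntStream.range has no parallel counterpart, so the mode is set with parallel().
    static long parallelRangeSum(int startInclusive, int endExclusive) {
        return IntStream.range(startInclusive, endExclusive).parallel().asLongStream().sum();
    }

    public static void main(String[] args) {
        // Three unrelated collection types, one inherited method.
        System.out.println(parallelSum(Arrays.asList(1, 2, 3)));
        System.out.println(parallelSum(new HashSet<>(Arrays.asList(1, 2, 3))));
        System.out.println(parallelSum(new ArrayDeque<>(Arrays.asList(1, 2, 3))));
        System.out.println(parallelRangeSum(1, 4));
    }
}
```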

parallelStream() is more performant

Brian Goetz:

Direct version [parallelStream()] is more performant, in that it requires less wrapping (to turn a stream into a parallel stream, you have to first create the sequential stream, then transfer ownership of its state into a new Stream.)

In response to Kevin Bourrillion's skepticism about whether the effect is significant, Brian again:

Depends how seriously you are counting. Doug counts individual object creations and virtual invocations on the way to a parallel operation, because until you start forking, you're on the wrong side of Amdahl's law -- this is all "serial fraction" that happens before you can fork any work, which pushes your breakeven threshold further out. So getting the setup path for parallel ops fast is valuable.

Doug Lea follows up, but hedges his position:

People dealing with parallel library support need some attitude adjustment about such things. On a soon-to-be-typical machine, every cycle you waste setting up parallelism costs you say 64 cycles. You would probably have had a different reaction if it required 64 object creations to start a parallel computation.

That said, I'm always completely supportive of forcing implementors to work harder for the sake of better APIs, so long as the APIs do not rule out efficient implementation. So if killing parallelStream is really important, we'll find some way to turn stream().parallel() into a bit-flip or somesuch.

Indeed, the later discussion about Arrays.parallelStream() takes notice of the lower cost of Stream.parallel().
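As a rough illustration of the two setup paths being compared, the sketch below builds a parallel stream directly from a spliterator and, alternatively, builds a sequential one and then flips it. It uses the public StreamSupport.stream(Spliterator, boolean) factory; the class and method names are hypothetical, and the sketch says nothing about the actual cost of either path in any given JDK build.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class DirectFactoryDemo {
    // Direct path: the stream is created in parallel mode from the start.
    static <T> Stream<T> directParallel(Collection<T> source) {
        return StreamSupport.stream(source.spliterator(), true);
    }

    // Indirect path: create a sequential stream first, then switch its mode.
    static <T> Stream<T> flippedParallel(Collection<T> source) {
        return StreamSupport.stream(source.spliterator(), false).parallel();
    }

    public static void main(String[] args) {
        Collection<String> words = Arrays.asList("a", "b", "c"); // hypothetical sample data
        System.out.println(directParallel(words).isParallel());
        System.out.println(flippedParallel(words).isParallel());
    }
}
```

Both paths end in the same mode; the mailing-list debate was only about how much work happens before any forking starts.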

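stream().parallel() statefulness complicates the future

At the time of the discussion, switching a stream from sequential to parallel and back could be interleaved with other stream operations. Brian Goetz, on behalf of Doug Lea, explains why sequential/parallel mode switching may complicate future development of the Java platform: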

I'll take my best stab at explaining why: because it (like the stateful methods (sort, distinct, limit)) which you also don't like, move us incrementally farther from being able to express stream pipelines in terms of traditional data-parallel constructs, which further constrains our ability to to map them directly to tomorrow's computing substrate, whether that be vector processors, FPGAs, GPUs, or whatever we cook up.

Filter-map-reduce map[s] very cleanly to all sorts of parallel computing substrates; filter-parallel-map-sequential-sorted-limit-parallel-map-uniq-reduce does not.

So the whole API design here embodies many tensions between making it easy to express things the user is likely to want to express, and doing is in a manner that we can predictably make fast with transparent cost models.

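This mode switching was removed after further discussion. In the current version of the library, a stream pipeline is either sequential or parallel; the last call to sequential()/parallel() wins. Besides side-stepping the statefulness problem, this change also improved the performance of using parallel() to set up a parallel pipeline from a sequential stream factory.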
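The "last call wins" rule can be observed with isParallel(). The following is a minimal sketch with hypothetical class and method names; the intermediate operations are arbitrary placeholders.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class LastModeWinsDemo {
    // Mode calls are interleaved with other operations; the final parallel() decides.
    static boolean endsWithParallel(List<Integer> data) {
        Stream<Integer> pipeline = data.stream()
                .parallel()
                .map(x -> x + 1)
                .sequential()
                .filter(x -> x > 0)
                .parallel();
        return pipeline.isParallel();
    }

    // Starts from parallelStream(), but the final sequential() decides.
    static boolean endsWithSequential(List<Integer> data) {
        Stream<Integer> pipeline = data.parallelStream()
                .map(x -> x * 2)
                .sequential();
        return pipeline.isParallel();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3); // hypothetical sample data
        System.out.println(endsWithParallel(data));   // true
        System.out.println(endsWithSequential(data)); // false
    }
}
```

The mode applies to the whole pipeline, so the map step in endsWithParallel does not run in parallel "only up to" the sequential() call.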

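Exposing parallelStream() as a first-class citizen improves programmers' perception of the library, leading them to write better code

Brian Goetz again, in response to Tim Peierls's argument that Stream.parallel() allows programmers to understand streams sequentially before going parallel: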

I have a slightly different viewpoint about the value of this sequential intuition -- I view the pervasive "sequential expectation" as one if the biggest challenges of this entire effort; people are constantly bringing their incorrect sequential bias, which leads them to do stupid things like using a one-element array as a way to "trick" the "stupid" compiler into letting them capture a mutable local, or using lambdas as arguments to map that mutate state that will be used during the computation (in a non-thread-safe way), and then, when its pointed out that what they're doing, shrug it off and say "yeah, but I'm not doing it in parallel."

We've made a lot of design tradeoffs to merge sequential and parallel streams. The result, I believe, is a clean one and will add to the library's chances of still being useful in 10+ years, but I don't particularly like the idea of encouraging people to think this is a sequential library with some parallel bags nailed on the side.
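The one-element-array trick mentioned in the quote can be sketched like this, next to a reduction that does not depend on shared mutable state. The class and method names are hypothetical. The unsafe variant gives the expected result when run sequentially, but it contains a data race in parallel mode, so its parallel result is not guaranteed.

```java
import java.util.stream.IntStream;

class SequentialBiasDemo {
    // Anti-pattern: a one-element array smuggles a mutable local into the lambda.
    // Fine sequentially; a data race (and possibly a wrong sum) in parallel.
    static int unsafeSum(int n, boolean parallel) {
        int[] total = {0};
        IntStream range = IntStream.rangeClosed(1, n);
        if (parallel) {
            range = range.parallel();
        }
        range.forEach(i -> total[0] += i); // unsynchronized write to shared state
        return total[0];
    }

    // Mode-agnostic alternative: let the library perform the reduction.
    static int safeSum(int n, boolean parallel) {
        IntStream range = IntStream.rangeClosed(1, n);
        if (parallel) {
            range = range.parallel();
        }
        return range.sum();
    }

    public static void main(String[] args) {
        System.out.println(unsafeSum(100_000, false)); // deterministic
        System.out.println(unsafeSum(100_000, true));  // may differ from run to run
        System.out.println(safeSum(100_000, true));    // same value in either mode
    }
}
```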

For more on "java - Why does Collection.parallelStream() exist when .stream().parallel() does the same thing?", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/57141262/
