java - 分析: Performance of ForkJoinPool

问题

由于Fork-Join似乎是当前的炒作，并且在许多答案中都被推荐，我想:为什么不对它的实际速度进行一些研究？

为了衡量这一点，我编写了一个小程序(请参见下面的代码)，进行一些数字加法并使用各种参数(包括线程数，fork-depth和fork-spread)进行 fork ，然后测量执行时间，尤其是执行时间。实际计算所花费的时间与 fork 所花费的时间。

摘要答案

尽管实现得当，但是ForkJoin是并行执行任务的极低效率的方法，因为每个fork的成本都很高。天真的问题优化实现可以轻松地归档99％的线程执行时间(这超过了用Fork-Join衡量的所有结果)，因此，这种实现始终比Fork-Join实现快。另外，如果每个fork的实际任务很小，那么Fork-Join实现可能比单线程线性实现要慢得多。

因此，Fork-Join只是一个问题，即它是否对代码的体系结构有所帮助，因为它与其他实现相比没有任何性能上的好处。因此，仅在以下情况下才应使用Fork-Join:

性能并不重要，任务经常需要等待其他任务的结果继续。因此，从根本上来说，如果Fork-Join结构大大简化了任务，那么它就变得简单了。

实际任务大大超过了 fork 的成本，因此损失可以忽略不计。在我的测试中，添加2个值的循环每个 fork 必须循环至少10000次才能获得合理的性能。

编辑:请参阅here以获取我所指向的更深入的分析。

测试设置

在我的程序中，我有一个RecursiveTask计算给定N的斐波那契数列，这将实际计算减少到3个赋值和1个加法。对于任何给定的CPU，这应该是次要的任务。

在测试中，我改变了线程数量，每个任务的 fork 数量以及Fibonacci循环的长度。另外，我使用async参数进行了一些测试，但是将此参数设置为false仅显示了计算时间的少量减少，因此我跳过了这一点。由于结果没有显着差异，因此也几乎跳过了传播参数(每叉货叉)。

通常，计算时间非常稳定，实际花费在任务上的时间百分比通常相差不到1％，因此，每个测试集在原本空闲的系统上已运行了大约5次(如果数量不稳定，则运行更多次)。 4个内核(+4个超内核)，然后选择中值执行时间。

已通过各种测试变量验证了正确的执行，尤其是已验证所使用的实际线程数与初始给定的并行度参数永不不同。

详细测试结果

在哪里:

Time total是从主线程角度来看整个计算所花费的总时间。

Time task是在所有合并的 fork 中实际计算斐波那契数列所花费的时间。

Time task percentage是线程带来的相对 yield (时间任务/总时间)。

spread->depth是(设置)传播(每叉 fork )和(计算出的) fork 深度。

threads是实际使用的线程数量。

task-time/thread是每个线程实际用于整体计算斐波那契数列的时间。

传播->深度测试:

Time total: 8766.670 ms, time task: 1717.418 ms ( 19.59%), spread->depth:  2->26, thread#: 1, task-time/thread: 19.59%
Time total: 7872.244 ms, time task: 1421.478 ms ( 18.06%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.06%
Time total: 7336.052 ms, time task: 1280.036 ms ( 17.45%), spread->depth: 100-> 4, thread#: 1, task-time/thread: 17.45%

结论: fork 数只产生很小的影响(更少的 fork =更好)，实现似乎相当复杂。在其他设置下也收集了类似的结果，因此在此我将其跳过。

Fib(0)(几乎所有时间都花在了 fork 上)

Time total: 7866.777 ms, time task: 1421.488 ms ( 18.07%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.07%
Time total: 7085.142 ms, time task: 1349.207 ms ( 19.04%), spread->depth: 10-> 8, thread#: 2, task-time/thread:  9.52%
Time total: 6580.609 ms, time task: 1476.467 ms ( 22.44%), spread->depth: 10-> 8, thread#: 4, task-time/thread:  5.61%

结论:完成一项非常小的任务，大部分时间都花在了 fork 上，使单线程实现比任何Fork-Join设置快大约5倍。即使有多个线程，使用Fork-Join也无法获得任何性能提升。

纤维(100)

Time total: 12487.634 ms, time task: 5707.720 ms ( 45.71%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 45.71%
Time total:  8386.855 ms, time task: 5768.881 ms ( 68.78%), spread->depth: 10-> 8, thread#: 2, task-time/thread: 34.39%
Time total:  7078.769 ms, time task: 6086.997 ms ( 85.99%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 21.50%

结论:似乎已经接近单线程执行的收支平衡点，而多线程开始产生影响。单线程实现仍然比任何Fork-Join设置都快。

Fib(1000)

Time total:  5941.344 ms, time task:  5228.258 ms ( 88.00%), spread->depth: 10-> 7, thread#: 1, task-time/thread: 88.00%
Time total:  3160.818 ms, time task:  5244.241 ms (165.91%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 82.96%
Time total: 16301.697 ms, time task: 53351.694 ms (327.28%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 81.82%

结论:在这里，时间开始趋于稳定，以近乎线性的增益实现多线程执行，而每个线程的计算时间仍然约有20％花费在 fork 上。尽管此时 fork 可以通过线程提高性能，但朴素的实现仍然会明显更快。

菲伯(10000)

Time total:  5204.786 ms, time task:  5119.133 ms ( 98.35%), spread->depth: 10-> 6, thread#: 1, task-time/thread: 98.35%
Time total: 26033.889 ms, time task: 51084.118 ms (196.22%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 98.11%
Time total: 13183.573 ms, time task: 51637.471 ms (391.68%), spread->depth: 10-> 7, thread#: 4, task-time/thread: 97.92%

结论:在此数字下，计算的成本超过了 fork 的成本。尽管幼稚的实现仍然会稍微快一些，但是如果以另一种方式来实现该任务更加困难，那么 fork 所造成的损失就可以忽略不计。

代码

public class Test {

  static final int NUM_THREADS = 4;
  static final int SPREAD = 10;
  static final int LOOPS = 4000000;
  static final int CALCULATION_N = 10000;
  static final boolean DO_ASYNC = true;
  //---
  static final long MAX_DEPTH = Math.round(Math.log(LOOPS) / Math.log(SPREAD)); // try to have the execution take about the same time

  private static class Task extends RecursiveTask<Integer> {

    final static AtomicLong timeExecute = new AtomicLong(0);
    final static AtomicLong totalLoops = new AtomicLong(0);
    final long depth;

    public Task(final long depth) {
      this.depth = depth;
    }

    @Override
    protected Integer compute() {
      if (depth < MAX_DEPTH) {
        final Task[] subTasks = new Task[SPREAD];
        for (int i = 0; i < subTasks.length; ++i) {
          subTasks[i] = new Task(depth + 1);
        }
        try {
          invokeAll(subTasks);
          final long startTime = System.nanoTime();
          int result = 0;
          for (final Task task : subTasks) {
            if (task.isCompletedNormally()) {
              result += task.get();
            }
          }
          timeExecute.addAndGet(System.nanoTime() - startTime);
          return result;
        } catch (Exception e) {
          this.completeExceptionally(e);
          return null;
        }
      } else {
        totalLoops.incrementAndGet();
        final long startTime = System.nanoTime();
        int a = 0, b = 1, h;
        for (int n = 0; n < CALCULATION_N; ++n) {
          h = b;
          b = a + b;
          a = h;
        }
        timeExecute.addAndGet(System.nanoTime() - startTime);
        return b;
      }
    }
  }

  public static void main(String[] args) {
    final AtomicInteger threadCount = new AtomicInteger(0);
    final ForkJoinPool pool = new ForkJoinPool(NUM_THREADS, new ForkJoinPool.ForkJoinWorkerThreadFactory() {
      @Override
      public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
        threadCount.getAndIncrement();
        final ForkJoinWorkerThread result = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
        result.setPriority(Thread.MIN_PRIORITY);
        return result;
      }
    }, null, DO_ASYNC);
    final long startTime = System.nanoTime();
    final Integer result = pool.invoke(new Task(0));
    final double duration = ((double) (System.nanoTime() - startTime)) / 1000000.0;
    final double executionDuration = ((double) Task.timeExecute.get()) / 1000000.0;
    final double executionPercent = executionDuration / duration * 100.0;
    final double executionPercentPerThread = executionPercent / ((double) NUM_THREADS);

    System.out.println("Completed: " + result + " in " + Task.totalLoops.get() + " loops.");
    System.out.println(String.format("Time total: %8.3f ms, time task: %8.3f ms (%6.2f%%), spread->depth: %2d->%2d, thread#: %1d, task-time/thread: %5.2f%%", duration, executionDuration, executionPercent, SPREAD, MAX_DEPTH, threadCount.get(), executionPercentPerThread));
  }
}

随时指出任何错误或提出改进建议。对于某些奖励积分，我将接受最有值(value)的答案。

最佳答案

意见建议:

打印制作的 fork 数量+完成工作的成本估算(即，如果您切换到 fork ，则将添加数量或BigInteger的长度之和)。这个比例将显示您的 fork 策略有多有效，并使您了解什么是最小的工作规模才有意义。

检查您的算法-斐波纳契数呈指数增长，您的任务返回整数，因此您很快就会溢出。

因此，目标是选择一个阈值，该阈值将表明是否 fork :

One of the main things to consider when implementing an algorithm using fork/join parallelism is choosing the threshold which determines whether a task will execute a sequential computation rather than forking parallel sub-tasks.

If the threshold is too large, then the program might not create enough tasks to fully take advantage of the available processors/cores.

If the threshold is too small, then the overhead of task creation and management could become significant.

In general, some experimentation will be necessary to find an appropriate threshold value. Source

这也可能很有用:How to determine the proper work division threshold of a fork-join task。

关于java - 分析: Performance of ForkJoinPool，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/20288379/

java - 分析: Performance of ForkJoinPool

上一篇：java - 为什么try-with-resources获取 `Stream<String>`时BufferedReader没有关闭？

下一篇：java - Java 中的对象池模式