c++ - No speedup with OpenMP

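I am using OpenMP to obtain a near-linear speedup for an algorithm. Unfortunately, I cannot get the speedup I want.

So, to understand what is wrong in my code, I wrote another, simpler program just to double-check that the speedup is achievable in principle on my hardware.

This is the toy example I wrote: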

#include <omp.h>
#include <algorithm>
#include <iostream>

int main () {
      int number_of_threads = 1;
      int n = 600;
      int m = 50;
      int N = n/number_of_threads;   // outer-loop trip count per thread
      int time_limit = 600;          // seconds
      double total_clock = omp_get_wtime();
      int time_flag = 0;             // shared stop flag, only accessed atomically

      #pragma omp parallel num_threads(number_of_threads)
       {
          int thread_id = omp_get_thread_num();
          int iteration_number_local = 0;
          int stop = 0;              // thread-local copy of time_flag
          // Thread-private work arrays
          double *C = new double[n]; std::fill(C, C+n, 3.0);
          double *D = new double[n]; std::fill(D, D+n, 3.0);
          double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

          while (stop == 0){
                for (int i = 0; i < N; i++)
                    for(int z = 0; z < m; z++)
                        for(int x = 0; x < n; x++)
                            for(int c = 0; c < n; c++){
                                CD[c] = C[z]*D[x];
                                C[z] = CD[c] + D[x];
                            }
                iteration_number_local++;
                // The first thread to pass the time limit tells all threads to stop.
                if ((omp_get_wtime() - total_clock) >= time_limit) {
                    #pragma omp atomic write
                    time_flag = 1;
                }
                #pragma omp atomic read
                stop = time_flag;
           }
       #pragma omp critical
       std::cout<<"I am "<<thread_id<<" and I got "<<iteration_number_local<<" iterations."<<std::endl;
       delete[] C; delete[] D; delete[] CD;
       }
    }

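I want to stress again that this code is only a toy example for trying to see the speedup: the first for loop gets shorter as the number of parallel threads increases (because N decreases).

However, when I go from 1 thread to 2-4 threads, the number of iterations doubles as expected; but this is not the case when I use 8, 10, or 20 threads: the number of iterations does not grow linearly with the number of threads.

Could you help me with this? Is the code correct? Should I expect a near-linear speedup?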

Results

Running the code above I got the following results.

1 thread: 23 iterations.

20 threads: 397-401 iterations per thread (instead of 420-460).

Best answer

Your way of measuring is flawed, especially for a small number of iterations.

1 thread: 3 iterations.

3 reported iterations actually means that 2 iterations finished in less than 120 s. The third one took longer. The time for 1 iteration is between 40 and 60 s.

2 threads: 5 iterations per thread (instead of 6).

4 iterations finished in less than 120 s. The time for 1 iteration is between 24 and 30 s.

20 threads: 40-44 iterations per thread (instead of 60).

40 iterations finished in less than 120 s. The time for 1 iteration is between 2.9 and 3 s.

As you can see, your results do not actually contradict linear speedup.
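The bounds quoted above follow from simple arithmetic, which can be sketched as below. This is only an illustration: the names `IterationTimeBounds`, `iteration_time_bounds` and `consistent_with_linear_speedup` are made up for this sketch, and it assumes that every iteration takes the same time and that the clock starts together with the first iteration.

```cpp
#include <cassert>

// Bounds on the duration of one iteration, inferred from a timed run.
struct IterationTimeBounds {
    double lower;  // seconds, inclusive
    double upper;  // seconds, exclusive
};

// A thread that reports `reported` iterations after `time_limit` seconds
// finished (reported - 1) iterations before the limit and the last one at or
// after it, so time_limit/reported <= t < time_limit/(reported - 1).
// Assumption: all iterations take the same time; requires reported >= 2.
IterationTimeBounds iteration_time_bounds(double time_limit, int reported) {
    assert(reported >= 2);
    return {time_limit / reported, time_limit / (reported - 1)};
}

// With `threads` threads each iteration holds 1/threads of the work, so
// perfect linear speedup predicts the single-thread bounds divided by
// `threads`. The measurement is consistent with linear speedup if the
// measured interval overlaps the predicted one.
bool consistent_with_linear_speedup(IterationTimeBounds single,
                                    IterationTimeBounds parallel,
                                    int threads) {
    assert(threads >= 1);
    const double expected_lower = single.lower / threads;
    const double expected_upper = single.upper / threads;
    return parallel.lower < expected_upper && expected_lower < parallel.upper;
}
```

For example, `iteration_time_bounds(120.0, 3)` gives 40 to 60 s and `iteration_time_bounds(120.0, 5)` gives 24 to 30 s. Linear speedup on 2 threads predicts 20 to 30 s, which overlaps the measured interval, so the check returns true.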

It would be simpler and more accurate to execute the outer loop once and time it; you will likely see almost perfect linear speedup.
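A rough sketch of that measurement, under several assumptions: `one_outer_iteration` is a hypothetical stand-in for the real loop body, `TimedRun`, `time_fixed_work` and `speedup` are names invented here, and `std::chrono::steady_clock` is used in place of `omp_get_wtime()` so that the sketch needs no OpenMP header. It only measures anything meaningful when compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang).

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for one outer-loop iteration of the real algorithm.
// The recurrence converges to 2.0, so the values stay bounded.
double one_outer_iteration(int i, int inner_steps) {
    double value = 3.0 + i;
    for (int s = 0; s < inner_steps; s++)
        value = 0.5 * value + 1.0;
    return value;
}

struct TimedRun {
    double seconds;    // wall-clock time of the parallel region
    int threads_used;  // threads that actually took part
    double checksum;   // sum of all results, so the work cannot be optimized away
};

// Runs the outer loop exactly once, with total_iterations / threads
// iterations per thread (N in the toy example), and times the whole region.
// If threads_used is smaller than `threads`, less than the full work was
// executed and the timing should be discarded.
TimedRun time_fixed_work(int threads, int total_iterations, int inner_steps) {
    assert(threads >= 1 && total_iterations % threads == 0);
    const int per_thread = total_iterations / threads;
    double sum = 0.0;
    int team = 0;
    const auto start = std::chrono::steady_clock::now();

    #pragma omp parallel num_threads(threads) reduction(+:sum, team)
    {
        team += 1;  // each participating thread counts itself once
        for (int i = 0; i < per_thread; i++)
            sum += one_outer_iteration(i, inner_steps);
    }  // implicit barrier: every thread has finished its share here

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {elapsed.count(), team, sum};
}

// Speedup relative to the single-thread run; linear speedup equals the
// number of threads.
double speedup(double seconds_one_thread, double seconds_n_threads) {
    return seconds_one_thread / seconds_n_threads;
}
```

Calling `time_fixed_work(1, n, steps)` and `time_fixed_work(p, n, steps)` with the same `n` and `steps`, then passing the two `seconds` values to `speedup`, gives a number that can be compared directly with `p`, without the rounding that comes from counting whole iterations against a time limit.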

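Some (non-exhaustive) reasons why you might not see linear speedup are:

Memory-bound performance. Not the case in your toy example with n = 1000. More generally: contention for a shared resource (main memory, caches, I/O).
Synchronization between threads (e.g. critical sections). Not the case in your toy example.
Load imbalance between threads. Not the case in your toy example.
Turbo mode uses lower frequencies when all cores are in use. This can happen in your toy example.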

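From your toy example, I would say that your approach to OpenMP could be improved by making better use of the high-level abstractions, e.g. the for worksharing construct (#pragma omp for).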
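As a minimal sketch of what that could look like, the loop below lets OpenMP distribute the outer iterations instead of computing N = n/number_of_threads by hand. `work_item` and `run_all_iterations` are hypothetical names, and the per-iteration work is a placeholder rather than the real algorithm.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-iteration work; stands in for the body of the outer loop.
double work_item(int i) {
    return 0.5 * i;
}

// Runs all `n` outer iterations. The worksharing loop splits the iteration
// range over the threads of the team, so it also works when `n` is not a
// multiple of the thread count or when fewer threads are granted than
// requested. Without OpenMP enabled the pragma is ignored and the loop
// simply runs serially.
std::vector<double> run_all_iterations(int n, int threads) {
    assert(n >= 0 && threads >= 1);
    std::vector<double> result(n);
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (int i = 0; i < n; i++)
        result[i] = work_item(i);  // each iteration writes only its own element
    return result;
}
```

With this form every iteration is executed exactly once regardless of the thread count, so the result of `run_all_iterations(600, 4)` is the same as that of `run_all_iterations(600, 1)`.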

More general advice would be too broad for this format and would need more specific information about the non-toy example.

A similar question about "c++ - No speedup with OpenMP" can be found on Stack Overflow: https://stackoverflow.com/questions/38630633/
