c++ - Is it normal for every thread in OpenMP to do the same amount of work?

Tags: c++ multithreading performance parallel-processing openmp

For the code below, I have measured the execution time of each thread, and it seems odd to me that in every run, whether I use the static or the dynamic schedule, each thread takes almost exactly the same time. Is this expected in OpenMP? Do we ever have a situation where one or more threads perform more work than the others?
Another thing I do not understand is that the execution time is the same with both the static and the dynamic schedule. I'm afraid I may be measuring the time in the wrong way.

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <omp.h>
#include <fstream>
#include <cfloat>
#include <chrono>
using namespace std;
using namespace chrono; 
int main()
{
    const int N = 100000;
    ofstream result{"Result.txt"};
    vector<vector<double>> c;
    default_random_engine g(0);
    uniform_real_distribution<double> d(0.0, nextafter(1.0, DBL_MAX));
    c.reserve(N);

    for (int i = 0; i < N; i++) {
        const unsigned size = static_cast<unsigned>(pow(10, i % 4));
        vector<double> a;
        a.reserve(size);

        for (unsigned j = 0; j < size; j++) {
            const double number = d(g);
            a.push_back(number);
        }

        c.push_back(std::move(a));
    }

    double sum = 0.0;
    vector<double> b(N);
    const int total_threads = 4;  // must be a compile-time constant for the array below
    double time_taken_by_threads[total_threads];
    auto t1= high_resolution_clock::now();
    
    #pragma omp parallel num_threads(total_threads) firstprivate(N) shared(b, c, sum)
    {
        int threadID = omp_get_thread_num();
        double start = omp_get_wtime();
     
    
        #pragma omp for reduction(+:sum) schedule(dynamic)
        for (int i = 0; i < N ; i++) {
            double sumLocal = 0.0;

            for (size_t j = 0; j < c[i].size(); j++) {
                sumLocal += pow(c[i][j], 2);
            }

            const double n = sqrt(sumLocal);
            b[i] = n;

            sum += sumLocal;
        }
        
      
        double end = omp_get_wtime();
        time_taken_by_threads[threadID] = end - start;
    }
      
    
    auto t2 = high_resolution_clock::now();
    auto diff = duration_cast<milliseconds>(t2 - t1);
    cout << "Total time taken: " << diff.count() << " ms" << endl;

    for (int i = 0; i < total_threads; i++) {
        cout << "Thread " << i << " time: " << time_taken_by_threads[i] << " s" << endl;
    }
}

Best answer

TL;DR: there is an implicit barrier at the end of #pragma omp for reduction(+:sum).

I'm afraid that maybe, I'm calculating time in a wrong manner.


Indeed, it will always give similar results, because #pragma omp for introduces an implicit barrier at the end of the loop:
    double start = omp_get_wtime();
    #pragma omp for reduction(+:sum) schedule(dynamic)
    for (int i = 0; i < N; i++) {
        // ....
    }
    // <--- threads will wait here for one another.
    double end = omp_get_wtime();
    time_taken_by_threads[threadID] = end - start;
Consequently, the threads that finish first will still wait for those that have not finished yet. To remove that implicit barrier, you can use the nowait clause:
#pragma omp for reduction(+:sum) schedule(dynamic) nowait
Even though it is not a problem in this code, be careful when removing the implicit barrier, because it can lead to race conditions. So for future usage, you can apply the following pattern to measure the time taken by each thread while still avoiding potential race conditions:
    double start = omp_get_wtime();
    // The parallel loop with nowait
    double end = omp_get_wtime();
    #pragma omp barrier
    time_taken_by_threads[threadID] = end - start;
That being said, even with this change, the time taken by each thread should be more or less the same. I will explain below why that is the case.

For the following code, I have calculated time execution for each thread and it is odd to me that with all runs I get using the static or dynamic schedule, each thread has nearly exact time invocation. Is this something expected in OpenMP?


It is expected. When using the static schedule, OpenMP tries to divide the loop iterations among the threads as evenly as possible. In the OpenMP 5.1 standard, one can read the following about the for schedule clause:

When kind is static, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.


In your case, with a static distribution and the default chunk size, each of the 4 threads will compute 25000 iterations (i.e., 100000/4).
If we analyze the parallel loop:
#pragma omp for reduction(+:sum) schedule(static)
for (int i = 0; i < N ; i++) {
    double sumLocal = 0.0;

    for (int j = 0; j < c[i].size();j++) {
        sumLocal += pow(c[i][j], 2);
    }

    const double n = sqrt(sumLocal);
    b[i] = n;

    sum += sumLocal;
}
we can see that the work per iteration follows the same repeating pattern (the inner array sizes cycle through 1, 10, 100 and 1000 as i % 4 changes), and that the computation is mostly CPU-bound. Hence, each thread's contiguous block of 25000 iterations contains the same mix of small and large iterations, and every thread can be expected to take about the same amount of time.
Regarding the dynamic schedule, one can read the following in the OpenMP 5.1 standard:

When kind is dynamic, the iterations are distributed to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.


Hence, since the chunk size defaults to 1 and we already know that the iterations, taken in blocks, cost about the same amount of time, the threads can also be expected to take about the same amount of time.

Do we ever have the situation that one or more threads perform more jobs?


Sure, you just need to create a situation that causes load imbalance, for example:
#pragma omp parallel for schedule(static)
  for(int i=0; i<N; i++){
      for(int k=0; k<i; k++){
          // some computation  
       }
   }
If you look carefully, you can see that the work of the inner loop grows in a triangular shape (with N = SIZE):
 *k/i 0 1 2 3 4 5 ... N-1
 *  0 - x x x x x ... x 
 *  1 - - x x x x ... x 
 *  2 - - - x x x ... x
 *  3 - - - - x x ... x
 *  4 - - - - - x ... x
 *  5 - - - - - - ... x
 *  . - - - - - - ... x
 *  . - - - - - - ... x 
 *N-1 - - - - - - ... -    
 *  N - - - - - - ... - 
Hence, for 4 threads and an N such that N % 4 == 0, the first thread is assigned the first N/4 iterations of the loop, the second thread the next N/4, and so on. Consequently, the first thread computes outermost-loop iterations with fewer innermost-loop iterations, which leads to load imbalance and, eventually, to a bigger difference among the times that the threads take to finish their parallel work.
You can simulate that scenario in your code as follows:
#pragma omp for reduction(+:sum) schedule(static) nowait
for (int i = 0; i < N ; i++) {
    double sumLocal = 0.0;

    for (int j = i; j < static_cast<int>(c[i].size()); j++) {
        sumLocal += pow(c[i][j], 2);
    }
    const double n = sqrt(sumLocal);
    b[i] = n;

    sum += sumLocal;
}

Another thing which I do not understand is that time execution for both using static and dynamic schedule is the same.


As we have already explained, given the nature of the parallel task assigned to each thread, this is to be expected.

Regarding "c++ - Is it normal for every thread in OpenMP to do the same amount of work?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66716454/
