c++ - Is it normal for each thread in OpenMP to do the same amount of work?

Tags: c++ multithreading performance parallel-processing openmp


#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <omp.h>
#include <fstream>
#include <cfloat>
#include <chrono>
using namespace std;
using namespace chrono;

int main()
{
    const int N = 100000;
    ofstream result{"Result.txt"};
    vector<vector<double>> c;
    default_random_engine g(0);
    uniform_real_distribution<double> d(0.0, nextafter(1.0, DBL_MAX));

    // Build N rows whose lengths cycle through 1, 10, 100 and 1000.
    for (int i = 0; i < N; i++) {
        const unsigned size = pow(10, i % 4);
        vector<double> a;

        for (unsigned j = 0; j < size; j++) {
            const double number = d(g);
            a.push_back(number);
        }
        c.push_back(a);
    }

    double sum = 0.0;
    vector<double> b(N);
    const int total_threads = 4;
    // A vector instead of a variable-length array, which is not standard C++.
    vector<double> time_taken_by_threads(total_threads, 0.0);
    auto t1 = high_resolution_clock::now();
    #pragma omp parallel num_threads(total_threads) firstprivate(N) shared(b,c,sum)
    {
        int threadID = omp_get_thread_num();
        double start = omp_get_wtime();
        #pragma omp for reduction(+:sum) schedule(dynamic)
        for (int i = 0; i < N; i++) {
            double sumLocal = 0.0;

            for (size_t j = 0; j < c[i].size(); j++) {
                sumLocal += pow(c[i][j], 2);
            }

            const double n = sqrt(sumLocal);
            b[i] = n;

            sum += sumLocal;
        }
        double end = omp_get_wtime();
        time_taken_by_threads[threadID] = end - start;
    }
    auto t2 = high_resolution_clock::now();
    auto diff = duration_cast<milliseconds>(t2 - t1);
    cout << "The total job has been taken : " << diff.count() << endl;

    for (int i = 0; i < total_threads; i++) {
        cout << " Thread work " << time_taken_by_threads[i] << endl;
    }
}

TL;DR: You have an implicit barrier at the end of #pragma omp for reduction(+:sum).

I'm afraid that maybe, I'm calculating time in a wrong manner.

Indeed, it will always give similar results because of the #pragma omp for:

    double start = omp_get_wtime();
    #pragma omp for reduction(+:sum) schedule(dynamic)
    for (int i = 0; i < N ; i++) {
        // ....
    }
    // <--- threads will wait here for one another.
    double end = omp_get_wtime();
    time_taken_by_threads[threadID] = end - start;

To measure each thread on its own, add the nowait clause to the loop:

    #pragma omp for reduction(+:sum) schedule(dynamic) nowait

and read the end time before the threads synchronize:

    double start = omp_get_wtime();
    // The parallel loop with nowait
    double end = omp_get_wtime();
    #pragma omp barrier
    time_taken_by_threads[threadID] = end - start;
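
As a minimal sketch of the nowait measurement, the helper below times each thread's share of the loop. The names TimingResult and time_loop_per_thread are hypothetical and not part of the question's code, and std::chrono is used instead of omp_get_wtime only so that the sketch also compiles (single-threaded) when OpenMP is disabled.

```cpp
#include <cassert>
#include <chrono>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical result type: the reduction value plus one timing per thread.
struct TimingResult {
    double sum = 0.0;
    std::vector<double> seconds;
};

// Hypothetical helper: sums the squares of all elements of c and records
// how long each thread spent inside the work-sharing loop.
TimingResult time_loop_per_thread(const std::vector<std::vector<double>>& c,
                                  int threads) {
    assert(threads > 0);
    TimingResult r;
    r.seconds.assign(threads, 0.0);
    double sum = 0.0;
    const int n = static_cast<int>(c.size());
    using clock = std::chrono::steady_clock;

    #pragma omp parallel num_threads(threads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        const auto start = clock::now();
        #pragma omp for reduction(+:sum) schedule(dynamic) nowait
        for (int i = 0; i < n; i++) {
            double local = 0.0;
            for (double x : c[i]) local += x * x;
            sum += local;
        }
        // Because of nowait there is no barrier here, so each thread
        // stops its own clock as soon as it runs out of iterations.
        const auto end = clock::now();
        r.seconds[tid] = std::chrono::duration<double>(end - start).count();
    }
    // The parallel region ends with a barrier, so the reduction is complete.
    r.sum = sum;
    return r;
}
```

Calling time_loop_per_thread(c, 4) on the question's data should then show per-thread times that are allowed to differ, because no thread waits for the others before reading its end time.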

For the following code, I have calculated time execution for each thread and it is odd to me that with all runs I get using the static or dynamic schedule, each thread has nearly exact time invocation. Is this something expected in OpenMP?

In the OpenMP 5.1 standard, one can read the following about the schedule clause of the for construct:

When kind is static, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.

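In your case, with a static schedule:

#pragma omp for reduction(+:sum) schedule(static)
for (int i = 0; i < N ; i++) {
    double sumLocal = 0.0;

    for (int j = 0; j < c[i].size();j++) {
        sumLocal += pow(c[i][j], 2);
    }

    const double n = sqrt(sumLocal);
    b[i] = n;

    sum += sumLocal;
}

With no chunk_size and 4 threads, each thread receives roughly N/4 iterations. Because the row lengths in your data repeat with period 4, every thread also ends up with nearly the same amount of work.
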
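
The round-robin rule quoted from the standard can be sketched for an explicit chunk_size as follows. The function names are hypothetical, and the sketch deliberately does not cover the case without chunk_size, where the standard leaves the chunk size unspecified.

```cpp
#include <cassert>
#include <vector>

// Thread that executes iteration i under schedule(static, chunk_size):
// chunks are handed out round-robin in thread-number order.
int static_owner(int i, int chunk_size, int num_threads) {
    assert(i >= 0 && chunk_size > 0 && num_threads > 0);
    return (i / chunk_size) % num_threads;
}

// Number of iterations each thread receives for a loop of n iterations.
std::vector<int> static_iteration_counts(int n, int chunk_size, int num_threads) {
    std::vector<int> counts(num_threads, 0);
    for (int i = 0; i < n; i++) {
        counts[static_owner(i, chunk_size, num_threads)]++;
    }
    return counts;
}
```

For example, static_iteration_counts(100000, 25000, 4) assigns 25000 iterations to each of the 4 threads, which mirrors the even split one would typically expect from schedule(static) on this loop.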
Regarding the dynamic schedule, the OpenMP 5.1 standard says the following:

When kind is dynamic, the iterations are distributed to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.
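
To see why a dynamic schedule also ends up balanced for this workload, here is a simplified model rather than what an OpenMP runtime actually does: every iteration is one chunk, and the thread that becomes idle first takes the next one. The function names and the cost model (iteration i costs 10^(i % 4) units, matching the row lengths in the question) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model of schedule(dynamic, 1): returns the busy time per thread.
std::vector<long long> simulate_dynamic(const std::vector<long long>& cost,
                                        int num_threads) {
    assert(num_threads > 0);
    std::vector<long long> busy(num_threads, 0);
    for (long long c : cost) {
        // The thread that finishes earliest requests the next chunk.
        auto next = std::min_element(busy.begin(), busy.end());
        *next += c;
    }
    return busy;
}

// Cost pattern of the question: row i holds 10^(i % 4) elements.
std::vector<long long> question_costs(int n) {
    const long long pattern[4] = {1, 10, 100, 1000};
    std::vector<long long> cost(n);
    for (int i = 0; i < n; i++) cost[i] = pattern[i % 4];
    return cost;
}
```

In this model, the busiest and the least busy thread can differ by at most the cost of one iteration, so with N = 100000 the per-thread times come out nearly identical even without any barrier.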


Do we ever have the situation that one or more threads perform more jobs?

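Yes. Consider, for example, the following loop:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    for (int k = 0; k < i; k++) {
        // some computation
    }
}

If you look closely, you can see that the work of the inner loop grows in a triangular shape: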
 *k/i 0 1 2 3 4 5 ... N-1
 *  0 - x x x x x ... x 
 *  1 - - x x x x ... x 
 *  2 - - - x x x ... x
 *  3 - - - - x x ... x
 *  4 - - - - - x ... x
 *  5 - - - - - - ... x
 *  . - - - - - - ... x
 *  . - - - - - - ... x 
 *N-1 - - - - - - ... -    
 *  N - - - - - - ... - 
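So, for 4 threads and an N such that N % 4 = 0, thread 1 is assigned the first N/4 iterations of the loop, thread 2 the next N/4, and so on. Thread 1 therefore computes the outer-loop iterations that have the fewest inner-loop iterations. This leads to load imbalance and, eventually, to larger differences between the times the threads take to finish their share of the parallel work.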
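
The imbalance in the triangular loop can be checked with a small calculation. The sketch below assumes that outer iteration i costs i inner iterations and that iteration i is executed by thread (i / chunk_size) % num_threads; triangular_work is a hypothetical helper, not an OpenMP function.

```cpp
#include <cassert>
#include <vector>

// Inner-loop iterations executed by each thread for the triangular loop.
std::vector<long long> triangular_work(int n, int chunk_size, int num_threads) {
    assert(n >= 0 && chunk_size > 0 && num_threads > 0);
    std::vector<long long> work(num_threads, 0);
    for (int i = 0; i < n; i++) {
        work[(i / chunk_size) % num_threads] += i;  // iteration i runs i inner steps
    }
    return work;
}
```

With N = 8 and 4 threads, contiguous blocks of N/4 = 2 iterations give the threads 1, 5, 9 and 13 inner iterations, whereas a chunk size of 1 gives 4, 6, 8 and 10, which is noticeably more even.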

You can reproduce this kind of imbalance in your own code by, for example, starting the inner loop at j = i:

#pragma omp for reduction(+:sum) schedule(static) nowait
for (int i = 0; i < N ; i++) {
    double sumLocal = 0.0;

    for (int j = i; j < c[i].size();j++) {
        sumLocal += pow(c[i][j], 2);
    }
    const double n = sqrt(sumLocal);
    b[i] = n;

    sum += sumLocal;
}

Another thing which I do not understand is that time execution for both using static and dynamic schedule is the same.


Regarding "c++ - Is it normal for each thread in OpenMP to do the same amount of work?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66716454/

