c - Parallel program with OpenMP and PThreads is slower than the sequential one

Tags: c performance pthreads openmp

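I'm having trouble parallelizing the following matrix multiplication program. The optimized versions are slower than the sequential version, or only slightly faster. I have already looked for the mistake but can't find it... I also tested it on another machine and got the same results...

Thanks for your help.

Main: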

int main(int argc, char** argv){

    if((matrixA).size != (matrixB).size){
     fprintf(ResultFile,"\tError for %s and %s - Matrix A and B are not of the same size ...\n", argv[1], argv[2]);
    }
    else{
     allocateResultMatrix(&resultMatrix, matrixA.size, 0);

     if(*argv[5] == '1'){ /* Sequential execution */
      begin = clock();
      matrixMultSeq(&matrixA, &matrixB, &resultMatrix);
      end = clock();
     };

     if(*argv[5] == '2'){ /* Execution with OpenMP */
      printf("Max number of threads: %i \n",omp_get_max_threads());
      begin = clock();
      matrixMultOmp(&matrixA, &matrixB, &resultMatrix);
      end = clock();
     };

     if(*argv[5] == '3'){ /* Execution with PThreads */
      pthread_t  threads[NUMTHREADS];
      pthread_attr_t attr;
      int i;
      struct parameter arg[NUMTHREADS];

      pthread_attr_init(&attr); /* Initialize the attribute */

      begin = clock();

      for(i=0; i<NUMTHREADS; i++){ /* Initialize the individual threads */
       arg[i].id = i;
       arg[i].num_threads = NUMTHREADS;
       arg[i].dimension = matrixA.size;
       arg[i].matrixA = &matrixA;
       arg[i].matrixB = &matrixB;
       arg[i].resultMatrix = &resultMatrix;
       pthread_create(&threads[i], &attr, worker, (void *)(&arg[i]));
      }

      pthread_attr_destroy(&attr);

      for(i=0; i<NUMTHREADS; i++){ /* Wait for the threads to return */
       pthread_join(threads[i], NULL);
      }

      end = clock();
    }

    t=end - begin;
    t/=CLOCKS_PER_SEC;
    if(*argv[5] == '1')
      fprintf(ResultFile, "\tTime for sequential multiplication: %0.10f seconds\n\n", t);
    if(*argv[5] == '2')
      fprintf(ResultFile, "\tTime for OpenMP multiplication: %0.10f seconds\n\n", t);
    if(*argv[5] == '3')
      fprintf(ResultFile, "\tTime for PThread multiplication: %0.10f seconds\n\n", t);
    }
  }
}

void matrixMultOmp(struct matrix * matrixA, struct matrix * matrixB, struct matrix * resultMatrix){
  int i, j, k, l;
  double sum = 0;

  l = (*matrixA).size;
#pragma omp parallel for private(j,k) firstprivate (sum)
  for(i=0; i<=l; i++){
   for(j=0; j<=l; j++){
      sum = 0;
      for(k=0; k<=l; k++){
         sum = sum + (*matrixA).matrixPointer[i][k]*(*matrixB).matrixPointer[k][j];
      }
      (*resultMatrix).matrixPointer[i][j] = sum;
    }
  }
}

void mm(int thread_id, int numthreads, int dimension, struct matrix* a, struct matrix* b, struct matrix* c){
  int i,j,k;
  double sum;
  i = thread_id;
  while (i <= dimension) {
    for (j = 0; j <= dimension; j++) {
      sum = 0;
      for (k = 0; k <= dimension; k++) {
        sum = sum + (*a).matrixPointer[i][k] * (*b).matrixPointer[k][j];
      }
      (*c).matrixPointer[i][j] = sum;
    }
    i+=numthreads;
 }
}

void * worker(void * arg){
  struct parameter * p = (struct parameter *) arg;
  mm((*p).id, (*p).num_threads, (*p).dimension, (*p).matrixA, (*p).matrixB, (*p).resultMatrix);
  pthread_exit((void *) 0);
}

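Here is the output with timings:

Starting calculating resultMatrix for matrices/SimpleMatrixA.txt and matrices/SimpleMatrixB.txt ...
    Size of matrixA: 6 elements
    Size of matrixB: 6 elements
    Time for sequential multiplication: 0.0000030000 seconds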

Starting calculating resultMatrix for matrices/SimpleMatrixA.txt and matrices/SimpleMatrixB.txt ...
    Size of matrixA: 6 elements
    Size of matrixB: 6 elements
    Time for OpenMP multiplication: 0.0002440000 seconds

Starting calculating resultMatrix for matrices/SimpleMatrixA.txt and matrices/SimpleMatrixB.txt ...
    Size of matrixA: 6 elements
    Size of matrixB: 6 elements
    Time for PThread multiplication: 0.0006680000 seconds

Starting calculating resultMatrix for matrices/ShortMatrixA.txt and matrices/ShortMatrixB.txt ...
    Size of matrixA: 100 elements
    Size of matrixB: 100 elements
    Time for sequential multiplication: 0.0075190002 seconds

Starting calculating resultMatrix for matrices/ShortMatrixA.txt and matrices/ShortMatrixB.txt ...
    Size of matrixA: 100 elements
    Size of matrixB: 100 elements
    Time for OpenMP multiplication: 0.0076710000 seconds

Starting calculating resultMatrix for matrices/ShortMatrixA.txt and matrices/ShortMatrixB.txt ...
    Size of matrixA: 100 elements
    Size of matrixB: 100 elements
    Time for PThread multiplication: 0.0068080002 seconds

Starting calculating resultMatrix for matrices/LargeMatrixA.txt and matrices/LargeMatrixB.txt ...
    Size of matrixA: 1000 elements
    Size of matrixB: 1000 elements
    Time for sequential multiplication: 9.6421155930 seconds

Starting calculating resultMatrix for matrices/LargeMatrixA.txt and matrices/LargeMatrixB.txt ...
    Size of matrixA: 1000 elements
    Size of matrixB: 1000 elements
    Time for OpenMP multiplication: 10.5361270905 seconds

Starting calculating resultMatrix for matrices/LargeMatrixA.txt and matrices/LargeMatrixB.txt ...
    Size of matrixA: 1000 elements
    Size of matrixB: 1000 elements
    Time for PThread multiplication: 9.8952226639 seconds

Starting calculating resultMatrix for matrices/HugeMatrixA.txt and matrices/HugeMatrixB.txt ...
    Size of matrixA: 5000 elements
    Size of matrixB: 5000 elements
    Time for sequential multiplication: 1981.1383056641 seconds

Starting calculating resultMatrix for matrices/HugeMatrixA.txt and matrices/HugeMatrixB.txt ...
    Size of matrixA: 5000 elements
    Size of matrixB: 5000 elements
    Time for OpenMP multiplication: 2137.8527832031 seconds

Starting calculating resultMatrix for matrices/HugeMatrixA.txt and matrices/HugeMatrixB.txt ...
    Size of matrixA: 5000 elements
    Size of matrixB: 5000 elements
    Time for PThread multiplication: 1977.5153808594 seconds

Best answer

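As mentioned in the comments, your first and main problem is the use of clock(). It returns the processor time consumed by the program, whereas what you want is the elapsed (wall-clock) time. In sequential code the two are the same, but with multiple cores they are completely different: on most POSIX systems the processor time is summed over all threads, so work spread across several cores still adds up to roughly the same total. Fortunately, OpenMP has you covered: use the function omp_get_wtime() instead.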
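For the PThreads variant, which may not be built with OpenMP at all, the same idea can be sketched with the POSIX call clock_gettime() in place of omp_get_wtime(). This is a minimal sketch under that assumption, not the asker's code; the helper names wall_seconds and cpu_seconds are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime under strict -std modes */
#include <time.h>

/* Elapsed (wall-clock) seconds read from a monotonic clock.
 * Hypothetical helper; plays the role of omp_get_wtime(). */
double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Processor seconds consumed by the process so far, which is what
 * the clock()-based timing in the question measures. */
double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Taking wall_seconds() before pthread_create and again after the last pthread_join gives the elapsed time; comparing it with the cpu_seconds() difference over the same region should show the CPU figure staying near the sequential time while the wall-clock figure drops, provided the threads really run on separate cores.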

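Finally, you need larger matrices to see the benefit of multithreading. If the overhead of creating and managing threads costs more than the actual work the threads do, you will never see any benefit from parallelism. Timing a 6x6 matrix multiplication is therefore pointless. I would start at 1000x1000 and also check at least 2000x2000 and 8000x8000.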
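One way to keep small inputs from paying the threading overhead is OpenMP's if clause, which runs the loop on a single thread when its condition is false. The sketch below assumes flat row-major arrays rather than the asker's struct matrix, and PARALLEL_THRESHOLD is a hypothetical cutoff that would have to be tuned by measurement.

```c
#include <stddef.h>

/* Hypothetical cutoff: below this dimension the loop runs on one thread. */
#define PARALLEL_THRESHOLD 200

/* c = a * b for n x n matrices stored row-major in flat arrays.
 * Without OpenMP enabled the pragma is ignored and the loop runs sequentially. */
void matmul(int n, const double *a, const double *b, double *c) {
    #pragma omp parallel for if(n >= PARALLEL_THRESHOLD)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0; /* declared in the loop body, so private per thread */
            for (int k = 0; k < n; k++) {
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            }
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

Declaring j, k and sum inside the loops makes them private automatically, so no private or firstprivate clauses are needed.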

Source: a similar question on Stack Overflow about a parallel C program using OpenMP and PThreads being slower than the sequential one: https://stackoverflow.com/questions/34223962/
