c++ - Why is OpenMP slower than the serial version on Ubuntu 12.04

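I have read some other questions on this topic. However, they did not solve my problem.

I wrote the code below, and both the pthread version and the omp version come out slower than the serial version. I am very confused.

Compilation environment: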

Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Vendor ID:             AuthenticAMD
CPU family:            18
Model:                 1
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              3593.36
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

Compile command:

g++ -std=c++11 ./eg001.cpp -fopenmp

#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5
const int sizen = 256000000;

struct Data {
    double * pSinTable;
    long tid;
};

void * compute(void * p) {
    Data * pDt = (Data *)p;
    const int start = sizen * pDt->tid/NUM_THREADS;
    const int end = sizen * (pDt->tid + 1)/NUM_THREADS;
    for(int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double * sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;

    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for(int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }//for
    pthread_attr_destroy(&attr);
    for(int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }//for
    finish = clock();
    printf("from pthread: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
#   pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;

    pthread_exit(nullptr);
    return 0;
}

Output:

from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000

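I suspected that my code was the problem, so I did the same thing with pthread.

However, I was completely wrong, and now I wonder whether this is an Ubuntu problem with OpenMP/pthread.

A friend of mine also has an AMD CPU and Ubuntu 12.04 and ran into the same problem there, so I have some reason to believe the problem is not limited to my machine.

If anyone has the same problem or has any clue about it, thanks in advance.

In case the code is not good enough, I also ran a benchmark and pasted the results here: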

http://pastebin.com/RquLPREc

Benchmark URL: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html

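New information:

I ran the code on Windows with VS2012 (without the pthread version).

I used 1/10 of sizen, because Windows would not let me allocate such a large chunk of memory. The results are: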

from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this one is the improvement suggested by @FreeNickName)

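Does this indicate that it could be a problem with the Ubuntu OS?

The problem was solved by using the omp_get_wtime function, which is portable across operating systems. See Hristo Iliev's answer.

Some tests for the disputed points raised by FreeNickName.

(Sorry, I need to test this on Ubuntu, because the Windows machine belongs to a friend.)

--1-- Changed delete to delete [] (but no memset): (-std=c++11 -fopenmp)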

from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501

--2-- memset right after new: (-std=c++11 -fopenmp)

from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723
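
Tests 1 and 2 change two things in the allocation code: an array obtained with new[] has to be released with delete [], and the buffer is written once with memset before any timed loop runs. The likely reason for the memset (an assumption here, since the thread does not spell it out) is that Linux maps freshly allocated pages lazily, so whichever loop writes to the array first also pays for the page faults. A minimal sketch of both points follows; make_touched_buffer is a made-up helper name, and std::unique_ptr<double[]> is used only so that delete [] is called automatically.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper (not from the question): allocate n doubles and write
// to every byte once, so the pages are already mapped before timing starts.
std::unique_ptr<double[]> make_touched_buffer(std::size_t n) {
    // unique_ptr<double[]> releases the array with delete[], which is the
    // correct counterpart of new double[n].
    std::unique_ptr<double[]> buffer(new double[n]);
    // All-zero bytes represent 0.0 for IEEE 754 doubles.
    std::memset(buffer.get(), 0, n * sizeof(double));
    return buffer;
}
```

With this, each of the three measurements could start from make_touched_buffer(sizen) and pass buffer.get() to the loops, so none of them is charged for the first touch of the memory.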

--3-- memset right after new: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779

--4-- memset right after new, with FreeNickName's version placed before the OMP version: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979

--5-- memset right after new, with FreeNickName's version placed before the OMP version, and NUM_THREADS set to 5 instead of 2 (my CPU is dual-core).

from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101

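Best answer

There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.

Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake, because it does not return wall-clock (real) time but the accumulated CPU time of all threads in the process (and on some Unix flavours even the accumulated CPU time of all child processes). Your parallel code shows better performance on Windows because there clock() returns real time rather than accumulated CPU time.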
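
The difference between the two clocks is easy to see in isolation. The following is a minimal sketch, not code from the question: the names Timing, spin and time_threads are made up for illustration, and std::thread with std::chrono::steady_clock stands in for pthread/OpenMP and for a wall-clock timer. It runs the same fixed amount of work on several threads and returns both the clock() delta and the elapsed real time.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <ctime>
#include <thread>
#include <vector>

struct Timing {
    double cpu_seconds;   // std::clock() delta: CPU time of the whole process
    double wall_seconds;  // steady_clock delta: elapsed real time
};

// Fixed amount of floating-point work; the sum is returned so that the
// compiler cannot discard the loop.
double spin(long iterations) {
    double acc = 0.0;
    for (long i = 0; i < iterations; ++i)
        acc += std::sin(1e-6 * static_cast<double>(i));
    return acc;
}

// Run spin() on num_threads threads at once and measure with both clocks.
Timing time_threads(int num_threads, long iterations_per_thread) {
    std::vector<double> sink(static_cast<std::size_t>(num_threads));
    std::vector<std::thread> workers;

    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&sink, t, iterations_per_thread] {
            sink[static_cast<std::size_t>(t)] = spin(iterations_per_thread);
        });
    for (std::thread& w : workers)
        w.join();

    const auto wall_finish = std::chrono::steady_clock::now();
    const std::clock_t cpu_finish = std::clock();

    Timing result;
    result.cpu_seconds =
        static_cast<double>(cpu_finish - cpu_start) / CLOCKS_PER_SEC;
    result.wall_seconds =
        std::chrono::duration<double>(wall_finish - wall_start).count();
    return result;
}
```

Built with g++ -std=c++11 -pthread and called as time_threads(2, 50000000) on an otherwise idle dual-core Linux machine, cpu_seconds should come out at roughly twice wall_seconds. That is the effect that made the parallel versions in the question look no faster than the serial one.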

The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():

double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);

For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:

struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));

Regarding "c++ - Why is OpenMP slower than the serial version on Ubuntu 12.04", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23184527/
