c++ - Computing pi with the Monte Carlo method in C++ OpenMP in two ways


Which approach should be faster? The first approach increments a single variable that is combined through a reduction:

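#pragma omp parallel private(seed, x, y, i) reduction(+:counter)
{
    // Give each thread its own seed for rand_r.
    seed = 25234 + 17 * omp_get_thread_num();
    // Plain loop: every thread draws prec/8 samples. A nested
    // "#pragma omp parallel for" here would open a second parallel region.
    for (i = 0; i < prec / 8; i++) {
        x = (double)rand_r(&seed) / RAND_MAX;
        y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y < 1) {
            counter++;  // private copy per thread, summed when the region ends
        }
    }
}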

The second approach uses an array of counters, one element per thread; at the end, the sum of the array's elements is the result:

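#pragma omp parallel private(seed, x, y, i, nproc)
{
    seed = 25234 + 17 * omp_get_thread_num();
    nproc = omp_get_thread_num();  // index of this thread's slot in counter
    // Plain loop: every thread draws prec/8 samples. A nested
    // "#pragma omp parallel for" here would open a second parallel region.
    for (i = 0; i < prec / 8; i++) {
        x = (double)rand_r(&seed) / RAND_MAX;
        y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y < 1) {
            counter[nproc]++;  // neighbouring slots are written by other threads
        }
    }
}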

double time = omp_get_wtime() - start_time;
int sum = 0;
for (int i = 0; i < 8; i++) {
    sum += counter[i];
}

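In theory, the second approach should be faster, because the threads do not share a single variable; each one has its own. But when I measure the execution time: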

first approach: 3.72423 [s]

second approach: 8.94479 [s]

Is my reasoning wrong, or am I doing something wrong in my code?

Best answer

You are a victim of false sharing in the second approach. An interesting article from Intel describes it:

False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

If two processors operate on independent data in the same memory address region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, forcing memory stalls in addition to wasting system bandwidth.
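To make the quoted description concrete, here is a small sketch that counts how many cache lines an array of eight counters occupies. It assumes 64-byte cache lines, which is common on x86 but not guaranteed; `kAssumedCacheLine`, `cache_line_of`, `distinct_lines`, and `PaddedInt` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumption: cache lines are 64 bytes long (typical on x86, not universal).
constexpr std::size_t kAssumedCacheLine = 64;

// Index of the assumed 64-byte cache line that contains address p.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kAssumedCacheLine;
}

// Number of distinct cache lines covered by the first bytes of n
// consecutive elements of type T starting at first.
template <typename T>
std::size_t distinct_lines(const T* first, std::size_t n) {
    std::set<std::uintptr_t> lines;
    for (std::size_t i = 0; i < n; ++i) {
        lines.insert(cache_line_of(first + i));
    }
    return lines.size();
}

// A counter padded to the assumed line size, so that neighbouring
// array elements are always 64 bytes apart.
struct PaddedInt {
    int value;
    char pad[kAssumedCacheLine - sizeof(int)];
};
static_assert(sizeof(PaddedInt) == kAssumedCacheLine,
              "PaddedInt should fill exactly one assumed cache line");
```

Under that assumption, eight plain `int` counters take 32 bytes and so fit in one or two lines, which means every increment by one thread can invalidate the line for the others. An array of eight `PaddedInt` elements spreads the counters over eight different lines.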

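Intuitively, I don't see why the first approach should be slower.
With the reduction, each thread works on its own private copy of counter, and the partial results are combined into the shared variable only once, at the end. The shared array behaves similarly in principle, but there the per-thread counters sit next to each other in memory, so they most likely share a cache line (commonly 64 bytes), and you get false sharing even though your accesses are logically independent.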
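One way to keep the array of per-thread results and still avoid the repeated writes to neighbouring elements is to mimic what the reduction does: count in a thread-local variable and store it in the array once. The following is only a sketch under assumptions, not the asker's code: `count_hits_local` is a hypothetical helper, `std::mt19937` replaces `rand_r` so that only standard C++ is needed, and the loop is split across threads with `#pragma omp for` instead of a fixed `prec/8` share per thread.

```cpp
#include <cassert>
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Fallbacks so the sketch also builds without -fopenmp (it then runs on one thread).
inline int omp_get_thread_num() { return 0; }
inline int omp_get_max_threads() { return 1; }
#endif

// Hypothetical helper: draws total_samples points in the unit square and
// returns how many fall inside the quarter circle of radius 1.
long long count_hits_local(long long total_samples) {
    assert(total_samples >= 0);
    // One slot per thread, as in the second approach from the question.
    std::vector<long long> per_thread(omp_get_max_threads(), 0);

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        // Per-thread generator, seeded like the original code.
        std::mt19937 rng(25234u + 17u * static_cast<unsigned>(tid));
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        long long local_hits = 0;  // private to this thread

        #pragma omp for
        for (long long i = 0; i < total_samples; ++i) {
            const double x = dist(rng);
            const double y = dist(rng);
            if (x * x + y * y < 1.0) {
                ++local_hits;  // no shared memory is written inside the loop
            }
        }

        per_thread[tid] = local_hits;  // a single write to the shared array
    }

    long long sum = 0;
    for (long long hits : per_thread) {
        sum += hits;
    }
    return sum;
}
```

The estimate is then `4.0 * count_hits_local(n) / n`. The array elements may still share a cache line, but each thread writes its slot only once, so the cost of any line invalidation no longer scales with the number of samples.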

Regarding c++ - Computing pi with the Monte Carlo method in C++ OpenMP in two ways, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33890613/
