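I have a simple program that runs a Monte Carlo algorithm. One iteration of the algorithm has no side effects, so I should be able to run it with multiple threads. Here is the relevant part of my whole program, which is written in C++11: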

void task(unsigned int max_iter, std::vector<unsigned int> *results, std::vector<unsigned int>::iterator iterator) {
    for (unsigned int n = 0; n < max_iter; ++n) {
        nume::Album album(535);
        unsigned int steps = album.fill_up();
        *iterator = steps;
        ++iterator;
    }
}

void aufgabe2() {
    std::cout << "\nAufgabe 2\n";

    unsigned int max_iter = 10000;

    unsigned int thread_count = 4;

    std::vector<std::thread> threads(thread_count);
    std::vector<unsigned int> results(max_iter);

    std::cout << "Computing with " << thread_count << " threads" << std::endl;

    int i = 0;
    for (std::thread &thread: threads) {
        std::vector<unsigned int>::iterator start = results.begin() + max_iter/thread_count * i;
        thread = std::thread(task, max_iter/thread_count, &results, start);
        i++;
    }

    for (std::thread &thread: threads) {
        thread.join();
    }

    std::ofstream out;
    out.open("out-2a.csv");
    for (unsigned int count: results) {
        out << count << std::endl;
    }
    out.close();

    std::cout << "Siehe Plot" << std::endl;
}

The puzzling thing is that the more threads I add, the slower it gets. With 4 threads I get this:

real    0m5.691s
user    0m3.784s
sys     0m10.844s

And with a single thread:

real    0m1.145s
user    0m0.816s
sys     0m0.320s

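I realize that moving data between CPU cores may add overhead, but the vector is declared at startup and should not be modified in the middle. Is there any particular reason why this is slower on multiple cores?

My system is an i5-2550M with 4 cores (2 + Hyper-Threading), and I use g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3.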

Update

I see that without threads (1) there is a lot of user load, whereas with threads (2) there is more kernel load than user load:

10K runs:

http://wstaw.org/m/2013/05/08/stats3.png

100K runs:

http://wstaw.org/m/2013/05/08/Auswahl_001.png

Current main.cpp

With 100K runs I get the following results:

No threads at all:

real    0m28.705s
user    0m28.468s
sys     0m0.112s

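One thread for each part of the program. These parts do not even use the same memory, so I should be able to rule out concurrent access to the same container as well. But it takes much longer: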

real    2m50.609s
user    2m45.664s
sys     4m35.772s

So although the three main parts use 300% of my CPU, they take 6 times as long.

For 1M runs, it takes real 4m45 to complete. I ran 1M earlier, and it took at least real 20m, if not real 30m.

Best answer

I evaluated your current main.cpp on GitHub. In addition to the comments provided above:

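Yes, rand() is not thread-safe, so it may be worth pre-filling an array with random values before running the multithreaded business logic (this reduces the number of possible locks). The same goes for memory allocation if you plan to do any heap activity (preallocate before going multithreaded, or use a custom per-thread allocator).
Don't forget about other processes. If you plan to use 4 threads on 4 cores, you will be competing for CPU resources with other software (the OS routines at the very least).
File output is a major source of locking. You call operator<< on every loop iteration, which costs you a lot (I recall a funny case from my past: adding log output indirectly fixed a multithreading bug. Because a generic logger is lock-driven, it acts as a kind of synchronization primitive. Beware!).
Finally, there is no guarantee of any kind that a multithreaded application will be faster than a single-threaded one. There are many CPU-specific, environment-specific, and other factors.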
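
As a rough sketch of the first point: one way to keep threads from contending on the shared state behind rand() (which some implementations, glibc for instance, guard with an internal lock) is to give every thread its own C++11 <random> engine. Note that this swaps in a different technique from the pre-filled array suggested above. The names fill_up, task and run are hypothetical, and the coupon-collector body of fill_up is only an assumed stand-in for nume::Album::fill_up(), whose real implementation is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for nume::Album::fill_up(): draws random stickers
// until all `size` distinct ones have been seen and returns the draw count.
// The engine is passed in, so no random state is shared between threads.
unsigned int fill_up(unsigned int size, std::mt19937 &engine) {
    std::uniform_int_distribution<unsigned int> pick(0, size - 1);
    std::vector<bool> seen(size, false);
    unsigned int missing = size;
    unsigned int steps = 0;
    while (missing > 0) {
        unsigned int sticker = pick(engine);
        if (!seen[sticker]) {
            seen[sticker] = true;
            --missing;
        }
        ++steps;
    }
    return steps;
}

// Fills results[begin, end) using an engine owned by this thread alone.
void task(std::vector<unsigned int> &results, std::size_t begin,
          std::size_t end, unsigned int seed) {
    std::mt19937 engine(seed);
    for (std::size_t n = begin; n < end; ++n) {
        results[n] = fill_up(535, engine);
    }
}

// Splits max_iter iterations across thread_count threads and joins them.
std::vector<unsigned int> run(std::size_t max_iter, unsigned int thread_count) {
    assert(thread_count > 0);
    std::vector<unsigned int> results(max_iter);
    std::vector<std::thread> threads;
    threads.reserve(thread_count);
    std::random_device seed_source;  // seeds are drawn on the launching thread
    std::size_t chunk = max_iter / thread_count;
    for (unsigned int i = 0; i < thread_count; ++i) {
        std::size_t begin = chunk * i;
        // The last thread also takes the remainder of the division.
        std::size_t end = (i + 1 == thread_count) ? max_iter : begin + chunk;
        threads.emplace_back(task, std::ref(results), begin, end, seed_source());
    }
    for (std::thread &thread : threads) {
        thread.join();
    }
    return results;
}
```

Calling run(10000, 4) would correspond to the setup in the question. Whether this actually removes the slowdown depends on what the real fill_up() does, so treat it as something to measure rather than a guaranteed fix.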

This question, "c++ - Program gets slower when using multiple threads", comes from Stack Overflow: https://stackoverflow.com/questions/16437105/
