c++ - Why is a double-buffer implementation 8x slower on Linux than on Windows?

I have written the following double-buffer implementation:

// ping_pong_buffer.hpp

#include <cstddef>             // std::size_t
#include <utility>             // std::swap
#include <vector>
#include <mutex>
#include <condition_variable>

template <typename T>
class ping_pong_buffer {
public:

    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

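    // Parentheses, not braces: for an integral T such as std::size_t,
    // `{ size }` selects the initializer_list constructor and yields a
    // one-element vector holding the value `size`.
    ping_pong_buffer(std::size_t size)
        : _read_buffer(size)
        , _read_valid{ false }
        , _write_buffer(size)
        , _write_valid{ false } {}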

    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    pointer get_buffer_write() {
        _write_valid = true;
        return _write_buffer.data();
    }

    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:

    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;

};

With a dummy test that performs nothing but the swap operations, its performance on Linux is about 20 times worse than on Windows:

#include <thread>
#include <iostream>
#include <chrono>

#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {

    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();

    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }

    const auto t_end = std::chrono::steady_clock::now();

    producer.join();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';

    return 0;

}

The test environments are:

Linux (Debian Stretch): Intel Xeon E5-2650 v4, GCC: 900 to 1000 ms

GCC flags: -O3 -pthread

Windows (10): Intel i7 10700K, VS2019: 45 to 55 ms

VS2019 flags: /O2

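The code is available on godbolt, with ASM output for both GCC and VS2019 and the compiler flags actually used.
The same huge gap was also observed on other machines, and it appears to be caused by the operating system.
What could be the reason for such a striking difference?
UPDATE:
The test has also been run on Linux on the same 10700K, and it is still 8 times slower than on Windows.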

Linux (Ubuntu 18.04.5): Intel i7 10700K, GCC: 290 to 300 ms

GCC flags: -O3 -pthread

If the number of iterations is increased tenfold, I get 2900 ms.

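Best answer

As Mike Robinson answered, this is most likely related to the different locking implementations on Windows and Linux.
We can get a quick idea of the overhead by profiling how often each implementation switches context. I can do the Linux profile; I am curious whether anyone can try profiling on Windows.

I am running Ubuntu 18.04 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz.
I compiled with g++ -O3 -pthread -g test.cpp -o ping_pong and recorded the context switches with this command:

sudo perf record -s -e sched:sched_switch -g --call-graph dwarf -- ./ping_pong

I then extracted a report from the perf data with:

sudo perf report -n --header --stdio > linux_ping_pong_report.sched

The report is large, but I am only interested in this section, which shows that about 200,000 context switches were recorded: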

# Total Lost Samples: 0
#
# Samples: 198K of event 'sched:sched_switch'
# Event count (approx.): 198860
#

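I think this indicates really poor performance. The test pushes and pops n = 100000 items through the double buffer, which means about 200,000 calls to end_reading() and end_writing() in total, so there is a context switch on nearly every one of those calls. That is what I would expect from using std::condition_variable.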

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/66869421/