c++ - Fastest way to blit an image buffer into an xy offset of another buffer in C++ on amd64 architecture

I have image buffers of arbitrary size that I copy into equal-sized or larger buffers at an x,y offset. The colorspace is BGRA. My current copy method is:

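#include <cstring>  // memcpy
#include <glib.h>   // guint8

// src:  tightly packed BGRA pixels (4 bytes each), src_width x src_height.
// dest: BGRA buffer that is dest_buffer_width pixels wide.
void render(const guint8* src, guint8* dest, uint src_width, uint src_height, uint dest_x, uint dest_y, uint dest_buffer_width) {
    // Do the byte arithmetic in size_t so large images cannot overflow a 32-bit uint.
    const size_t src_stride = (size_t)src_width * 4;
    const size_t dest_stride = (size_t)dest_buffer_width * 4;

    // Same width and no offset: the source rows are contiguous in dest as well.
    bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_stride * src_height);
    }
    else {
        // Skip to the first destination pixel, then copy one row at a time.
        dest += dest_y * dest_stride + (size_t)dest_x * 4;
        for(uint i=0;i < src_height;i++) {
            memcpy(dest, src, src_stride);
            dest += dest_stride;
            src += src_stride;
        }
    }
}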

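It runs fast, but I'm curious whether anything can be done to improve it and shave off a few extra milliseconds. I would rather avoid assembly code if possible, but I'm willing to add additional libraries.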

Best answer

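One popular answer on StackOverflow does use x86-64 assembly and SSE; it can be found here: Very fast memcpy for image processing?. If you do use this code, you will need to make sure your buffers are 128-bit (16-byte) aligned. A basic explanation of that code: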

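Non-temporal stores are used, so unnecessary cache writes are bypassed and writes to main memory can be combined.
Reads and writes are interleaved only in fairly large chunks (many reads are done first, then many writes). Performing many reads back-to-back generally performs better than a single read-write-read-write pattern.
Much larger registers are used (128-bit SSE registers).
Prefetch instructions are included as hints to the CPU's pipelining.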
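
The four points above can be approximated without hand-written assembly by using compiler intrinsics. What follows is only a minimal sketch, not the code from the linked answer: the names stream_copy_row and blit_bgra_stream, the 64-byte block size and the 256-byte prefetch distance are assumptions chosen for illustration. Because the destination of a blit at an x offset is usually only 4-byte aligned, the sketch copies a short head with memcpy until the destination reaches a 16-byte boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics (SSE2 is part of the x86-64 baseline)

// Hypothetical helper: copy one row of `bytes` bytes using non-temporal stores.
inline void stream_copy_row(std::uint8_t* dst, const std::uint8_t* src, std::size_t bytes) {
    assert(dst != nullptr && src != nullptr);
    // Copy a short head normally so that the streaming stores are 16-byte aligned.
    std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(dst) & 15)) & 15;
    if (head > bytes) head = bytes;
    std::memcpy(dst, src, head);
    std::size_t i = head;
    for (; i + 64 <= bytes; i += 64) {
        // Hint that data further ahead in the source row will be needed soon.
        if (i + 256 < bytes)
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 256), _MM_HINT_NTA);
        const __m128i* s = reinterpret_cast<const __m128i*>(src + i);
        __m128i* d = reinterpret_cast<__m128i*>(dst + i);
        // Four 128-bit loads first (the source may be unaligned) ...
        const __m128i a = _mm_loadu_si128(s + 0);
        const __m128i b = _mm_loadu_si128(s + 1);
        const __m128i c = _mm_loadu_si128(s + 2);
        const __m128i e = _mm_loadu_si128(s + 3);
        // ... then four non-temporal stores that bypass the cache.
        _mm_stream_si128(d + 0, a);
        _mm_stream_si128(d + 1, b);
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);
    }
    // Whatever is left (fewer than 64 bytes) goes through memcpy.
    std::memcpy(dst + i, src + i, bytes - i);
}

// Hypothetical blit with the same layout as render() in the question.
inline void blit_bgra_stream(const std::uint8_t* src, std::uint8_t* dest,
                             std::size_t src_width, std::size_t src_height,
                             std::size_t dest_x, std::size_t dest_y,
                             std::size_t dest_buffer_width) {
    const std::size_t src_stride = src_width * 4;
    const std::size_t dest_stride = dest_buffer_width * 4;
    for (std::size_t y = 0; y < src_height; ++y) {
        stream_copy_row(dest + (dest_y + y) * dest_stride + dest_x * 4,
                        src + y * src_stride, src_stride);
    }
    // Non-temporal stores are weakly ordered; fence before anyone reads dest.
    _mm_sfence();
}
```

Non-temporal stores tend to pay off only when the destination is too large to stay in cache or will not be read again soon. For small images the plain memcpy loop may well be faster, so measure both on the target machine.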

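I found this document, Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540, which seems to be the inspiration for the code above. It targets older processor generations, but it does contain a significant amount of discussion on how it works.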

For instance, consider this discussion of write combining / non-temporal stores:

The Pentium II and III CPU caches operate on 32-byte cache-line sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider a case where the CPU will do an 8-byte MMX register store: movq. Since this is only one quarter of a cache line, it will be treated as a read-modify-write operation from the cache's perspective; the target cache line will be fetched into cache, then the 8-byte write will occur. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line. The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line then doing a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining. Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when the MMX non-temporal store instruction is used. Memory is generally marked write combining only when it is used in frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The MMX movntps and movntq non-temporal store instructions instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, it also enables write combining if the target memory is cached.

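If you would rather stick with memcpy, consider looking at the source code of the memcpy implementation you are using. Some memcpy implementations look for native-word-aligned buffers so they can improve performance by using the full register size; others automatically copy as much as possible using native-word alignment and then mop up the remainder. Making sure your buffers are 8-byte aligned will help these mechanisms.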
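
One way to get aligned buffers is to allocate them with an explicit alignment. The following is a minimal sketch assuming a C++17 standard library that provides std::aligned_alloc (MSVC does not, and offers _aligned_malloc instead); alloc_bgra, is_aligned, FreeDeleter and AlignedBytes are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedBytes = std::unique_ptr<std::uint8_t[], FreeDeleter>;

// True if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocate a width x height BGRA buffer whose first byte is aligned.
// 16 bytes satisfies both the SSE requirement and the 8-byte suggestion.
inline AlignedBytes alloc_bgra(std::size_t width, std::size_t height,
                               std::size_t alignment = 16) {
    // The alignment must be a power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::size_t bytes = width * height * 4;
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    const std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
    return AlignedBytes(static_cast<std::uint8_t*>(std::aligned_alloc(alignment, padded)));
}
```

Aligning the start of the buffer does not align every row of a blit: with 4-byte pixels, the destination of a row starts on a 16-byte boundary only if both dest_x and dest_buffer_width are multiples of 4 pixels.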

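Some memcpy implementations contain a ton of up-front conditionals to make them efficient with small buffers (<512). You may want to consider copying the code and ripping out those chunks, since you are presumably not working with small buffers.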
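
To illustrate what a copy routine without size dispatch looks like, here is a minimal sketch. It is not taken from any real memcpy implementation; copy_pixels_no_dispatch is a hypothetical name, and the routine assumes the data is a whole number of 4-byte BGRA pixels and that the buffers do not overlap. The fixed-size std::memcpy calls are there only to move words without aliasing problems; optimizing compilers typically turn them into plain loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical row copy with no small-buffer special cases:
// one wide loop plus a per-pixel tail.
inline void copy_pixels_no_dispatch(std::uint8_t* dst, const std::uint8_t* src,
                                    std::size_t pixels) {
    assert(dst != nullptr && src != nullptr);
    const std::size_t bytes = pixels * 4;
    std::size_t i = 0;
    // Main loop: move 32 bytes (8 pixels) per iteration as four 64-bit words.
    for (; i + 32 <= bytes; i += 32) {
        std::uint64_t w[4];
        std::memcpy(w, src + i, sizeof w);
        std::memcpy(dst + i, w, sizeof w);
    }
    // Tail: at most 7 pixels remain, copied one 32-bit pixel at a time.
    for (; i < bytes; i += 4) {
        std::uint32_t px;
        std::memcpy(&px, src + i, sizeof px);
        std::memcpy(dst + i, &px, sizeof px);
    }
}
```

Library memcpy implementations are heavily tuned, so a hand-rolled routine like this one is not guaranteed to be faster. Benchmark it against the plain memcpy loop before adopting it.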

Regarding "c++ - Fastest way to blit an image buffer into an xy offset of another buffer in C++ on amd64 architecture", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29971794/
