I want to optimize the following code using OpenMP:
double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;
#define _MSE(m, t) \
val = refData[t] - calData[t]; \
m += val*val;
#pragma omp parallel
{
#pragma omp for
for( i=0; i<(width*height)/2; i++ ) { //yuv422: 2 pixels at a time
_MSE(m_u, 0);
_MSE(m_y, 1);
_MSE(m_v, 2);
_MSE(m_y, 3);
#pragma omp reduction(+:refData) reduction(+:calData)
refData += 4;
calData += 4;
// int id = omp_get_thread_num();
//printf("Thread %d performed %d iterations of the loop\n",id ,i);
}
}
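I am currently getting incorrect output. Any suggestions for optimizing the code above are welcome.
Best answer
I think the simplest thing you can do is split the work across 4 threads and compute the UYVY error in each one. Rather than keeping the sums as separate variables, make them an array: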
double sqError[4] = {0};
const int numBytes = width * height * 2;
#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
for( int i = elem; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
sqError[elem] += (double)(val*val);
}
}
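This way each thread operates on just one thing, and there is no contention.
Maybe this is not the most advanced use of OMP, but you should see a speedup.
After your comment about the performance hit, I ran some experiments and found that performance was indeed worse. I suspect this may be due to cache misses.
You said: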
performance hit this time with openMP : Time :0.040637 with serial Time :0.018670
So I reworked it into a single loop with a reduction on each variable:
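double e0 = 0.0, e1 = 0.0, e2 = 0.0, e3 = 0.0; // one accumulator per byte offset in the UYVY group; must be zeroed before the loop
#pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)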
for( int i = 0; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
e0 += (double)(val*val);
val = refData[i+1] - calData[i+1];
e1 += (double)(val*val);
val = refData[i+2] - calData[i+2];
e2 += (double)(val*val);
val = refData[i+3] - calData[i+3];
e3 += (double)(val*val);
}
Running a test case on my 4-core machine, I observed an improvement of roughly 4x over the serial version:
serial: 2025 ms
omp with 2 loops: 6850 ms
omp with reduction: 455 ms
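For reference, the reduction loop can be wrapped into a self-contained function that also turns the four sums into per-channel MSE values. This is only a sketch under stated assumptions: the names `yuv_mse` and `uyvy_mse` are hypothetical, the buffers are assumed to hold `unsigned char` samples in U, Y, V, Y order (as the `_MSE` calls in the question suggest), and `width * height` is assumed to be even and non-zero.

```c
#include <stddef.h>

/* Hypothetical result type: one mean squared error per channel. */
typedef struct {
    double y, u, v;
} yuv_mse;

/* Sketch: per-channel MSE of two UYVY (YUV 4:2:2) buffers.
   Assumes unsigned char samples and an even, non-zero pixel count. */
yuv_mse uyvy_mse(const unsigned char *refData, const unsigned char *calData,
                 int width, int height)
{
    const long numBytes = (long)width * height * 2; /* 2 bytes per pixel */
    double e0 = 0.0, e1 = 0.0, e2 = 0.0, e3 = 0.0;

    /* Each thread accumulates into private copies of e0..e3;
       OpenMP adds the private copies together when the loop ends. */
    #pragma omp parallel for reduction(+:e0, e1, e2, e3)
    for (long i = 0; i < numBytes; i += 4) {
        int val = refData[i] - calData[i];         /* U  */
        e0 += (double)(val * val);
        val = refData[i + 1] - calData[i + 1];     /* Y0 */
        e1 += (double)(val * val);
        val = refData[i + 2] - calData[i + 2];     /* V  */
        e2 += (double)(val * val);
        val = refData[i + 3] - calData[i + 3];     /* Y1 */
        e3 += (double)(val * val);
    }

    /* Y has one sample per pixel; U and V have one sample per two pixels. */
    const double pixels = (double)width * height;
    yuv_mse out;
    out.y = (e1 + e3) / pixels;
    out.u = e0 / (pixels / 2.0);
    out.v = e2 / (pixels / 2.0);
    return out;
}
```

Compiled without OpenMP support, the pragma is ignored and the function runs serially with the same result, which makes it easy to compare the two builds.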
[Edit] On the question of why the first piece of code performs worse than the non-parallel version, Hristo Iliev said:
Your first piece of code is a terrible example of what false sharing does in multithreaded codes. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in a half cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this struct _error { double val; double pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate cache line, hence no false sharing.
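The padding suggested in the comment could be applied to the first version roughly as follows. This is a sketch, not a measured result: the names `padded_error` and `uyvy_sq_error_padded` are hypothetical, a 64-byte cache line is assumed, and the buffers are assumed to hold `unsigned char` samples.

```c
#include <stddef.h>

/* One accumulator per 64 bytes, so no two accumulators can share
   a cache line (assumes cache lines of at most 64 bytes). */
struct padded_error {
    double val;
    double pad[7]; /* unused; pads the struct to 8 * 8 = 64 bytes */
};

/* Sketch: sum of squared differences for each of the 4 byte offsets
   in a UYVY group, with one outer iteration per offset. */
void uyvy_sq_error_padded(const unsigned char *refData,
                          const unsigned char *calData,
                          long numBytes, double out[4])
{
    struct padded_error sqError[4] = {{0}};

    /* Up to 4 threads, each writing only to its own padded slot. */
    #pragma omp parallel for
    for (int elem = 0; elem < 4; elem++) {
        for (long i = elem; i < numBytes; i += 4) {
            int val = refData[i] - calData[i];
            sqError[elem].val += (double)(val * val);
        }
    }

    /* Copy the sums out, dropping the padding. */
    for (int elem = 0; elem < 4; elem++)
        out[elem] = sqError[elem].val;
}
```

Because consecutive `val` fields sit 64 bytes apart, they fall in different cache lines even if the array itself is not 64-byte aligned. No timing is claimed for this variant here; it still uses at most 4 threads, whereas the reduction version can use however many the runtime provides.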
Regarding "c - Optimizing an MSE algorithm with OpenMP", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14804859/