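Edit: I now realize I did not explain my algorithm well. Let me try again.
What I am doing is very similar to a dot product of two vectors, but with one difference. I have two vectors: a bit vector and a float vector of the same length. So I need to compute the sum float[0]*bit[0] + float[1]*bit[1] + ... + float[N-1]*bit[N-1], but unlike a classic dot product, I need to skip a fixed number of elements after each set bit.
Example: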
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
In this case the sum is computed as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
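The worked example above can be sketched as a plain reference loop. This is a minimal, unoptimized sketch meant only for checking results, not the code from the question: the function name skip_sum_reference and the one-byte-per-bit layout are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical reference version: bits[] holds one 0/1 value per byte.
   After every set bit, the next nSkip elements are skipped. */
static float skip_sum_reference(const float *floats,
                                const unsigned char *bits,
                                size_t n, size_t nSkip)
{
    float sum = 0.0f;
    size_t i = 0;
    while (i < n) {
        if (bits[i]) {
            sum += floats[i];
            i += 1 + nSkip;   /* step past the hit and the skipped elements */
        } else {
            ++i;              /* bit is clear: no skipping */
        }
    }
    return sum;
}
```

With the example data and nSkip = 2, this visits indices 0, 3 and 4 and returns 2.5.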
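The code below is called many times with different data, so I need to optimize it to run as fast as possible on my Core i7 (I do not care much about compatibility with anything else). It is already optimized to some extent but is still slow, and I do not know how to improve it further. The bit array is implemented as an array of 64-bit unsigned integers, which lets me use bitscanforward to find the next set bit.
Code: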
unsigned int i = 0;
float fSum = 0;
do
{
    unsigned int nAddr = i / 64;
    unsigned int nShift = i & 63;
    unsigned __int64 v = bitarray[nAddr] >> nShift;
    unsigned long idx;
    if (!_BitScanForward64(&idx, v))
    {
        i += 64 - nShift;
        continue;
    }
    i += idx;
    fSum += floatarray[i];
    i += nSkip;
} while (i < nEnd);
The profiler shows the 3 slowest hot spots:
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
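But there may be a different way to do this. I am considering resetting nSkip bits after each set bit in the bit vector and then computing a classic dot product. I have not tried it yet, but honestly I do not believe it would be faster, given the additional memory accesses.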
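The idea in the last paragraph (clear the skipped bits first, then take an ordinary dot product) might look roughly like the sketch below. The names clear_skipped_bits and masked_dot are hypothetical, and the layout (bit i stored in bit i % 64 of word i / 64) is assumed to match the bit array described above. The sketch shows the logic only and makes no claim about speed; the clearing pass is still a sequential scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clear, in place, the nSkip bits that follow
   every set bit that survives the scan. */
static void clear_skipped_bits(uint64_t *bits, size_t n, size_t nSkip)
{
    size_t i = 0;
    while (i < n) {
        if ((bits[i / 64] >> (i % 64)) & 1u) {
            for (size_t k = 1; k <= nSkip && i + k < n; ++k)
                bits[(i + k) / 64] &= ~(UINT64_C(1) << ((i + k) % 64));
            i += 1 + nSkip;
        } else {
            ++i;
        }
    }
}

/* Classic dot product of a float vector with a 0/1 bit vector. */
static float masked_dot(const float *floats, const uint64_t *bits, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += floats[i] * (float)((bits[i / 64] >> (i % 64)) & 1u);
    return sum;
}
```

For the example data, the bit pattern {1, 0, 1, 0, 1} becomes {1, 0, 0, 0, 1} after clearing, and the dot product is then 1.5 + 1.0 = 2.5.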
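Best answer
You have too many operations inside the loop. You also have only one loop, so many operations that really only need to happen once per flag word (64-bit unsigned integer) happen an extra 63 times.
Treat division as an expensive operation and try not to do it often when optimizing code for performance.
Memory access is also considered expensive in terms of time, so it should be limited to the accesses that are actually required.
Tests that let you exit early are often useful (although sometimes the test itself is expensive relative to the operation it avoids, that is probably not the case here).
Using nested loops should simplify this a lot. The outer loop should work at the 64-bit word level, and the inner loop should work at the bit level.
I noticed a mistake in my earlier advice. Since the division here is by 64, a power of 2, it is not actually an expensive operation, but we should still move as much work as possible out of the loop.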
/* this is completely untested, but incorporates the optimizations
that I outlined as well as a few others.
I process the arrays backwards, which allows for elimination of
comparisons of variables against other variables, which is much
slower than comparisons of variables against 0, which is essentially
free on many processors when you have just operated or loaded the
value to a register.
Going backwards at the bit level also allows for the possibility that
the compiler will take advantage of the comparison of the top bit
being the same as test for negative, which is cheap and mostly free
for all but the first time through the inner loop (for each time
through the outer loop).
*/
/* Note: this sums floatarray[i] for every set bit below nEnd;
   the nSkip step from the question is not applied here. */
double acc = 0.0;
unsigned i_end;
unsigned i_bit;
int i_word;
if (nEnd == 0)
{
    return acc;
}
i_end = nEnd - 1;                /* index of the last element */
i_bit = i_end % 64;
i_word = (int)(i_end / 64);
do
{
    /* Move the current end bit to the top, dropping any bits above it. */
    unsigned __int64 v = bitarray[i_word] << (63 - i_bit);
    unsigned i_upper = (unsigned)i_word * 64;
    while (v)
    {
        if (v & 0x8000000000000000ULL)
        {
            // The following is the same as i_bit + i_word * 64,
            // because i_bit < 64 and i_upper is a multiple of 64.
            unsigned i = i_bit | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit--;
    }
    i_bit = 63;
    i_word--;
} while (i_word >= 0);
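For comparison, the word-level outer loop and bit-level inner loop described in the answer can also be sketched going forward, with the skip step from the question included. This is a hedged sketch rather than the answerer's code: skip_sum_words is a hypothetical name, __builtin_ctzll is the GCC/Clang counterpart of _BitScanForward64 (so a GCC-compatible compiler is assumed), and the step after a hit is 1 + nSkip to match the worked example in the question, whereas the question's own loop advances by nSkip.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical forward version: bit i lives in bit (i % 64) of bits[i / 64]. */
static float skip_sum_words(const float *floats, const uint64_t *bits,
                            size_t n, size_t nSkip)
{
    float sum = 0.0f;
    size_t i = 0;                            /* next eligible index */
    while (i < n) {
        size_t base = i & ~(size_t)63;       /* first index of this word */
        /* Load the word once and drop the bits below i. */
        uint64_t v = bits[i / 64] & (~UINT64_C(0) << (i & 63));
        i = base + 64;                       /* default: resume at next word */
        while (v) {
            size_t idx = base + (size_t)__builtin_ctzll(v);
            if (idx >= n)
                return sum;                  /* set bit lies past the end */
            sum += floats[idx];
            size_t next = idx + 1 + nSkip;   /* hit plus skipped elements */
            if (next >= base + 64) {         /* skip leaves this word */
                i = next;
                break;
            }
            v &= ~UINT64_C(0) << (next - base);  /* drop hit and skipped bits */
        }
    }
    return sum;
}
```

Each 64-bit word is loaded once per visit, and the index arithmetic (division and masking) happens in the outer loop only. Whether this is actually faster than the original loop would have to be measured.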
Regarding c - How to optimize C code: looking for the next set bit and finding sum of corresponding array elements, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30742133/