这是我上一个问题的第二个问题
Faster way to do multi dimensional matrix addition?
在遵循@Peter Cordes 的建议后,我对代码进行了矢量化,现在速度提高了 50 倍。然后我再次做了gprof,发现这个函数占用了大部分时间。
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 69.97 1.53 1.53 cal_score(int, std::string, int const*, int, double)
double cal_score(int l, string seq, const int *__restrict__ pw,int cluster,double alpha)
{
const int cols =4;
const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
double score = 0;
char s;
string alphabet="ACGT";
int count=0;
for(int k=0;k<cols;k++)
count=count+pwcluster[k];
for (int i = 0; i < l; i++){
long row_offset = cols*i;
s=seq[i];
//#pragma omp simd
for(int k=0;k<cols;k++) {
if (s==alphabet[k])
score=score+log( ( pwcluster[row_offset+k]+alpha )/(count+4*alpha) );
}
}
return score;
}
我是第一次进行代码优化,所以不知道如何继续。那么有没有什么办法可以把这个函数写得更好呢?这样我就能获得更快的速度。 输入seq是长度为l的字符'ACGT'的序列。 pw 是大小为 2*l*4 或 [p][q][r] 的一维数组,簇为 p。
最佳答案
这是重写它的另一种方法。
这会使用查找表而不是搜索来转换字符串,并且调用 log
的次数减少了 10 倍。
这还将 seq
更改为按引用传递的 const char*
,而不是按值传递的 std::string
。 (这将复制整个字符串)。
unsigned char transTable[128];
void InitTransTable(){
memset(transTable, 0, sizeof(transTable));
transTable['A'] = 0;
transTable['C'] = 1;
transTable['G'] = 2;
transTable['T'] = 3;
}
static int tslen = 0; // static instead of global lets the compiler keep tseq in a register inside the loop
static unsigned char* tseq = NULL; // reusable buffer for translations. Not thread-safe
double cal_score(
int l
, const unsigned char* seq // if you want to pass a std::string, do it by const &, not by value
, const int *__restrict__ pw
, int cluster
, double alpha
)
{
int i, j, k;
// make sure tseq is big enough
if (tseq == NULL){
tslen = std::max(4096, l+1024);
tseq = new unsigned char[tslen];
memset(tseq, 0, tslen);
} else if (l > tslen-1){
delete tseq;
tslen = l + 4096;
tseq = new unsigned char[tslen];
memset(tseq, 0, tslen);
}
// translate seq into tseq
// (decrementing i so the beginning of tseq will be hot in cache when we're done)
for (i = l; --i >= 0;) tseq[i] = transTable[seq[i]];
const int cols = 4;
const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
double score = 0;
// count up pwcluster
int count=0;
for(k = 0; k < cols; k++) count += pwcluster[k];
double count4alpha = (count + 4*alpha);
long row_offset = 0;
for (i = 0; i < l;){
double product = 1;
for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
k = tseq[i];
product *= (pwcluster[row_offset + k] + alpha) / count4alpha;
}
score += log(product);
}
return score;
}
这个compiles to fairly good code ,但如果没有 -ffast-math
,除法不能被乘法代替。
它不会自动矢量化,因为我们只加载 pwcluster
的每四个元素之一。
关于c++ - 计算序列可能性的更快方法?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/37254001/