performance - Efficiently treating tuples as fixed-size vectors

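In Chapel, homogeneous tuples can be used as if they were small "vectors" (e.g., a = b + c * 3.0 + 5.0;).

However, because various math functions are not provided for tuples, I tried writing a norm() function in several ways and compared their performance. My code is as follows: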

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
 // res += norm_loop(       a );
 // res += norm_loop_param( a );
 // res += norm_reduce(     a );
}

writeln( "result = ", res );

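I compiled the above code with chpl --fast test.chpl (Chapel v1.16 on OSX 10.11 with 4 cores, installed via Homebrew). norm_3tuple(), norm_loop(), and norm_loop_param() then ran at almost the same speed (0.45 sec), while norm_reduce() was much slower (about 30 sec). I checked the output of the top command and found that norm_reduce() uses all 4 cores, while the other functions use only 1 core. So my questions are...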

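Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?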

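Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by the --fast option)?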

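In norm_loop_param(), I have also tried using the param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?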

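I'm sorry for asking many questions at once, and I would appreciate any advice or suggestions for treating small tuples efficiently. Thanks very much!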

Best answer

Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?

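I believe you're correct that this is what's happening. Reductions are executed in parallel, and Chapel doesn't currently attempt any smart throttling to squash that parallelism when the work may not warrant it (as in this case). So I think you're suffering from too much task overhead, with the tasks doing almost no work other than coordinating with one another (though I'm surprised the difference is this large... but I also find I have little intuition for such things). In the future, we'd like the compiler to serialize such small reductions in order to avoid these overheads.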
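If you still want the reduce notation in the meantime, one thing to try is wrapping the reduction in a serial statement, which suppresses task creation within its dynamic scope. The following is only a sketch under that assumption: norm_reduce_serial is a hypothetical name that does not appear in the question, and I have not measured whether it actually recovers the speed of the loop versions on v1.16.

```chapel
// Hypothetical variant of norm_reduce() from the question (the name is
// illustrative only). The reduction is wrapped in a `serial` statement,
// which should keep it on the calling task instead of spawning new ones.
proc norm_reduce_serial( x ): real
{
    var tmp = 0.0;
    serial true {
        // Same expression as in norm_reduce(): square each element, then sum.
        tmp = + reduce ( x**2 );
    }
    return sqrt( tmp );
}

var v = ( 1.0, 2.0, 3.0 );
writeln( norm_reduce_serial( v ) );
```

Whether this removes enough overhead to matter would need to be timed in the same loop as the other variants; for a 3-tuple the explicit loop versions remain the simpler choice.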

Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by --fast option)?

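The Chapel compiler doesn't unroll the explicit for loop in norm_loop() (you can verify this by inspecting the code generated with the --savec flag), but it could be that the back-end compiler does. Or it could be that the for loop really doesn't cost that much compared to the unrolled loop of norm_loop_param(). I suspect you'd need to inspect the generated assembly to determine which is the case. But I'd also expect back-end C compilers to do well with our generated code; for example, it's easy for them to see that this is a 3-iteration loop.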

In norm_loop_param(), I have also tried using param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?

This is hard to give a definitive answer to, since I think it's mostly a question of how good the back-end C compiler is.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/47231369/
