assembly - Does ARM frsqrts need to be paired with extra fmul instructions to perform a Newton iteration?

The ARM documentation for the frsqrts instruction says:

This instruction multiplies corresponding floating-point values in the vectors of the two source SIMD and FP registers, subtracts each of the products from 3.0, divides these results by 2.0, places the results into a vector, and writes the vector to the destination SIMD and FP register.

I interpreted this as yₙ₊₁ = (3 - xyₙ)/2, and in fact the following code confirms that interpretation:

.global _main
.align 2
_main:
    fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
    fmov d1, #0.5 // initial guess
    frsqrts d2, d0, d1 // first approx

    mov x0, 0
    mov x16, #1 // '1' = terminate syscall
    svc #0x80   // "supervisor call"
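As a quick check of that reading, the arithmetic can be modeled in a few lines of Python. This is only a sketch: `frsqrts_model` is a hypothetical helper name, not an ARM API, and it ignores details of the real instruction such as its handling of intermediate rounding and of special values like infinities and zeros.

```python
def frsqrts_model(a: float, b: float) -> float:
    """Idealized model of the documented FRSQRTS step: (3 - a*b) / 2."""
    return (3.0 - a * b) / 2.0

# Same operands as the assembly above: d0 = 2.0, d1 = 0.5
d2 = frsqrts_model(2.0, 0.5)
print(d2)  # (3 - 2.0*0.5) / 2 = 1.0
```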

However, reading about the Newton iteration for the inverse square root, I found that the iteration is not yₙ₊₁ = (3 - xyₙ)/2 but yₙ₊₁ = yₙ(3 - xyₙ²)/2. Now, obviously I can combine frsqrts with other instructions to get this result:

    fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
    fmov d1, #0.5 // initial guess
    fmul d2, d1, d1 // initial guess squared
    frsqrts d3, d0, d2 // (3-r*r*x)/2
    fmul d4, d1, d3 // d4 = r*(3-r*r*x)/2
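Tracing this three-instruction sequence with an idealized model suggests that it performs one full Newton step, moving the guess 0.5 toward 1/sqrt(2) ≈ 0.7071. The helper names `frsqrts_model` and `newton_step` are hypothetical and exist only for this illustration; the model assumes plain (3 - a*b)/2 arithmetic with no special-case handling.

```python
import math

def frsqrts_model(a, b):
    """Idealized model of the documented FRSQRTS step: (3 - a*b) / 2."""
    return (3.0 - a * b) / 2.0

def newton_step(x, r):
    """One Newton step for 1/sqrt(x), mirroring the three instructions above."""
    r_sq = r * r                   # fmul    d2, d1, d1
    corr = frsqrts_model(x, r_sq)  # frsqrts d3, d0, d2
    return r * corr                # fmul    d4, d1, d3

r1 = newton_step(2.0, 0.5)
print(r1)                    # 0.5 * (3 - 2.0*0.25) / 2 = 0.625
print(1.0 / math.sqrt(2.0))  # reference value for comparison
```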

But it seems odd to introduce a dedicated instruction that only gets you halfway to the goal. Am I misusing this instruction?

Best answer

This is an entirely conventional way of partitioning the Newton-Raphson iteration for the reciprocal square root into simple RISC-like instructions.

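For example, in AMD's 3DNow! instruction set extension for x86, this is the functionality of the PFRSQIT1 instruction (full disclosure: I designed it^[1]). This functionality does not even require underlying FMA capability: it can be implemented with minor modifications to an existing multiplier, because when the instruction is used as intended, i.e. as part of an NR iteration for the reciprocal square root, the result is close to 1.0.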
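To see why the result of this step stays near 1.0, here is a short derivation. It assumes the current estimate has a small relative error ε (a symbol introduced here purely for illustration), that is, est = (1 + ε)/√x, so that x·est² = (1 + ε)²:

```latex
\[
\frac{3 - x\,\mathrm{est}^2}{2}
  = \frac{3 - (1+\varepsilon)^2}{2}
  = 1 - \varepsilon - \frac{\varepsilon^2}{2}
\]
```

For instance, with |ε| = 2⁻⁸ the step result lies within about 0.4% of 1.0.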

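As the asker inferred, frsqrts needs to receive the square of the current estimate of the reciprocal square root as one of its source operands. Since the frsqrte instruction provides an estimate of 1/sqrt(x) accurate to about 8 bits, computing a single-precision reciprocal square root requires two Newton-Raphson iterations. Conceptually: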

     frsqrte  est0, x             // initial approximation, accurate to about 8 bits

     fmul     est0_sq, est0, est0 // first NR iteration for reciprocal square root
     frsqrts  tmp, est0_sq, x
     fmul     est1, tmp, est0     

     fmul     est1_sq, est1, est1 // second NR iteration for reciprocal square root
     frsqrts  tmp, est1_sq, x
     fmul     res, tmp, est1

This instruction sequence maps directly onto a sequence of the corresponding intrinsics: vrsqrte_f32(), vmul_f32(), and vrsqrts_f32().
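The convergence behind the two-iteration count can be sketched numerically. The model below is an assumption-laden illustration, not the hardware behavior: `frsqrts_model`, `nr_iteration`, and `accurate_bits` are hypothetical helper names, the frsqrte estimate is replaced by the exact value perturbed by a relative error of 2⁻⁸, and everything runs in Python's double precision, so single-precision rounding is not modeled.

```python
import math

def frsqrts_model(a, b):
    """Idealized model of the documented FRSQRTS step: (3 - a*b) / 2."""
    return (3.0 - a * b) / 2.0

def nr_iteration(x, est):
    """One NR iteration for 1/sqrt(x), following the conceptual sequence above."""
    est_sq = est * est              # fmul    est_sq, est, est
    tmp = frsqrts_model(est_sq, x)  # frsqrts tmp, est_sq, x
    return tmp * est                # fmul    est_next, tmp, est

def accurate_bits(approx, exact):
    """Number of accurate bits, measured as -log2 of the relative error."""
    rel_err = abs(approx - exact) / exact
    return math.inf if rel_err == 0.0 else -math.log2(rel_err)

x = 2.0
exact = 1.0 / math.sqrt(x)
est0 = exact * (1.0 + 2.0 ** -8)  # stand-in for frsqrte: relative error of 2^-8
est1 = nr_iteration(x, est0)
res = nr_iteration(x, est1)

for name, value in (("est0", est0), ("est1", est1), ("res", res)):
    print(f"{name}: about {accurate_bits(value, exact):.1f} accurate bits")
```

Under these assumptions the number of accurate bits roughly doubles with each iteration, which is consistent with one iteration falling short of, and two iterations exceeding, the 24 significand bits of single precision.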

^[1] S. Oberman, F. Weber, N. Juffa, and G. Favor, "AMD 3DNow! Technology and the K6-2 Microprocessor." HotChips 10, August 16-18, 1998 (online)

Regarding "assembly - Does ARM frsqrts need to be paired with extra fmul instructions to perform a Newton iteration?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/76891426/
