c++ - Clang built-in matrix and vector extensions: efficient matrix-vector multiplication

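I am writing a small 3D graphics application to learn about Clang's vector and matrix extensions (matrices still seem to be under development, if I am reading the right version of the docs).

I am not sure how to write the most efficient matrix-vector multiplication code with these types. Given: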

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));

The documentation says (regarding the indices used to access matrix elements):

The first specifies the number of rows, and the second specifies the number of columns.

     Column
        |
        v
Row->| M00 M01 M02 M03 |
     | M10 M11 M12 M13 |
     | M20 M21 M22 X23 |
     | M30 M31 M32 M33 |

So I understand that m[2][3] (where m is an m4x4) gives me the element marked X in the matrix above.

Then (regarding how the elements are laid out in memory):

The elements of a value of a matrix type are laid out in column-major order without padding.

So I take from this that, if I could look at how the elements are stored in memory, I would see:

M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33

Have I got this right so far?
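The index arithmetic behind that layout can be sketched in plain C++. This is only an illustration of column-major storage without padding, as quoted from the documentation above; `column_major_offset` is a hypothetical helper and not something Clang provides.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Clang): offset, in elements, of (row, col)
// in a column-major matrix with `rows` rows and no padding.
constexpr std::size_t column_major_offset(std::size_t row, std::size_t col,
                                          std::size_t rows) {
    return col * rows + row;
}

// In a 4x4 matrix, the element at row 2, column 3 (X23) is at offset 14.
static_assert(column_major_offset(2, 3, 4) == 14, "X23 sits at offset 14");
// Neighbours within a column are adjacent in memory ...
static_assert(column_major_offset(1, 0, 4) - column_major_offset(0, 0, 4) == 1,
              "column neighbours are 1 element apart");
// ... while neighbours within a row are one full column (4 elements) apart.
static_assert(column_major_offset(0, 1, 4) - column_major_offset(0, 0, 4) == 4,
              "row neighbours are 4 elements apart");
```

Under this mapping, walking down a column touches consecutive floats, whereas walking along a row jumps 4 floats at a time.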

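Does the order in which we access matrix elements matter? (Am I doing this right?)

I then assume that, to make the mat-float4 multiplication more efficient, I need to access the elements in the order they are laid out in memory: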

m4x4 m;
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
    v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
    v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
    v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
    1 // ignore w element for now
};

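Of course, I could use something like __builtin_matrix_column_major_load to load the right values into m[0][0], m[0][1], and so on.

Am I overcomplicating things, or should the order matter here? Is the expression above actually better than: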

float4 res = {
    v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
    v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
    v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
    1 // ignore w element for now
};

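(assuming I have transposed the elements before calling __builtin_matrix_column_major_load).

Is there a better way?

Now, I know these types are still under development. But I understand that the whole point of these types is to take advantage of SIMD instructions. If I do: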

float4 a = {...};
float4 b = {...};
float4 c = a + b;

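then are the 4 floats of a added to the 4 floats of b in a single cycle? As for the mat-float4 multiplication, since my code accesses the elements of the float4 and the m4x4 individually, it seems I would not benefit from any such optimisation in this particular case?

So my second question: is there a better way?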

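Should I store the matrix as 4 float4 values and do float4 * float4 multiplications?
I saw the post "Matrix-Vector and Matrix-Matrix multiplication using SSE", which gives an example of implementing mat-vector multiplication with SIMD instructions. It seems to pack the elements of the matrix into __m128 values and use them to multiply the matrix elements by the vector's elements, with SIMD instructions such as _mm_add_ps and _mm_mul_ps.
Should I wait until this extension is more mature?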
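The idea behind the first two options can be sketched without intrinsics: M * v is the sum of the columns of M, each scaled by the matching component of v, so every step works on four floats that are contiguous in column-major storage. The sketch below uses plain arrays so that it compiles with any C++ compiler. `Vec4`, `Mat4` and `mul_columns` are hypothetical names, and it is an assumption, not something measured here, that a compiler would turn each inner loop into a single vector multiply and add.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // stand-in for float4
using Mat4 = std::array<Vec4, 4>;   // four columns, i.e. column-major storage

// Computes M * v as v[0]*col0 + v[1]*col1 + v[2]*col2 + v[3]*col3.
inline Vec4 mul_columns(const Mat4 &cols, const Vec4 &v) {
    Vec4 res{0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t c = 0; c < 4; ++c) {
        // Each pass reads one whole column, which is contiguous in memory.
        for (std::size_t r = 0; r < 4; ++r) {
            res[r] += cols[c][r] * v[c];
        }
    }
    return res;
}
```

With a translation matrix whose last column is (1, 2, 3, 1), calling `mul_columns` on the point (0.5, 0.25, 0.125, 1) yields (1.5, 2.25, 3.125, 1).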

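Any feedback or suggestions would be appreciated. I am doing this as an exercise to learn about these new built-in types.

Best answer

In case anyone finds this now: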

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float float4x4 __attribute__((matrix_type(4, 4)));

float4 mulmv4(float4x4 mat, float4 vec) {
    typedef float float4x1 __attribute__((matrix_type(4, 1)));
    float4 dst;
    float4x1 col = __builtin_matrix_column_major_load((float *)&vec, 4, 1, 4);
    __builtin_matrix_column_major_store(mat * col, (float *)&dst, 4);
    return dst;
}

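Convert the vector to a single-column "matrix", for which the product is defined. This really should be built in, although, as you said, Clang's matrix types are not finished yet.

By the way: you can apply the same idea to the dot product of ext_vector_type values, since (AFAIK) that is not built in either. The dot product multiplies a float1x4 by a float4x1 (in that order).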
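Written out, the order matters because a 1x4 row times a 4x1 column collapses to a single scalar, while the reverse order would produce a 4x4 matrix. The symbols a and b below are placeholders for the two vectors:

```latex
\mathbf{a}\cdot\mathbf{b}
= \begin{pmatrix} a_0 & a_1 & a_2 & a_3 \end{pmatrix}
  \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}
= a_0 b_0 + a_1 b_1 + a_2 b_2 + a_3 b_3
```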

Regarding "c++ - Clang built-in matrix and vector extensions: efficient matrix-vector multiplication", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/72859045/
