tensorflow - Is a concatenated matrix multiplication faster than multiple non-concatenated matrix multiplications? If so, why?

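The definition of an LSTM cell involves 4 matrix multiplications with the input and 4 matrix multiplications with the output. We can simplify the expression by concatenating the 4 small matrices (giving a matrix 4 times larger) and using a single matrix multiplication.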
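A minimal sketch of the equivalence described above, written with NumPy rather than TensorFlow or PyTorch so that it stays self-contained. The sizes, the random seed, and the names `W_i`, `W_f`, `W_g`, `W_o`, `W_cat` are assumptions chosen for illustration; they do not come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3  # illustrative sizes (assumption)

x = rng.standard_normal(input_size)
# Four separate gate matrices, each of shape (hidden_size, input_size).
W_i, W_f, W_g, W_o = (
    rng.standard_normal((hidden_size, input_size)) for _ in range(4)
)

# Option A: four small matrix-vector products, one per gate.
separate = [W @ x for W in (W_i, W_f, W_g, W_o)]

# Option B: stack the matrices row-wise and do one larger product.
W_cat = np.concatenate([W_i, W_f, W_g, W_o], axis=0)  # (4*hidden_size, input_size)
fused = W_cat @ x                                     # (4*hidden_size,)
fused_parts = np.split(fused, 4)                      # four gate pre-activations

# Both options should give the same numbers up to floating-point rounding.
same = all(np.allclose(a, b) for a, b in zip(separate, fused_parts))
print(W_cat.shape, same)
```

The sketch only checks that the two formulations compute the same values; it says nothing about which one runs faster.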

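My question is: does this improve the efficiency of the matrix multiplication? If so, why? Is it because we can place the matrices in contiguous memory? Or is it just for the sake of concise code?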

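Whether or not we concatenate the matrices, the number of elements we multiply does not change. (So the complexity should not change.) So I wonder why we do this.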
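The counting argument above can be checked with a few lines of arithmetic. This is only a sketch: the helper `matmul_macs` and the sizes are hypothetical and chosen for illustration. It counts multiply-accumulate operations, which come out identical for both layouts. What does differ is the number of separate multiplication calls. Whether that difference affects speed depends on the backend and hardware, and is not measured here.

```python
def matmul_macs(rows: int, cols: int, batch: int = 1) -> int:
    """Multiply-accumulates for a (rows x cols) matrix times a (cols x batch) matrix."""
    return rows * cols * batch

input_size, hidden_size, batch = 128, 256, 32  # illustrative sizes (assumption)

# Four separate input-to-hidden products, one per gate.
separate_macs = 4 * matmul_macs(hidden_size, input_size, batch)
# One product with the row-wise concatenated (4*hidden_size x input_size) matrix.
fused_macs = matmul_macs(4 * hidden_size, input_size, batch)

# Number of separate matrix-multiplication calls in each formulation.
separate_calls, fused_calls = 4, 1

print(separate_macs == fused_macs)
print(separate_calls, fused_calls)
```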

Here is an excerpt from the PyTorch documentation for torch.nn.LSTM(*args, **kwargs). W_ii, W_if, W_ig and W_io are concatenated.

weight_ih_l[k] – the learnable input-hidden weights of the k-th layer (W_ii|W_if|W_ig|W_io), of shape (4*hidden_size x input_size)

weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer (W_hi|W_hf|W_hg|W_ho), of shape (4*hidden_size x hidden_size)

bias_ih_l[k] – the learnable input-hidden bias of the k-th layer (b_ii|b_if|b_ig|b_io), of shape (4*hidden_size)

bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer (b_hi|b_hf|b_hg|b_ho), of shape (4*hidden_size)
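To make the documented layout concrete, here is a hedged NumPy sketch of one LSTM step that uses parameters with the shapes listed above and splits the result in the order i, f, g, o. It is not PyTorch's internal implementation. The function name `lstm_cell_step`, the helper `sigmoid`, the sizes, and the random parameters are assumptions for illustration, and the gate equations are the standard LSTM ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h, c, weight_ih, weight_hh, bias_ih, bias_hh):
    """One LSTM step using concatenated (i|f|g|o) parameters."""
    # One product per operand instead of four; result has shape (4*hidden_size,).
    gates = weight_ih @ x + bias_ih + weight_hh @ h + bias_hh
    i, f, g, o = np.split(gates, 4)  # order follows the documented layout
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_next = f * c + i * g           # new cell state
    h_next = o * np.tanh(c_next)     # new hidden state
    return h_next, c_next

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3       # illustrative sizes (assumption)
weight_ih = rng.standard_normal((4 * hidden_size, input_size))
weight_hh = rng.standard_normal((4 * hidden_size, hidden_size))
bias_ih = rng.standard_normal(4 * hidden_size)
bias_hh = rng.standard_normal(4 * hidden_size)

x = rng.standard_normal(input_size)
h0 = np.zeros(hidden_size)
c0 = np.zeros(hidden_size)
h1, c1 = lstm_cell_step(x, h0, c0, weight_ih, weight_hh, bias_ih, bias_hh)
print(h1.shape, c1.shape)
```

With this layout each step needs two matrix products (one with the input, one with the previous hidden state) instead of eight.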

Best answer

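The structure of the LSTM is not meant to improve the efficiency of the multiplications, but rather to get around vanishing/exploding gradients (https://stats.stackexchange.com/questions/185639/how-does-lstm-prevent-the-vanishing-gradient-problem). Research is ongoing to mitigate the effects of vanishing gradients, and GRU/LSTM cells and peepholes are a few of the attempts to mitigate them.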

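Regarding "tensorflow - Is a concatenated matrix multiplication faster than multiple non-concatenated matrix multiplications? If so, why?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54799384/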
