DeltaProduct: Improving State Tracking in Linear Recurrent Neural Networks via Householder Products
Link: https://openreview.net/forum?id=nvb60szj5C
Twitter / X: https://x.com/julien_siems/status/1905628609714286687
Authors: Julien Siems*, Timur Carstensen*, Arber Zela, Frank Hutter, Massimiliano Pontil, Riccardo Grazzi* (*equal contribution)
Abstract: Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. While diagonal matrices used in architectures like Mamba, GLA, or mLSTM yield fast runtime, they suffer from severely limited expressivity. To address this, recent architectures such as (Gated) DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, allowing simultaneous token-channel mixing, which overcomes some expressivity limitations with only a slight decrease in training efficiency. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple (nh) steps per token. This naturally leads to diagonal plus rank-nh state-transition matrices, formed as products of nh generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency as well as a stable recurrence. Through extensive experiments, we demonstrate that DeltaProduct achieves superior state-tracking and language modeling capabilities while exhibiting significantly improved length extrapolation compared to DeltaNet. Additionally, we strengthen the theoretical foundation of DeltaNet by proving that it can solve dihedral group word problems in just two layers.
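To make the abstract's mechanism concrete, here is a minimal NumPy sketch of the per-token update it describes: nh online gradient-descent steps on an associative-recall loss, each step contributing one generalized Householder factor (I - beta * k k^T) to the state-transition matrix. The function name, shapes, and toy inputs are illustrative assumptions, not the authors' implementation; with nh = 1 the update reduces to the DeltaNet-style delta rule.

```python
import numpy as np

def deltaproduct_step(S, keys, values, betas):
    """Per-token update sketch: nh gradient-descent steps on the
    associative-recall loss 0.5 * ||S k - v||^2, one per (k, v, beta).
    Each step multiplies the state by (I - beta * k k^T), a generalized
    Householder factor, so the per-token transition is their product."""
    for k, v, beta in zip(keys, values, betas):
        # Gradient of 0.5 * ||S k - v||^2 with respect to S is (S k - v) k^T.
        S = S - beta * np.outer(S @ k - v, k)
    return S

# Toy usage (hypothetical sizes): d-dimensional state, nh = 2 micro-steps per token.
d, nh = 4, 2
rng = np.random.default_rng(0)
S = np.zeros((d, d))
keys = rng.standard_normal((nh, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # unit-norm keys
values = rng.standard_normal((nh, d))
betas = np.array([1.0, 1.0])                          # beta in [0, 2] keeps each factor a contraction
S = deltaproduct_step(S, keys, values, betas)
```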