Show HN: SGR – a linear-complexity "Living Cell" that outperforms Transformers

Author: MrPan, 12 days ago
I am developing an architecture called Sparse Gated Resonance (SGR). It is a sequence-modeling approach designed to avoid the quadratic scaling of traditional self-attention. I have been benchmarking a 722k-parameter SGR against a 921k-parameter Transformer on Victor Hugo's "Notre-Dame de Paris" (English).

SGR replaces the attention mechanism with a "Causal Pulse." It uses gated 1D convolutions to generate a navigation vector that resonates against a brain-map of character embeddings. This allows the model to maintain a "Living Cell" state that updates with linear complexity (a simplified sketch of this idea appears at the end of the post).

Full source and implementation: [https://github.com/MrPan2048/GeometricTransformer](https://github.com/MrPan2048/GeometricTransformer)

Benchmarking data (*Notre-Dame de Paris*):

| Step | Architecture | Loss | PPL | Entropy | Time |
|------|--------------|--------|------|---------|---------|
| 3900 | SGR | 1.4481 | 4.26 | 1.5476 | 19.0 ms |
| 3900 | STD (Transformer) | 2.0275 | 7.59 | 2.1476 | 40.3 ms |

Semantic comparison (generation from the prompt "Quasimodo"):

SGR: "Quasimodo. Then minds that the accasteady which which the"
STD: "Quasimododo ng, o uer tre the todo hemo'He wand at tine."

Technical observations:

- Computational efficiency: SGR maintains a significant latency advantage, consistently running at ~19 ms compared to the Transformer's ~40 ms. This confirms the efficiency of the linear pulse over quadratic attention.
- Convergence quality: By step 3700, SGR had reached a perplexity (PPL) of 4.46, whereas the Transformer lagged at 8.36. SGR produces recognizable English phrases and punctuation, while the Transformer still exhibits "stuttering" artifacts (e.g., "Quasimodododod").
- Entropy stability: SGR's output entropy has stabilized at ~1.54, which represents the optimal "Mastery Zone" for English text, whereas the Transformer's higher entropy (~2.14) correlates with its lack of structural coherence.

I am seeking an endorsement to publish a formal paper on this architecture to arXiv (cs.LG). I believe these results demonstrate that "Living Cell" resonance models can outperform attention in parameter-constrained and latency-sensitive environments. If you are a researcher willing to endorse or review the mathematical formalization, please contact me via GitHub.
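To make the "Causal Pulse" mechanism more concrete, here is a minimal, illustrative sketch in PyTorch. It is not the code from the repository: it omits the recurrent "Living Cell" state, and the class name `CausalPulseSketch`, the GLU-style gating, and all shapes and hyperparameters are simplifying assumptions, shown only to illustrate a gated causal convolution producing a "navigation" vector that is scored ("resonated") against the character-embedding table.

```python
# Hypothetical sketch of a "Causal Pulse"-style block: a gated, causal 1D
# convolution produces a per-position "navigation" vector, which is then
# scored against the character-embedding table (the "brain map").
# Names, shapes, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalPulseSketch(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # "brain map" of characters
        self.kernel_size = kernel_size
        # Two convolutions: one for content, one for the gate (GLU-style gating).
        self.conv_content = nn.Conv1d(d_model, d_model, kernel_size)
        self.conv_gate = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer character ids
        x = self.embed(tokens)                 # (B, T, D)
        x = x.transpose(1, 2)                  # (B, D, T) for Conv1d
        # Left-pad so each position only sees past/current inputs (causal).
        x = F.pad(x, (self.kernel_size - 1, 0))
        nav = self.conv_content(x) * torch.sigmoid(self.conv_gate(x))  # gated conv
        nav = nav.transpose(1, 2)              # (B, T, D) "navigation" vectors
        # "Resonance": score each navigation vector against every character
        # embedding; this is a per-position (D x V) product, linear in T.
        logits = nav @ self.embed.weight.T     # (B, T, vocab_size)
        return logits

# Example: next-character logits for a toy batch.
model = CausalPulseSketch(vocab_size=96, d_model=64)
tokens = torch.randint(0, 96, (2, 128))
print(model(tokens).shape)  # torch.Size([2, 128, 96])
```

The relevant point is the final matrix product: scoring every position against the fixed embedding table costs O(T·D·V), i.e. linear in sequence length T, whereas self-attention's pairwise score matrix costs O(T²·D).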
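As a quick sanity check on the reported metrics (assuming the loss column is mean cross-entropy in nats), perplexity should simply be exp(loss), and the table's values are consistent with that:

```python
import math

# Assuming the reported loss is mean cross-entropy in nats, perplexity = exp(loss).
for arch, loss in [("SGR", 1.4481), ("STD", 2.0275)]:
    print(f"{arch}: PPL = {math.exp(loss):.3f}")
# SGR: PPL = 4.255, STD: PPL = 7.595, i.e. the table's 4.26 and 7.59 up to rounding.
```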