ARCHE3-7B – a sparse MoE with a smart router and Foundation Curriculum Training

Author: OpenSynapseLabs · 8 days ago
This is my first post on HN: a bit nervous, but excited to share what I've been building.

I've been working on a 7B sparse Mixture-of-Experts prototype that can actually run on consumer hardware. For example, on a Colab T4 it uses around 5 GB of RAM and 5 GB of VRAM during training, and roughly 3.5–5 GB for inference.

A couple of things I spent a lot of time on:

**Routing (SmartRouter)**

I tried to tackle routing collapse in a practical way. Instead of letting all tokens dump into a few "favorite" experts, I combined several techniques: a load-balancing loss, an entropy bonus to keep the routing distribution flat, jitter noise during training, and a learnable temperature. It works surprisingly well at keeping a large fraction of the experts active. I've open-sourced the router code (hive_router.py) if anyone wants to look at the math or grab it for their own project.

**Foundation Curriculum Training (FCT)**

Before standard pretraining, I run the model through structured reasoning patterns: currently 290 of them across 14 cognitive domains. Each pattern follows a strict sequence: OBSERVE → PRIOR → UPDATE → RIPPLE → ANALOGY → ACT.

To make this actually run on my setup, I use a couple of specific tricks. First, a Target-Only Loss: I mask out the tags and inputs and only compute gradients on the actual reasoning payloads such as UPDATE or ACT. Second, I had to write a custom SparseExpertAdamW that only instantiates optimizer states for the experts that are actually active on a given step. Without it, the optimizer states for 20,480 experts would have crushed my RAM.

So far I've completed 5 of the 14 domains.
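If you want the gist of the routing recipe without opening hive_router.py, here's a simplified PyTorch sketch (this is my illustration, not the actual hive_router.py code; the class name and loss scalings are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmartRouterSketch(nn.Module):
    """Toy top-k router combining the anti-collapse tricks: load-balancing
    loss, entropy bonus, training-time jitter noise, learnable temperature."""
    def __init__(self, d_model, n_experts, top_k=2, jitter=0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.log_temp = nn.Parameter(torch.zeros(1))  # learnable temperature
        self.top_k, self.jitter = top_k, jitter

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.gate(x)
        if self.training:                      # jitter noise, training only
            logits = logits + torch.randn_like(logits) * self.jitter
        logits = logits / self.log_temp.exp()  # temperature scaling
        probs = F.softmax(logits, dim=-1)

        # Switch-style load-balancing loss: fraction of tokens routed to
        # each expert times the mean router probability for that expert
        top_vals, top_idx = probs.topk(self.top_k, dim=-1)
        n_experts = probs.shape[-1]
        routed = F.one_hot(top_idx, n_experts).float().sum(1)
        lb_loss = n_experts * (routed.mean(0) * probs.mean(0)).sum()

        # entropy bonus: reward flat distributions (subtract from total loss)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

        weights = top_vals / top_vals.sum(-1, keepdim=True)  # renormalize
        return top_idx, weights, lb_loss, entropy
```

The dynamic Top-K part is just making `top_k` a function of the token (in my case between 2 and 4) instead of a constant.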
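The Target-Only Loss is essentially label masking with PyTorch's `ignore_index`. Roughly (a simplified sketch, not my exact training code):

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # the ignore_index that F.cross_entropy skips by default

def target_only_labels(token_ids, payload_mask):
    """Keep labels only on the reasoning payload (e.g. UPDATE/ACT spans);
    tags and inputs get IGNORE so they contribute no gradient."""
    labels = token_ids.clone()
    labels[~payload_mask] = IGNORE
    return labels

def masked_lm_loss(logits, labels):
    # logits: (batch, seq, vocab), labels: (batch, seq)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=IGNORE,
    )
```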
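And the core trick behind SparseExpertAdamW, in simplified form: allocate the Adam moment buffers lazily, only for parameters that actually received a gradient this step. (The real implementation also handles expert grouping and memory-mapped weights; this sketch just shows the lazy-state idea.)

```python
import torch

class LazyExpertAdamW:
    """Sketch: AdamW whose per-parameter states are created on demand,
    so thousands of mostly-inactive experts cost no optimizer memory."""
    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
        self.lr, self.betas, self.eps, self.wd = lr, betas, eps, weight_decay
        self.state = {}  # param id -> {"step", "m", "v"}, allocated lazily

    @torch.no_grad()
    def step(self, params):
        b1, b2 = self.betas
        for p in params:
            if p.grad is None:          # inactive expert: allocate nothing
                continue
            s = self.state.setdefault(id(p), {
                "step": 0,
                "m": torch.zeros_like(p),
                "v": torch.zeros_like(p),
            })
            s["step"] += 1
            p.mul_(1 - self.lr * self.wd)               # decoupled weight decay
            s["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
            s["v"].mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)
            m_hat = s["m"] / (1 - b1 ** s["step"])
            v_hat = s["v"] / (1 - b2 ** s["step"])
            p.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)
```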
One cool thing: every new domain starts at a lower loss than the previous one (for example, the Systems domain went from 2.149 down to 0.941), which suggests that cross-domain transfer is actually happening.

**The architecture in short:**

- d_model = 2048
- 10 layers (5 Dense Core + 5 Fusion)
- 20,480 experts (8 domains × 2560)
- Dynamic Top-K (2–4)
- memory-mapped weights + Dopamine Learning v1

The model is up on HuggingFace: https://huggingface.co/OpenSynapseLabs/arche3-7b
Benchmarks and graphs are on GitHub: https://github.com/OpenSynapseLabs/arche3-benchmarks

**Limitations (to be honest):**

I haven't run standard benchmarks yet (MMLU, GSM8K, HumanEval), only 5/14 FCT domains are done, and the dataset is still small and needs proper scaling. This is also a solo project so far. I did use Gemini and Claude to speed up parts of the implementation, but the architecture and core ideas are my own.

I'd really appreciate any feedback, especially if you're into routing in MoE models, curriculum pretraining, or scaling this further (I'm thinking about 35B next).

My main goal is to build systems that amplify human thinking, not replace it. If that sounds like something you'd want to mess around with or contribute to, feel free to reach out at opensynapselabs@proton.me. I'm happy to share more details and the private repo.

Thanks for reading!