Launch HN: General Instinct (YC P26) – Frontier models on edge devices
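Hey HN, Guanming and Bill here from General Instinct (<a href="https://general-instinct.com">https://general-instinct.com</a>).
<p>After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available.</p>
<p>The models that performed best were usually designed around datacenter assumptions: large GPUs, lots of memory bandwidth, and reliable network access. Most physical systems have the opposite constraints.</p>
<p>That led us to ask how much of a frontier model could be preserved while still making it practical to run on edge hardware.</p>
<p>As part of that work, we recently open sourced InstinctRazor (<a href="https://github.com/General-Instinct/InstinctRazor" rel="nofollow">https://github.com/General-Instinct/InstinctRazor</a>).</p>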
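<p>One result we're excited about is compressing Qwen3.5-122B-A10B, a roughly 245 GB BF16 MoE model, into a 48 GiB GGUF. The resulting model is actually smaller than Gemma-4-26B-A4B while outperforming it on benchmarks such as MMLU-Pro and GPQA-D. We preserve the parts that are always active (router, norms, Gated-DeltaNet/SSM layers, vision pathway, etc.) and quantize the routed experts much more aggressively. We then use on-policy distillation to recover capability lost during quantization.</p>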
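<p>To put those sizes in perspective, here is a rough back-of-the-envelope sketch of the average bits per weight they imply. It assumes "GB" means 10^9 bytes, "GiB" means 2^30 bytes, and that the whole GGUF file is weight data (metadata is ignored). The <code>blended_bits</code> helper and the 97% / 3-bit split are hypothetical values chosen only for illustration; they are not figures from the post or from InstinctRazor.</p>

```python
GB = 10**9    # decimal gigabyte, assumed for the "245 GB" figure
GIB = 2**30   # binary gibibyte, as stated for the "48 GiB" figure

bf16_bytes = 245 * GB
gguf_bytes = 48 * GIB

# BF16 stores each weight in 2 bytes, so the checkpoint size implies a parameter count.
n_params = bf16_bytes / 2

compression_ratio = bf16_bytes / gguf_bytes
avg_bits_per_weight = gguf_bytes * 8 / n_params


def blended_bits(expert_fraction, expert_bits, kept_bits=16.0):
    """Average bits per weight when only the routed experts are quantized."""
    return expert_fraction * expert_bits + (1.0 - expert_fraction) * kept_bits


print(f"{n_params / 1e9:.1f}B params, {compression_ratio:.2f}x smaller, "
      f"{avg_bits_per_weight:.2f} bits/weight on average")

# Hypothetical split, for illustration only: 97% of weights in routed experts at 3 bits.
print(f"{blended_bits(0.97, 3.0):.2f} bits/weight for the hypothetical split")
```

<p>The average lands below 4 bits per weight, which is only possible here because the routed experts hold most of the parameters: the always-active parts can stay at higher precision without moving the average much.</p>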
<p>The model can also run in a "small GPU" configuration where experts are streamed from system RAM. With an 8k context window, peak VRAM usage is around 7.6–8 GB.</p>
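<p>As a rough illustration of the general idea behind streaming experts from system RAM, the toy sketch below keeps a bounded number of experts "resident" and fetches the rest on demand with least-recently-used eviction. This is not InstinctRazor's implementation: <code>ExpertCache</code>, the capacity of 4, and the routing trace are all hypothetical, and plain strings stand in for weight tensors.</p>

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache: a bounded set of experts 'resident on the GPU'."""

    def __init__(self, host_experts, capacity):
        self.host_experts = host_experts  # stands in for weights held in system RAM
        self.capacity = capacity          # max experts resident at once
        self.resident = OrderedDict()     # expert_id -> weights, least recent first
        self.transfers = 0                # simulated host-to-device copies

    def get(self, expert_id):
        if expert_id in self.resident:
            # Hit: mark as most recently used, no copy needed.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            # Full: evict the least recently used expert.
            self.resident.popitem(last=False)
        # Miss: "copy" the expert from host memory.
        self.resident[expert_id] = self.host_experts[expert_id]
        self.transfers += 1
        return self.resident[expert_id]


# Hypothetical model with 8 experts, 2 routed per token, room for 4 on the device.
host = {i: f"weights-{i}" for i in range(8)}
cache = ExpertCache(host, capacity=4)
for routed in [(0, 1), (1, 2), (0, 3), (4, 5), (0, 1)]:
    for expert_id in routed:
        cache.get(expert_id)

print(f"{cache.transfers} transfers for 10 expert lookups")
```

<p>In a setup like this, device memory is bounded by the cache capacity plus the always-active layers and the KV cache, and the cost shifts to host-to-device transfers whenever the router picks an expert that is not resident.</p>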
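<p>If you're interested in the technical details, we wrote up the approach here (<a href="https://general-instinct.com/blog/frontier-moe-sub-4-bit">https://general-instinct.com/blog/frontier-moe-sub-4-bit</a>).</p>
<p>We're especially interested in hearing from people deploying models onto robots or other edge devices. What models are you trying to run locally today? What has been the biggest bottleneck in getting them into production?</p>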