Show HN: How we run 60 Hugging Face models on 2 GPUs

2 | Author: pveldandi | 5 days ago | Original post
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.

We experimented with a different approach.

Instead of pinning one model to one GPU, we:

- Stage model weights on fast local disk
- Load models into GPU memory only when requested
- Keep a small working set resident
- Evict inactive models aggressively
- Route everything through a single OpenAI-compatible endpoint

(A rough sketch of this load-on-demand/eviction loop is shown below.)

In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed.

Cold starts still exist. Larger models take seconds to restore. But by avoiding warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.

Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk

Live demo to play with: https://inferx.net:8443/demo/

If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.
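To make the mechanism concrete, here is a minimal sketch of the stage-on-disk / load-on-request / LRU-evict loop described above. This is not our actual serving code: the cache path, the working-set size of 3, fp16 loading, and the helper names (stage, acquire, generate) are placeholders for illustration only.

    # Minimal sketch (not the real implementation): stage weights on fast
    # local disk, load a model into VRAM only when a request names it, and
    # evict the least-recently-used model once the resident set is full.
    from collections import OrderedDict

    import torch
    from huggingface_hub import snapshot_download
    from transformers import AutoModelForCausalLM, AutoTokenizer

    LOCAL_CACHE = "/nvme/hf-cache"   # assumed fast local disk path
    MAX_RESIDENT = 3                 # assumed VRAM working-set size

    resident = OrderedDict()         # model_id -> (model, tokenizer), LRU order


    def stage(model_id):
        """Pre-download weights to local disk so cold starts skip the network."""
        return snapshot_download(repo_id=model_id, cache_dir=LOCAL_CACHE)


    def acquire(model_id):
        """Return a VRAM-resident model, loading and evicting as needed."""
        if model_id in resident:
            resident.move_to_end(model_id)        # mark as most recently used
            return resident[model_id]

        while len(resident) >= MAX_RESIDENT:      # evict LRU models first
            evicted_id, evicted = resident.popitem(last=False)
            del evicted                           # drop the last reference to the weights
            torch.cuda.empty_cache()              # then release cached VRAM
            print(f"evicted {evicted_id}")

        path = stage(model_id)                    # no-op if already on disk
        tok = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(
            path, torch_dtype=torch.float16, device_map="auto"
        )
        resident[model_id] = (model, tok)
        return resident[model_id]


    def generate(model_id, prompt, max_new_tokens=64):
        model, tok = acquire(model_id)
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tok.decode(out[0], skip_special_tokens=True)

A production gateway also needs request queueing, per-GPU placement, and faster restore paths (larger models still take seconds to come back); the sketch only shows the cache discipline.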
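On the client side, the single OpenAI-compatible endpoint means callers just name a model and the router handles residency. A hedged example with the standard openai Python client follows; the base_url, api_key, and model name are placeholders, not the demo's actual values.

    # Hypothetical client call: base_url, api_key, and model name are
    # placeholders. The gateway behind the endpoint loads/evicts models;
    # the caller only names one in the request.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-gateway.example/v1",  # assumed gateway URL
        api_key="sk-anything",                       # many gateways ignore the key
    )

    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # any of the staged models
        messages=[{"role": "user", "content": "One-sentence summary of LRU eviction?"}],
    )
    print(resp.choices[0].message.content)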