Show HN: How we run 60 Hugging Face models on 2 GPUs
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.
We experimented with a different approach.
Instead of pinning one model to one GPU, we do the following (a rough sketch of the load/evict loop follows the list):
- Stage model weights on fast local disk
- Load models into GPU memory only when requested
- Keep a small working set resident
- Evict inactive models aggressively
- Route everything through a single OpenAI-compatible endpoint
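To make the list above concrete, here is a minimal sketch of the load-on-request / evict-when-idle loop. It is not our actual implementation: the pool size, staging path, and `ModelPool` helper are illustrative assumptions, and a plain LRU policy stands in for whatever eviction heuristic a real system would use.

```python
# Minimal sketch (not production code): an LRU-managed working set of models
# that are loaded into GPU memory on demand and evicted when the set is full.
from collections import OrderedDict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_RESIDENT = 3                 # assumed working-set size
WEIGHTS_DIR = "/nvme/models"     # assumed fast local-disk staging area

class ModelPool:
    def __init__(self, max_resident: int = MAX_RESIDENT):
        self.max_resident = max_resident
        self.resident = OrderedDict()  # model_id -> (model, tokenizer)

    def get(self, model_id: str):
        # Hot path: model already in VRAM; mark it most recently used.
        if model_id in self.resident:
            self.resident.move_to_end(model_id)
            return self.resident[model_id]

        # Evict least-recently-used models until there is room.
        while len(self.resident) >= self.max_resident:
            _, (old_model, _) = self.resident.popitem(last=False)
            del old_model
            torch.cuda.empty_cache()

        # Cold path: restore weights from local disk into GPU memory.
        tokenizer = AutoTokenizer.from_pretrained(f"{WEIGHTS_DIR}/{model_id}")
        model = AutoModelForCausalLM.from_pretrained(
            f"{WEIGHTS_DIR}/{model_id}", torch_dtype=torch.float16
        ).to("cuda")
        self.resident[model_id] = (model, tokenizer)
        return self.resident[model_id]
```

A real router also has to handle concurrent requests, per-GPU placement, and restoring from a faster snapshot format, but the control flow has the same shape.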
In our recent test setup (2×A6000, 48 GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored on demand.
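On the client side, "route everything through a single OpenAI-compatible endpoint" just means the target model is chosen per request via the standard `model` field. A hedged example, where the base URL, API key, and model name are placeholders rather than our real endpoint:

```python
# Sketch of a client call against an assumed OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://your-host/v1", api_key="not-needed")

# Any of the staged models can be named here; if it is not currently
# resident, the first request pays the cold-start cost while it loads.
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello from a cold model!"}],
)
print(resp.choices[0].message.content)
```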
Cold starts still exist, and larger models take a few seconds to restore. But by avoiding per-model warm pools and dedicated GPUs, overall utilization improves significantly under light load.
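For a rough sense of where those seconds go (assumed numbers, not measurements): restore time is dominated by moving the weights from local disk into VRAM, so it scales with model size over effective transfer bandwidth.

```python
# Back-of-envelope restore-time estimate; all figures are assumptions.
params_billion = 7        # e.g. a 7B-parameter model
bytes_per_param = 2       # fp16 weights
bandwidth_gb_per_s = 5    # assumed effective NVMe -> VRAM throughput
restore_seconds = params_billion * bytes_per_param / bandwidth_gb_per_s
print(f"~{restore_seconds:.1f} s")  # ~2.8 s under these assumptions
```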
Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk
Live demo to play with: https://inferx.net:8443/demo/
If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.