Show HN: How we run 60 Hugging Face models on 2 GPUs
Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.
We experimented with a different approach.
Instead of pinning one model to one GPU, we do the following (a rough sketch of the load/evict loop follows the list):
- Stage model weights on fast local disk
- Load models into GPU memory only when requested
- Keep a small working set resident
- Evict inactive models aggressively
- Route everything through a single OpenAI-compatible endpoint
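To make the list above concrete, here is a minimal sketch of the load-on-request / evict-when-idle loop. It is not our actual implementation: the pool size, staging path, and `ModelPool` helper are illustrative assumptions, and a plain LRU policy stands in for whatever eviction heuristic a real system would use.

```python
# Minimal sketch (not production code): an LRU-managed working set of models
# that are loaded into GPU memory on demand and evicted when the set is full.
from collections import OrderedDict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_RESIDENT = 3                 # assumed working-set size
WEIGHTS_DIR = "/nvme/models"     # assumed fast local-disk staging area

class ModelPool:
    def __init__(self, max_resident: int = MAX_RESIDENT):
        self.max_resident = max_resident
        self.resident = OrderedDict()  # model_id -> (model, tokenizer)

    def get(self, model_id: str):
        # Hot path: model already in VRAM; mark it most recently used.
        if model_id in self.resident:
            self.resident.move_to_end(model_id)
            return self.resident[model_id]

        # Evict least-recently-used models until there is room.
        while len(self.resident) >= self.max_resident:
            _, (old_model, _) = self.resident.popitem(last=False)
            del old_model
            torch.cuda.empty_cache()

        # Cold path: restore weights from local disk into GPU memory.
        tokenizer = AutoTokenizer.from_pretrained(f"{WEIGHTS_DIR}/{model_id}")
        model = AutoModelForCausalLM.from_pretrained(
            f"{WEIGHTS_DIR}/{model_id}", torch_dtype=torch.float16
        ).to("cuda")
        self.resident[model_id] = (model, tokenizer)
        return self.resident[model_id]
```

A real router also has to handle concurrent requests, per-GPU placement, and restoring from a faster snapshot format, but the control flow has the same shape.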
In our recent test setup (2×A6000, 48 GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored on demand.
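On the client side, "route everything through a single OpenAI-compatible endpoint" just means the target model is chosen per request via the standard `model` field. A hedged example, where the base URL, API key, and model name are placeholders rather than our real endpoint:

```python
# Sketch of a client call against an assumed OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://your-host/v1", api_key="not-needed")

# Any of the staged models can be named here; if it is not currently
# resident, the first request pays the cold-start cost while it loads.
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello from a cold model!"}],
)
print(resp.choices[0].message.content)
```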
Cold starts still exist, and larger models take a few seconds to restore. But by avoiding per-model warm pools and dedicated GPUs, overall utilization improves significantly under light load.
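For a rough sense of where those seconds go (assumed numbers, not measurements): restore time is dominated by moving the weights from local disk into VRAM, so it scales with model size over effective transfer bandwidth.

```python
# Back-of-envelope restore-time estimate; all figures are assumptions.
params_billion = 7        # e.g. a 7B-parameter model
bytes_per_param = 2       # fp16 weights
bandwidth_gb_per_s = 5    # assumed effective NVMe -> VRAM throughput
restore_seconds = params_billion * bytes_per_param / bandwidth_gb_per_s
print(f"~{restore_seconds:.1f} s")  # ~2.8 s under these assumptions
```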
Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk
Live demo to play with: https://inferx.net:8443/demo/
If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.