Show HN: InferX – an AI-native OS that runs 50 LLMs per GPU with hot-swapping

Author: pveldandi · 7 months ago · original post
Hey folks, we've been building InferX, an AI-native runtime that snapshots the full GPU execution state of LLMs (weights, KV cache, CUDA context) and restores it in under 2 seconds. This lets us hot-swap models like threads: no reloading, no cold starts.

We treat each model as a lightweight, resumable process, like an OS for LLM inference.

Why it matters:

• Run 50+ LLMs per GPU (7B–13B range)
• 90% GPU utilization (vs. ~30–40% with conventional setups)
• Avoids cold starts by snapshotting and restoring directly on the GPU
• Designed for agentic workflows, toolchains, and multi-tenant use cases
• Helpful for Codex CLI-style orchestration or bursty multi-model apps

Still early, but we're seeing strong interest from builders and infra folks. Would love thoughts, feedback, or edge cases you'd want to see tested.

Demo: https://inferx.net
X: @InferXai
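
To make the mechanism concrete, here is a minimal sketch of the snapshot/restore idea, assuming a PyTorch-style stack: park a model's GPU-resident tensors (weights and KV cache) in pinned host memory, then copy them back on demand so a "cold start" collapses to a bulk device transfer. The function names and toy state dicts below are hypothetical illustrations, not InferX's actual API, and capturing the CUDA context itself (which the post says InferX also snapshots) is beyond what plain PyTorch can show here.

    import torch

    def snapshot_to_host(gpu_state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Copy GPU-resident tensors (weights, KV cache) into pinned host memory.
        # Page-locked buffers make the later host-to-device copy fast and async.
        return {name: t.detach().cpu().pin_memory() for name, t in gpu_state.items()}

    def restore_to_gpu(host_state: dict[str, torch.Tensor],
                       device: str = "cuda") -> dict[str, torch.Tensor]:
        # Copy a parked snapshot back onto the GPU. No disk I/O and no framework
        # re-initialization: the restore cost is essentially one bulk PCIe transfer.
        gpu_state = {name: t.to(device, non_blocking=True)
                     for name, t in host_state.items()}
        torch.cuda.synchronize()  # make sure the async copies have landed
        return gpu_state

    if __name__ == "__main__":
        # Toy stand-ins for two models' GPU state (weights + KV cache tensors).
        model_a = {"weights": torch.randn(1024, 1024, device="cuda"),
                   "kv_cache": torch.randn(32, 128, 128, device="cuda")}
        model_b_parked = snapshot_to_host(
            {"weights": torch.randn(1024, 1024, device="cuda")})

        # Hot-swap: park A in host RAM, free its GPU memory, bring B back in.
        parked_a = snapshot_to_host(model_a)
        del model_a
        torch.cuda.empty_cache()
        model_b = restore_to_gpu(model_b_parked)

If something like this is the underlying design, the restore path is bounded by memory bandwidth rather than by model initialization, which is what would make sub-2-second swaps plausible for models in the 7B–13B range.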