The story of how I run an unlimited $6/month AI service on four RTX 3090s

4 | by yolo-auto | about 9 hours ago
This submission is a tale about how I launched an unlimited LLM provider to about 60 hyped people on the waitlist, then immediately served them a fully dysfunctional death-loop model. Most people, very reasonably, disappeared, but a few extremely nice people stuck around anyway, so we kept the project alive. It's still pretty chaotic, but it's gaining traction.

To back up a little: I believe the whole point of AI agents is that they should keep working. They should read files, retry, search, code, summarize, run tools, and loop until the job is done. When your employer is paying for it, who cares about cost? But when it comes to my personal money and hobbies, if every loop feels like a tiny financial event, you start babysitting the agent instead of using it, and it's not fun.

On the other hand, metered pricing makes me worry about using too much. Usage subscriptions make me feel like I need to use every last magical % or I'm "wasting it". If only an unlimited provider existed....

Then I joined the AMD developer program. I got some credits to spin up my own MI300x and started tinkering with vllm/sglang inference serving on AMD.

After learning about the AMD MI300x, I did some napkin math:

Renting an MI300x at $2.00 an hour = ~$1500 a month. It can probably support about 150 users on a small MoE model like qwen-35b-3a, maybe more.

$1500 / 150 = $10.00 per user per month, and we all get to play with agents for a small price.

You can oversubscribe a bit, so I landed on $6 per month, per user, for 2x generation slots, 128k context, no token limits, no rate limits.

I built the site and the router, made a waitlist, and then over-optimized the MI300x to the point where vllm bench had like 3k+ output and 40k+ throughput.... But I didn't test the final config/serve commands, and that's how I ended up with a disaster launch. You couldn't prompt the thing without it looping or bugging out. It was cursed. And that's where we lost a lot of people.

Luckily, my buddy had a few 3090s, so he threw me a lifeboat and began hosting qwen for us on 2x 3090s. We finally had an operational model that wasn't costing $2.00 an hour for our whopping 3 users.

We started gaining more users, so we moved up to 4x 3090s. We have plenty of room for more users, but even so, since then:

- we've configured vllm wrong like 15 times
- a GPU died
- we lost power
- I made a bunch of one-click starts for openclaw, hermes, and pi-mono, and none of them really work right, which probably drives people away. Those are still on our site right now.

...but people who know what they're doing seem to really like the price point. All in all we have like 98% uptime. It's been about a month. We've both learned a ton. Even with backgrounds in SWE/SE/AI, being on the hook for a couple of paying users forced us to really focus on delivering them a good product. And now I think we might be close to covering the power/hosting bill, so we're not operating at a loss (if you include the 3090 capex, we're still at a loss).

Our break-even point is moving to the cloud to max out an MI300x, which is now tuned and ready to go once we get the users.

And I'm finding that in some areas, subscribing to our service is cheaper than running the model yourself (but as someone who loves local models, I totally get it).

Since then, I've been working on a desktop agent that actually works with small models like qwen. That's going to replace the broken one-click starts. It's barebones, but it's something out of the box that just works. I made it open source; you can see what I'm talking about here: https://github.com/yolo-auto-org/yolo-auto-desktop. We're at yolo-auto.com, and we have an abysmal free tier to prove it works!

Anyway, hope you got a laugh or found it interesting! Drop a question if you have any.
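
For reference, the napkin math from the post can be written out as a minimal Python sketch. The $2.00/hour rate, the ~$1500/month figure, the 150-user estimate, and the $6 price all come from the post. The 730 hours-per-month constant and the function names are assumptions made up for illustration, not anything the service actually runs.

```python
import math

# Assumption for illustration: average hours in a month (8760 / 12).
HOURS_PER_MONTH = 730


def monthly_gpu_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of renting one GPU around the clock for a month."""
    return hourly_rate * hours


def cost_per_user(monthly_cost: float, users: int) -> float:
    """Monthly cost split evenly across a number of users."""
    if users <= 0:
        raise ValueError("users must be positive")
    return monthly_cost / users


def break_even_users(monthly_cost: float, price_per_user: float) -> int:
    """Smallest number of paying users whose fees cover the monthly cost."""
    if price_per_user <= 0:
        raise ValueError("price_per_user must be positive")
    return math.ceil(monthly_cost / price_per_user)


if __name__ == "__main__":
    exact = monthly_gpu_cost(2.00)   # the post rounds this up to ~$1500
    rounded = 1500.0                 # figure used in the post
    print(f"MI300x per month: ${exact:.2f} (post uses ~${rounded:.0f})")
    print(f"Cost per user at 150 users: ${cost_per_user(rounded, 150):.2f}")
    print(f"Users needed to break even at $6: {break_even_users(rounded, 6.00)}")
```

With the post's rounded $1500 figure, a $6 price needs 250 paying users to cover the GPU rental, versus the 150 the card was estimated to serve comfortably. That gap is the oversubscription the post is betting on: it only works if most users are not generating at the same time.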