HackerNews中文版

云端人工智能的定价是按令牌计费的。你的工作流程越有用，成本就越高。我构建了一种双模型编排模式，将80%的工作分配给一个免费的本地模型（Ollama上的Qwen3 8B，支持GPU加速），仅将合成/判断阶段发送到云API。一个包含50项研究的工作流程成本为0.15-0.40美元，而全云方案则为8-15美元。在重要的输出质量上，两者相同。技术栈：RTX 5080笔记本，使用Docker的Ollama（支持GPU直通），PostgreSQL，Redis，以及用于最后20%的Claude API。工作模式：本地扫描 → 本地评分 → 本地去重 → 通过云合成。四个阶段中，有三个是免费的。遇到的问题：Qwen3通过/api/generate的思考令牌（应使用/api/chat），Docker绑定仅支持IPv4，而Windows将localhost解析为IPv6，以及消费级显卡的GPU内存限制。欢迎在评论中分享架构细节。

查看原文

Cloud AI pricing is per-token. The more useful your pipeline, the more it costs. I built a dual-model orchestration pattern that routes 80% of work to a free local model (Qwen3 8B on Ollama, GPU-accelerated) and only sends the synthesis/judgment stage to a cloud API.Cost for a 50-item research pipeline: $0.15-0.40 vs $8-15 all-cloud. Same output quality where it matters.Stack: RTX 5080 laptop, Ollama in Docker with GPU passthrough, PostgreSQL, Redis, Claude API for the final 20%.The pattern: scan locally → score locally → deduplicate locally → synthesize via cloud. Four stages, three are free.Gotchas I hit: Qwen3's thinking tokens through /api/generate (use /api/chat instead), Docker binding to IPv4 only while Windows resolves localhost to IPv6, and GPU memory ceilings on consumer cards.Happy to share architecture details in comments.

替代我每月200美元人工智能订阅的2000美元笔记本电脑