Hacker News Chinese Edition



1. Specs and plans are source code. Specs and plans live in git alongside the source code, not in chat history. A new agent reads arch.md for the big picture, then its own spec. You always know why something was built.

2. Three models review every phase. Claude, Gemini, and Codex catch almost entirely different bugs. No single model found more than 55% of the issues. If you only review with the model that wrote the code, you're missing half the bugs. 20 bugs were caught before shipping: Claude Code found 5, and Gemini and Codex caught another 15, including a severe security issue Claude missed.

3. Enforce the process, don't suggest it. A state machine forces Spec → Plan → Implement → Review → PR. The AI can't skip steps, and tests must pass before advancing. AIs don't stick to the plan by themselves; you need rails.

4. Annotate, don't edit. Most of the work is writing specs and reviews that guide the code, not hacking at files in an open-ended chat.

5. Agents coordinate agents. An architect agent spawns builder agents into isolated git worktrees. You direct the architect; it directs the builders. They message each other asynchronously.

6. Manage the whole lifecycle. Most AI tools help you write code faster, which is maybe 30% of the job. The other 70% is planning the approach, reviewing, integrating, writing deployment scripts, and managing staging vs. prod. Have AI run the whole pipeline from spec to PR and beyond.

Overall result: one engineer can produce what a team of 3-4 would usually do. We measured code quality 1.2 points higher on a 10-point scale vs. Claude Code. Downsides: it takes a lot longer and uses many more tokens, but the cost is still reasonable at $1.60 per PR.

We open sourced it: https://github.com/cluesmith/codev

More details and raw results: https://cluesmith.com/blog/a-tour-of-codevos/
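To make practice 3 concrete, here is a minimal Python sketch of a phase gate that refuses to skip steps or advance on failing tests. It is an illustration under stated assumptions, not Codev's actual implementation: the names `Phase`, `Workflow`, `GateError`, and `advance` are hypothetical, and the real tool may model its state very differently. Only the phase order (Spec → Plan → Implement → Review → PR) and the "tests must pass" rule come from the post.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    SPEC = "spec"
    PLAN = "plan"
    IMPLEMENT = "implement"
    REVIEW = "review"
    PR = "pr"


# Fixed order taken from the post: Spec -> Plan -> Implement -> Review -> PR.
ORDER = [Phase.SPEC, Phase.PLAN, Phase.IMPLEMENT, Phase.REVIEW, Phase.PR]


class GateError(Exception):
    """Raised when a caller tries to skip a phase or advance with failing tests."""


@dataclass
class Workflow:
    # Hypothetical container for one unit of work; starts at the spec phase.
    phase: Phase = Phase.SPEC
    history: list = field(default_factory=list)

    def advance(self, target: Phase, tests_passed: bool) -> Phase:
        """Move to the next phase only if it is the immediate successor
        of the current one and the tests for the current phase passed."""
        index = ORDER.index(self.phase)
        if index == len(ORDER) - 1:
            raise GateError("workflow already finished at the PR phase")
        expected = ORDER[index + 1]
        if target is not expected:
            raise GateError(
                f"cannot jump from {self.phase.value} to {target.value}; "
                f"next phase is {expected.value}"
            )
        if not tests_passed:
            raise GateError(f"tests must pass before leaving {self.phase.value}")
        self.history.append(self.phase)
        self.phase = target
        return self.phase
```

With this shape, a call such as `Workflow().advance(Phase.IMPLEMENT, tests_passed=True)` raises `GateError` because Plan was skipped. The point of the design is that the agent requests a transition and the harness decides, so the rule is enforced rather than suggested.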

6 practices that turn AI from a prototype into a workhorse (106 PRs in 14 days)