6 practices that turned AI from a prototyper into a workhorse (106 PRs in 14 days)
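1. <i>Specs and plans are source code</i>: Specs and plans live in git alongside the source code, not in chat history. A new agent reads arch.md for the big picture, then its own specific spec. You always know why something was built.
2. <i>Three models review every phase</i>: Claude, Gemini, and Codex catch almost entirely different bugs. No single model found more than 55% of the issues. If you only review with the model that wrote the code, you're missing half the bugs. In total, 20 bugs were caught before shipping: Claude Code found 5, and Gemini and Codex caught another 15, including a severe security issue that Claude missed.
3. <i>Enforce the process, don't suggest it</i>. A state machine forces Spec → Plan → Implement → Review → PR. The AI can't skip steps, and tests must pass before it advances. AIs don't stick to the plan on their own; you need rails.
4. <i>Annotate, don't edit</i>. Most of the work is writing the specs and reviews that guide the code, not hacking at files in an open-ended chat.
5. <i>Agents coordinate agents</i>. An architect agent spawns builder agents into isolated git worktrees. You direct the architect; it directs the builders. They message each other asynchronously.
6. <i>Manage the whole lifecycle</i>. Most AI tools help you write code faster, which is maybe 30% of the job. The other 70% is planning the approach, reviewing, integrating, writing deployment scripts, and managing staging vs. prod. Have AI run the whole pipeline from spec to PR and beyond.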
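The enforcement idea in practice 3 can be illustrated with a minimal sketch. This is not the codev implementation; the names `Phase`, `Workflow`, `PhaseError`, and the `tests_passed` flag are hypothetical and exist only to show how a state machine can refuse skipped steps and failing gates.

```python
from enum import Enum


class Phase(Enum):
    # Phase order taken from the post: Spec -> Plan -> Implement -> Review -> PR
    SPEC = 1
    PLAN = 2
    IMPLEMENT = 3
    REVIEW = 4
    PR = 5


class PhaseError(Exception):
    """Raised when a transition would skip a step or bypass a failing gate."""


class Workflow:
    """Hypothetical workflow tracker; starts at SPEC and only moves forward."""

    def __init__(self):
        self.phase = Phase.SPEC

    def advance(self, target, tests_passed):
        # Only the immediately following phase is a legal target.
        if target.value != self.phase.value + 1:
            raise PhaseError(
                f"cannot go from {self.phase.name} to {target.name}"
            )
        # The gate: tests must pass before the workflow moves on.
        if not tests_passed:
            raise PhaseError(
                f"tests must pass before leaving {self.phase.name}"
            )
        self.phase = target
        return self.phase
```

With this shape, an agent that asks to jump from SPEC straight to IMPLEMENT gets an exception rather than a warning, which is the difference between enforcing a process and suggesting it.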
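<i>Overall result</i>: One engineer was able to produce what a team of 3-4 would usually do. Measured code quality was 1.2 points higher on a 10-point scale than with Claude Code alone. Downsides: it takes a lot longer and uses many more tokens, but the cost is still reasonable at $1.60 per PR.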
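The multi-model review numbers from practice 2 can be checked with a small sketch. It is an illustration only: `merge_findings` is a hypothetical helper, the bug IDs are placeholders, and Gemini and Codex are grouped together because the post reports only their combined count of 15. It also assumes the 15 are distinct from Claude Code's 5, as "another 15" suggests.

```python
def merge_findings(findings_by_reviewer):
    """Return the union of all findings and each reviewer's share of it."""
    all_findings = set().union(*findings_by_reviewer.values())
    shares = {
        reviewer: (len(found) / len(all_findings)) if all_findings else 0.0
        for reviewer, found in findings_by_reviewer.items()
    }
    return all_findings, shares


# Placeholder IDs; only the counts (5 and 15) come from the post.
findings = {
    "claude": {f"bug-{i}" for i in range(1, 6)},
    "gemini+codex": {f"bug-{i}" for i in range(6, 21)},
}

all_findings, shares = merge_findings(findings)
print(len(all_findings), shares["claude"])  # 20 0.25
```

Under these assumptions, reviewing only with the authoring model would have surfaced 5 of the 20 issues, so the union across reviewers is what matters, not any single reviewer's list.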
We open sourced it: https://github.com/cluesmith/codev
More details and raw results: https://cluesmith.com/blog/a-tour-of-codevos/