Launch HN: Canary (YC W26) – AI QA agents that understand your code
Hey HN! We're Aakash and Viswesh, and we're building Canary (<a href="https://www.runcanary.ai">https://www.runcanary.ai</a>). We build AI agents that read your codebase, figure out what a pull request actually changed, and generate and execute tests for every affected user workflow.

Aakash and I previously built AI coding tools at Windsurf, Cognition, and Google. AI tools were making every team faster at shipping, but nobody was testing real user behavior before merge. PRs got bigger, reviews still happened in file diffs, and changes that looked clean broke checkout, auth, and billing in production. We saw it firsthand. We started Canary to close that gap. Here's how it works:

Canary starts by connecting to your codebase and learning how your app is built: routes, controllers, validation logic. You push a PR, and Canary reads the diff, understands the intent behind the changes, then generates and runs tests against your preview app, checking real user flows end to end. It comments directly on the PR with test results and recordings that show what changed, flagging anything that doesn't behave as expected. You can also trigger tests for specific user workflows via a PR comment.

Beyond PR testing, tests generated from a PR can be moved into regression suites. You can also create tests by prompting, in plain English, for what you want tested. Canary generates a full test suite from your codebase, schedules it, and runs it continuously. One of our construction tech customers had an invoicing flow where the amount due drifted from the original proposal total by ~$1,600. Canary caught the regression in their invoice flow before release.

This isn't something a single family of foundation models can do on its own. QA spans too many modalities for any single model to be specialized in: source code, DOM/ARIA trees, device emulators, visual verification, screen-recording analysis, network/console logs, and live browser state.
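As a rough illustration of the diff-to-workflow step, here is a minimal sketch. All file paths, routes, and workflow names below are invented for illustration; this is not Canary's actual implementation, which works at the code-analysis level rather than from a static table:

```python
# Hypothetical sketch: map a PR's changed files to the user workflows
# they can affect. The route index and workflow names are made up.

# Static index: source file -> routes it implements
ROUTE_INDEX = {
    "app/controllers/checkout.py": ["/checkout", "/cart"],
    "app/controllers/auth.py": ["/login", "/signup"],
    "app/billing/invoice.py": ["/invoices"],
}

# Which end-to-end user workflows exercise each route
WORKFLOWS = {
    "/checkout": ["guest_checkout", "saved_card_checkout"],
    "/cart": ["guest_checkout"],
    "/login": ["returning_user_login"],
    "/signup": ["new_user_onboarding"],
    "/invoices": ["monthly_invoicing"],
}

def affected_workflows(changed_files):
    """Return the set of user workflows a PR's changed files can affect."""
    workflows = set()
    for path in changed_files:
        for route in ROUTE_INDEX.get(path, []):
            workflows.update(WORKFLOWS.get(route, []))
    return workflows

# A change to the checkout controller touches both checkout workflows:
# affected_workflows(["app/controllers/checkout.py"])
# -> {"guest_checkout", "saved_card_checkout"}
```

The interesting engineering is in building that mapping automatically and keeping it current as the codebase changes, which is where static analysis of routes, controllers, and validation logic comes in.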
You also need custom browser fleets, user sessions, ephemeral environments, device farms, and data seeding to run the tests reliably. On top of that, catching the second-order effects of code changes requires a specialized harness that exercises the application in the many ways different types of users can break it, which a normal happy-path testing flow doesn't.

To measure how well our purpose-built QA agent works, we published QA-Bench v0, the first benchmark for code verification. Given a real PR, can an AI model identify every affected user workflow and produce relevant tests? We tested our agent against GPT 5.4, Claude Code (Opus 4.6), and Sonnet 4.6 across 35 real PRs on Grafana, Mattermost, Cal.com, and Apache Superset, scoring on three dimensions: Relevance, Coverage, and Coherence. Coverage is where the gap was largest: Canary leads GPT 5.4 by 11 points, Claude Code by 18, and Sonnet 4.6 by 26. For the full methodology and per-repo breakdowns, give our benchmark report a read: <a href="https://www.runcanary.ai/blog/qa-bench-v0">https://www.runcanary.ai/blog/qa-bench-v0</a>

You can check out the product demo here: <a href="https://youtu.be/NeD9g1do_BU" rel="nofollow">https://youtu.be/NeD9g1do_BU</a>

We'd love feedback from anyone working on code verification or thinking about how to measure this differently.
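For anyone thinking about measurement: a Coverage dimension like the one in QA-Bench can be read as recall over a ground-truth list of affected workflows. This is a simplified sketch of that reading, not the benchmark's actual rubric (which the report above describes):

```python
def coverage_score(predicted, ground_truth):
    """Recall: fraction of ground-truth affected workflows the model found.

    Simplified illustration of a coverage-style metric; the real
    QA-Bench rubric also scores Relevance and Coherence separately.
    """
    if not ground_truth:
        return 1.0  # nothing to find, so nothing was missed
    hits = set(predicted) & set(ground_truth)
    return len(hits) / len(ground_truth)

# A model that names 2 of 4 affected workflows scores 0.5 on coverage,
# regardless of how many extra (irrelevant) workflows it also names --
# penalizing those is what a separate relevance/precision score is for.
```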