What went wrong when I tried to evaluate an AI agent in production
I tried to evaluate an AI agent using a benchmark-style approach.

It failed in ways I didn't expect.

Instead of model quality issues, most failures came from system-level problems. A few examples from a small test suite:

- Broken URLs in tool calls → score dropped to 22
- Agent calling localhost in a cloud environment → got stuck at 46
- Real CVEs flagged as hallucinations → an evaluation issue, not a model issue
- Reddit blocking requests → external dependency failure
- Missing API key in production → silent failure

Each run surfaced a real bug, but not the kind I was originally trying to measure.

What surprised me is that evaluating agents isn't just about scoring outputs. It's about validating the entire system: tools, environment, data access, and how the agent interacts with all of it.

In other words, most of the failure modes looked more like software bugs than LLM mistakes.

This made me think that evaluation loops for agents should look more like software testing than benchmarking:
- repeatable test suites
- clear pass/fail criteria
- regression detection
- root cause analysis

Otherwise it's very easy to misattribute failures to the model when they're actually coming from somewhere else.

I ended up building a small tool to structure this process, but the bigger takeaway for me is how messy real-world agent evaluation actually is compared to standard benchmarks.

Curious how others are approaching this, especially in production settings.
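For concreteness, here is a minimal sketch of what such a test-suite-style eval loop might look like. All names here (`run_case`, `Case`, the fake agent) are hypothetical stand-ins, not the tool described above; the point is the structure: explicit pass/fail criteria per case, exceptions and empty outputs attributed to the system layer rather than the model, and regression detection against a previous run.

```python
# Sketch of an agent eval loop organized like a software test suite.
# Assumption: the agent is any callable that takes a prompt and returns text.
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Verdict(Enum):
    PASS = "pass"
    SYSTEM_FAIL = "system"   # broken tool, env misconfig, missing key
    MODEL_FAIL = "model"     # wrong or hallucinated output

@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]   # explicit pass/fail criterion

@dataclass
class Result:
    name: str
    verdict: Verdict
    detail: str

def run_case(agent: Callable[[str], str], case: Case) -> Result:
    try:
        output = agent(case.prompt)
    except Exception as e:
        # Tool/environment exceptions are system bugs, not model mistakes.
        return Result(case.name, Verdict.SYSTEM_FAIL, repr(e))
    if not output or not output.strip():
        # Empty output usually means a silent failure upstream.
        return Result(case.name, Verdict.SYSTEM_FAIL, "empty output")
    if not case.check(output):
        return Result(case.name, Verdict.MODEL_FAIL, output[:80])
    return Result(case.name, Verdict.PASS, "")

def detect_regressions(previous: dict, current: list) -> list:
    # A case that passed in the last run and fails now is a regression.
    return [r.name for r in current
            if previous.get(r.name) == Verdict.PASS
            and r.verdict != Verdict.PASS]
```

The useful property is that a run produces per-case verdicts split by layer, so a localhost-in-the-cloud failure shows up as `SYSTEM_FAIL` instead of silently dragging down a single aggregate score.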