What went wrong when I tried to evaluate an AI agent in production
I tried to evaluate an AI agent using a benchmark-style approach.

It failed in ways I didn't expect.

Instead of model quality issues, most failures came from system-level problems. A few examples from a small test suite:

- Broken URLs in tool calls → score dropped to 22
- Agent calling localhost in a cloud environment → got stuck at 46
- Real CVEs flagged as hallucinations → an evaluation issue, not a model issue
- Reddit blocking requests → external dependency failure
- Missing API key in production → silent failure

Each run surfaced a real bug, but not the kind I was originally trying to measure.

What surprised me is that evaluating agents isn't just about scoring outputs. It's about validating the entire system: tools, environment, data access, and how the agent interacts with all of it.

In other words, most of the failure modes looked more like software bugs than LLM mistakes.

This made me think that evaluation loops for agents should look more like software testing than benchmarking:
- repeatable test suites
- clear pass/fail criteria
- regression detection
- root cause analysis

Otherwise it's very easy to misattribute failures to the model when they're actually coming from somewhere else.

I ended up building a small tool to structure this process, but the bigger takeaway for me is how messy real-world agent evaluation actually is compared to standard benchmarks.

Curious how others are approaching this, especially in production settings.
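For concreteness, here is a minimal sketch of what such a test-suite-style eval loop might look like. All names here (`run_case`, `Case`, the fake agent) are hypothetical stand-ins, not the tool described above; the point is the structure: explicit pass/fail criteria per case, exceptions and empty outputs attributed to the system layer rather than the model, and regression detection against a previous run.

```python
# Sketch of an agent eval loop organized like a software test suite.
# Assumption: the agent is any callable that takes a prompt and returns text.
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Verdict(Enum):
    PASS = "pass"
    SYSTEM_FAIL = "system"   # broken tool, env misconfig, missing key
    MODEL_FAIL = "model"     # wrong or hallucinated output

@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]   # explicit pass/fail criterion

@dataclass
class Result:
    name: str
    verdict: Verdict
    detail: str

def run_case(agent: Callable[[str], str], case: Case) -> Result:
    try:
        output = agent(case.prompt)
    except Exception as e:
        # Tool/environment exceptions are system bugs, not model mistakes.
        return Result(case.name, Verdict.SYSTEM_FAIL, repr(e))
    if not output or not output.strip():
        # Empty output usually means a silent failure upstream.
        return Result(case.name, Verdict.SYSTEM_FAIL, "empty output")
    if not case.check(output):
        return Result(case.name, Verdict.MODEL_FAIL, output[:80])
    return Result(case.name, Verdict.PASS, "")

def detect_regressions(previous: dict, current: list) -> list:
    # A case that passed in the last run and fails now is a regression.
    return [r.name for r in current
            if previous.get(r.name) == Verdict.PASS
            and r.verdict != Verdict.PASS]
```

The useful property is that a run produces per-case verdicts split by layer, so a localhost-in-the-cloud failure shows up as `SYSTEM_FAIL` instead of silently dragging down a single aggregate score.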