A simple heuristic for agents: human-led vs. human-in-the-loop vs. agent-led

tl;dr: the more agency your agent has, the simpler your use case needs to be.

Most, if not all, successful production use cases today are either human-led or human-in-the-loop. Agent-led is possible, but it requires a simple use case.
---
Human-led:

An obvious example is ChatGPT. One input, one output. The model might suggest a follow-up question or use a tool, but ultimately you're the one in command.
---
Human-in-the-loop:

The best example is Cursor (and other coding tools). Coding tools can do 99% of the coding for you, use dozens of tools, and are incredibly capable. But ultimately the human still gives the requirements, hits "accept" or "reject", and gives feedback on each interaction turn.

That last point matters because it's a live recalibration.

Sometimes even that isn't enough, though. One example is the rollout of Sonnet 3.7 in Cursor. The mix of feedback loop and model agency was off: too much agency, not enough recalibration from the human. So users switched!
---
Agent-led:

This is where the agent leads the task end to end and the user is just a participant. It's difficult because there's less recalibration, so the probability of something going wrong increases with each turn... it's cumulative.

P(all good) = pⁿ

p = probability the agent does the right thing on a given turn

n = number of turns / interactions

Ok... I'm going to use my own product as an example, not to promote it, just because I know exactly how it works.

It's a chat agent that runs short customer interviews. My customers configure it around what they want to learn (e.g. why a customer churned) and send it to their customers.

It's agent-led because:

→ as soon as the respondent opens the link, they're guided from there

→ at each turn the agent (not the human) decides what to do next

That means deciding the right thing to do over 10 to 30 conversation turns, depending on configuration. For example, correctly deciding (there's a rough sketch of this loop after the list):

→ whether to broaden the conversation or dive deeper

→ how to reflect on current progress and context

→ how to traverse a set of objectives and ask questions that draw out insight (per the current objective)
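To make that per-turn decision loop concrete, here's a minimal, hypothetical sketch in Python. Every name in it (InterviewState, next_action, the toy "dive deeper on short answers" rule) is invented for illustration; it's not the product's actual logic, just the shape of an agent-led loop where the agent, not the human, picks the next move.

    # Hypothetical sketch of an agent-led interview turn loop (not the real product code).
    # The agent, not the human, decides each turn: dive deeper or move to the next objective.
    from dataclasses import dataclass, field

    @dataclass
    class InterviewState:
        objectives: list              # e.g. ["reason for churn", "what would win them back"]
        current: int = 0              # index of the objective currently being explored
        transcript: list = field(default_factory=list)

    def next_action(state: InterviewState, last_answer: str) -> str:
        # Toy decision rule: short answers get a follow-up (dive deeper),
        # longer answers advance the interview to the next objective.
        if last_answer and len(last_answer.split()) < 5:
            return f"Could you say more about {state.objectives[state.current]}?"
        state.current += 1
        if state.current >= len(state.objectives):
            return "Thanks, that's everything I needed!"
        return f"Next, tell me about {state.objectives[state.current]}."

    # Simulated respondent answers, just so the sketch runs end to end.
    answers = ["Too expensive.", "We moved to a cheaper plan that covers the features we actually use."]
    state = InterviewState(objectives=["reason for churn", "what would win them back"])
    question = f"Tell me about {state.objectives[0]}."
    for answer in answers:
        state.transcript.append((question, answer))
        question = next_action(state, answer)
    print(state.transcript)

The point of the sketch is just that every one of those decisions happens without a human in the loop, which is exactly where the compounding below comes from.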
Let's apply the formula above. Example:

Assume:

→ n = 20 (the number of conversation turns)

→ p = 0.99 (how often the agent does the right thing: 99% of the time)

Then P(all good) = 0.99²⁰ ≈ 0.82

So if I ran 100 such 20-turn conversations, I'd expect roughly 82 to complete as instructed and about 18 to stumble at least once.

Let's change p to 95%...

→ n = 20

→ p = 0.95

P(all good) = 0.95²⁰ ≈ 0.358

In other words, if I ran 100 such 20-turn conversations, I'd expect roughly 36 to finish without a hitch and about 64 to go off track at least once.
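If you want to sanity-check those numbers yourself, the compounding is a one-liner (the helper name is just for illustration):

    # How per-turn reliability compounds over an n-turn conversation: P(all good) = p**n.
    def p_all_good(p: float, n: int) -> float:
        return p ** n

    for p in (0.99, 0.95):
        print(f"p = {p:.2f}, n = 20 -> P(all good) = {p_all_good(p, 20):.3f}")
    # p = 0.99, n = 20 -> P(all good) = 0.818
    # p = 0.95, n = 20 -> P(all good) = 0.358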
My p is high. I had to strip out a bunch of tools and simplify, but I got there. And for my use case a failure is just a slightly irrelevant response, so it's manageable.
---
Conclusion:

Getting an agent to do the right thing 99% of the time is not trivial.

You basically can't have a super complicated workflow. Yes, you can mitigate this by introducing other agents to check the work, but that introduces latency.
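As a rough, hypothetical illustration of that latency tradeoff (these functions are stand-ins, not any particular framework's API): a checker pass means at least one extra model call on every turn.

    import time

    def worker(prompt: str) -> str:
        time.sleep(0.5)                      # stand-in for one model call
        return f"draft answer to: {prompt}"

    def checker(draft: str) -> bool:
        time.sleep(0.5)                      # a second model call on every turn
        return "answer" in draft             # toy acceptance rule

    def respond(prompt: str, use_checker: bool = True) -> str:
        draft = worker(prompt)
        if use_checker and not checker(draft):
            draft = worker(prompt)           # one retry if the checker rejects
        return draft

    start = time.time()
    respond("Why did you churn?", use_checker=True)
    print(f"with checker: {time.time() - start:.1f}s")   # roughly double the single-call latency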
There's always a tradeoff!

Know which category you're building in, and if you're going agent-led, narrow your use case as much as possible.