Ask HN: How do you keep LLM workflows from running out of control in production?

Author: HenryM12, 2 months ago
We've been pushing LLM-backed workflows into production and are starting to run into reliability edges that observability alone doesn't solve.

Things like:

- loops that don't terminate cleanly
- retries cascading across tool calls
- cost creeping up inside a single workflow
- agents making technically "allowed" but undesirable calls

Monitoring here is fine. We can see what's happening. The harder part is deciding where the enforcement boundary actually lives.

Right now, most of our shutdown paths still feel manual: feature flags, revoking keys, rate limiting upstream, etc.

Curious how others are handling these problems in practice:

- What's your enforcement unit? Tool call, workflow, container, something else?
- Do you have automated kill conditions? (I sketch what I mean below.)
- Did you build this layer internally?
- Did you have to revisit it multiple times as complexity increased?
- Does it get worse as workflows span more tools or services?

Would appreciate any concrete experiences from teams running agents in production. Really just trying to figure out how to scale.
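
For concreteness, here's the kind of kill-condition guard I have in mind. It's a minimal, hypothetical sketch (WorkflowGuard, budget_usd, and the allowlist are made-up names, not from any particular framework): a step cap, a per-workflow spend ceiling, a wall-clock deadline, and a tool allowlist, all checked before each tool call.

    import time

    class KillSwitch(Exception):
        """Raised when a workflow crosses an enforcement boundary."""

    class WorkflowGuard:
        """Per-workflow limits, checked before every tool call."""

        def __init__(self, max_steps, budget_usd, deadline_s, allowed_tools):
            self.max_steps = max_steps            # hard cap on tool calls
            self.budget_usd = budget_usd          # per-workflow spend ceiling
            self.deadline = time.monotonic() + deadline_s
            self.allowed_tools = set(allowed_tools)
            self.steps = 0
            self.spent_usd = 0.0

        def check(self, tool_name, est_cost_usd):
            """Raise KillSwitch instead of letting the workflow continue."""
            self.steps += 1
            self.spent_usd += est_cost_usd
            if self.steps > self.max_steps:
                raise KillSwitch(f"step cap {self.max_steps} exceeded")
            if self.spent_usd > self.budget_usd:
                raise KillSwitch(f"budget ${self.budget_usd:.2f} exceeded")
            if time.monotonic() > self.deadline:
                raise KillSwitch("wall-clock deadline exceeded")
            if tool_name not in self.allowed_tools:
                raise KillSwitch(f"tool {tool_name!r} not allowlisted")

    # Usage: one guard per workflow; the orchestrator catches KillSwitch
    # and tears the run down rather than trusting the agent to stop.
    guard = WorkflowGuard(max_steps=50, budget_usd=2.00, deadline_s=300,
                          allowed_tools={"search", "fetch_page"})
    guard.check("search", est_cost_usd=0.002)      # passes
    # guard.check("send_email", est_cost_usd=0.0)  # would raise KillSwitch

The point of the shape, for us, is that termination lives outside the model loop: anything that trips a limit raises, and teardown is the orchestrator's problem, not the prompt's. Curious whether others enforce this at the tool-call level like this or further out (container, queue, gateway).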