Ask HN: How do you keep LLM workflows from running out of control in production?

Author: HenryM12, 2 months ago
We've been pushing LLM-backed workflows into production and are starting to run into reliability edges that observability alone doesn't solve.

Things like:

- loops that don't terminate cleanly
- retries cascading across tool calls
- cost creeping up inside a single workflow
- agents making technically "allowed" but undesirable calls

Monitoring here is fine. We can see what's happening. The harder part is deciding where the enforcement boundary actually lives.

Right now, most of our shutdown paths still feel manual: feature flags, revoking keys, rate limiting upstream, etc.

Curious how others are handling these problems in practice:

- What's your enforcement unit? Tool call, workflow, container, something else?
- Do you have automated kill conditions? (I sketch what I mean below.)
- Did you build this layer internally?
- Did you have to revisit it multiple times as complexity increased?
- Does it get worse as workflows span more tools or services?

Would appreciate any concrete experiences from teams running agents in production. Really just trying to figure out how to scale.
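
For concreteness, here's the kind of kill-condition guard I have in mind. It's a minimal, hypothetical sketch (WorkflowGuard, budget_usd, and the allowlist are made-up names, not from any particular framework): a step cap, a per-workflow spend ceiling, a wall-clock deadline, and a tool allowlist, all checked before each tool call.

    import time

    class KillSwitch(Exception):
        """Raised when a workflow crosses an enforcement boundary."""

    class WorkflowGuard:
        """Per-workflow limits, checked before every tool call."""

        def __init__(self, max_steps, budget_usd, deadline_s, allowed_tools):
            self.max_steps = max_steps            # hard cap on tool calls
            self.budget_usd = budget_usd          # per-workflow spend ceiling
            self.deadline = time.monotonic() + deadline_s
            self.allowed_tools = set(allowed_tools)
            self.steps = 0
            self.spent_usd = 0.0

        def check(self, tool_name, est_cost_usd):
            """Raise KillSwitch instead of letting the workflow continue."""
            self.steps += 1
            self.spent_usd += est_cost_usd
            if self.steps > self.max_steps:
                raise KillSwitch(f"step cap {self.max_steps} exceeded")
            if self.spent_usd > self.budget_usd:
                raise KillSwitch(f"budget ${self.budget_usd:.2f} exceeded")
            if time.monotonic() > self.deadline:
                raise KillSwitch("wall-clock deadline exceeded")
            if tool_name not in self.allowed_tools:
                raise KillSwitch(f"tool {tool_name!r} not allowlisted")

    # Usage: one guard per workflow; the orchestrator catches KillSwitch
    # and tears the run down rather than trusting the agent to stop.
    guard = WorkflowGuard(max_steps=50, budget_usd=2.00, deadline_s=300,
                          allowed_tools={"search", "fetch_page"})
    guard.check("search", est_cost_usd=0.002)      # passes
    # guard.check("send_email", est_cost_usd=0.0)  # would raise KillSwitch

The point of the shape, for us, is that termination lives outside the model loop: anything that trips a limit raises, and teardown is the orchestrator's problem, not the prompt's. Curious whether others enforce this at the tool-call level like this or further out (container, queue, gateway).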