Launch HN: Cekura (YC F24) - Testing and monitoring for voice and chat AI agents

by atarus, 2 months ago
Hey HN - we're Tarush, Sidhant, and Shashij from Cekura (https://www.cekura.ai). We've been running voice agent simulation for 1.5 years, and recently extended the same infrastructure to chat. Teams use Cekura to simulate real user conversations, stress-test prompts and LLM behavior, and catch regressions before they hit production.

The core problem: you can't manually QA an AI agent. When you ship a new prompt, swap a model, or add a tool, how do you know the agent still behaves correctly across the thousands of ways users might interact with it? Most teams resort to manual spot-checking (doesn't scale), waiting for users to complain (too late), or brittle scripted tests.

Our answer is simulation: synthetic users interact with your agent the way real users do, and LLM-based judges evaluate whether it responded correctly - across the full conversational arc, not just single turns.
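Here's a rough sketch of the shape of that loop - illustrative Python using the OpenAI client, with placeholder model and prompt names. It's not our actual API, just the idea: one LLM plays the synthetic user, and a judge scores the whole transcript rather than individual turns.

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"  # placeholder model

    def llm(system, messages):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": system}] + messages,
        )
        return resp.choices[0].message.content

    def simulate(agent_prompt, persona, max_turns=8):
        """A synthetic user (an LLM playing `persona`) drives the conversation."""
        transcript = []
        for _ in range(max_turns):
            # The synthetic user sees the transcript with roles flipped, so the
            # agent's replies look like incoming "user" messages to it.
            flipped = [
                {"role": "user" if m["role"] == "assistant" else "assistant",
                 "content": m["content"]}
                for m in transcript
            ]
            transcript.append({"role": "user", "content": llm(persona, flipped)})
            transcript.append({"role": "assistant",
                               "content": llm(agent_prompt, transcript)})
        return transcript

    def judge(transcript, rubric):
        """Score the WHOLE conversation against a rubric, not turn by turn."""
        convo = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
        verdict = llm(
            "You are a strict QA judge. Reply PASS or FAIL, then a reason.",
            [{"role": "user",
              "content": f"Rubric:\n{rubric}\n\nConversation:\n{convo}"}],
        )
        return verdict.strip().upper().startswith("PASS")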
Three things make this actually work:

1. Scenario generation + real conversation import - Our scenario generation agent bootstraps your test suite from a description of your agent. But real users find paths no generator anticipates, so we also ingest your production conversations and automatically extract test cases from them. Your coverage evolves as your users do.

2. Mock tool platform - Agents call tools. Running simulations against real APIs is slow and flaky. Our mock tool platform lets you define tool schemas, behavior, and return values, so simulations exercise tool selection and decision-making without touching production systems. (There's a rough sketch of what a mock tool could look like right after this list.)

3. Deterministic, structured test cases - LLMs are stochastic. A CI test that passes "most of the time" is useless. Rather than free-form prompts, our evaluators are defined as structured conditional action trees: explicit conditions that trigger specific responses, with support for fixed messages when word-for-word precision matters. This means the synthetic user behaves consistently across runs - same branching logic, same inputs - so a failure is a real regression, not noise. (The second sketch below shows one way such a tree could be encoded.)
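As promised, a hypothetical sketch of a mock tool. The schema follows the common JSON-Schema tool format, but the names and the dispatch mechanism are invented for illustration - this is not our configuration format:

    # The schema mirrors what the agent sees in production; the behavior is
    # canned and deterministic. All names here are made up.
    BOOK_APPOINTMENT_SCHEMA = {
        "name": "book_appointment",
        "description": "Book a slot for the caller.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string"},
                "time": {"type": "string"},
            },
            "required": ["date", "time"],
        },
    }

    def mock_book_appointment(date, time):
        # A fixed failure branch lets tests exercise how the agent handles
        # rejection, without ever touching a real booking system.
        if date == "2025-12-25":
            return {"status": "error", "reason": "no availability"}
        return {"status": "confirmed", "confirmation_id": "MOCK-1234"}

    MOCKS = {"book_appointment": mock_book_appointment}

    def dispatch(tool_call):
        """Route the agent's tool call to its mock instead of a real API."""
        # Assumes the call's arguments have already been parsed into a dict.
        return MOCKS[tool_call["name"]](**tool_call["arguments"])

And one way a conditional action tree could be encoded - again an invented encoding to show the idea, not our format. Explicit conditions pick a branch, fixed messages control wording, and an unexpected agent response becomes an explicit failure instead of noise:

    # Each node: a condition on the agent's last message, a fixed reply for
    # the matching branch, and an explicit failure for the other branch.
    ACTION_TREE = {
        "condition": "asks for date of birth",
        "if_true": {
            "say": "March 3rd, 1990",  # fixed message: word-for-word precision
            "next": {
                "condition": "asks for phone number",
                "if_true": {"say": "555-0142", "next": None},
                "if_false": {"fail": "agent skipped phone verification"},
            },
        },
        "if_false": {"fail": "agent skipped DOB verification"},
    }

    def matches(condition, agent_msg):
        # Toy matcher for the sketch; a real system would use something
        # sturdier (regexes, a classifier) to keep branching deterministic.
        keyword = {"asks for date of birth": "birth",
                   "asks for phone number": "phone"}[condition]
        return keyword in agent_msg.lower()

    def step(node, agent_msg):
        """Take one deterministic branch: same input, same path, every run."""
        if matches(node["condition"], agent_msg):
            return node["if_true"]
        return node["if_false"]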
Cekura also monitors your live agent traffic. The obvious alternative here is a tracing platform like Langfuse or LangSmith - and they're great tools for debugging individual LLM calls. But conversational agents have a different failure mode: the bug isn't in any single turn, it's in how turns relate to each other. Take a verification flow that requires name, date of birth, and phone number before proceeding - if the agent skips asking for DOB and moves on anyway, every individual turn looks fine in isolation. The failure only becomes visible when you evaluate the full session as a unit. Cekura is built around this from the ground up.

Where tracing platforms evaluate turn by turn, Cekura evaluates the full session. Imagine a banking agent where the user fails verification in step 1, but the agent hallucinates and proceeds anyway. A turn-based evaluator sees step 3 (address confirmation) and marks it green - the right question was asked. Cekura's judge sees the full transcript and flags the session as failed because verification never succeeded.
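To illustrate why only a session-level check can catch this, here's a toy sketch (not our evaluator - the keyword matching stands in for a real parser or LLM judge):

    def session_passed(transcript):
        """Fail if the agent reached address confirmation (step 3) before all
        three identity fields were collected - a failure no turn-level check
        can see, because each turn looks fine on its own."""
        collected = set()
        for msg in transcript:
            text = msg["content"].lower()
            if msg["role"] == "user":
                # Naive keyword extraction, for illustration only.
                if "name" in text:
                    collected.add("name")
                if "birth" in text:
                    collected.add("dob")
                if "phone" in text:
                    collected.add("phone")
            elif "address" in text and {"name", "dob", "phone"} - collected:
                # Agent proceeded to address confirmation even though
                # verification never completed: fail the whole session.
                return False
        return True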
Try us out at https://www.cekura.ai - 7-day free trial, no credit card required. Paid plans start at $30/month.

We also put together a product video if you'd like to see it in action: https://www.youtube.com/watch?v=n8FFKv1-nMw. The first minute covers quick onboarding; if you want to jump straight to the results, skip to 8:40.

Curious what the HN community is doing - how are you testing behavioral regressions in your agents? What failure modes have hurt you most? Happy to dig in below!