Launch HN: Augento (YC W25) – Fine-tune your agents with reinforcement learning

Posted by lmeierhoefer

Hi HN, we're the cofounders of Augento (https://augento.ai/). We're building DeepSeek R1-like fine-tuning as a service. You connect your agent, tell us when it's right or wrong, and we deliver an LLM optimized for that agent. There's a demo video at https://www.youtube.com/watch?v=j5RQaTdRrKE, and our docs are at https://docs.augento.ai/. It's open for anyone to use at https://augento.ai.

Agents fail all the time, especially when you try to use them for something actually useful. Current solution approaches suck: prompting has intrinsic limits, and supervised fine-tuning requires big explicit datasets that are hard to collect.

Two months ago, the DeepSeek R1 paper outlined a way to post-train LLMs with (almost) pure reinforcement learning. We took up their research and built a fine-tuning platform around it.

You let us intercept your agent's data flow, and we deliver a fine-tuned open-source model trained on the agent's specific task. Instead of providing big datasets of explicit fine-tuning samples, you provide a reward function that judges the model's outputs.
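Concretely, a reward function is just a function from the model's output (plus whatever context you want to consider) to a score the trainer can optimize against. As a minimal sketch in the spirit of the coding-agent example below (the name and signature are placeholders for illustration, not our exact interface), a compiler-based reward could look like this:

```python
# Minimal illustrative sketch; the signature is a placeholder, not our
# exact interface. The idea is to turn "did the agent's code even
# compile?" into a number the RL trainer can optimize.
def reward(prompt: str, completion: str) -> float:
    """Return 1.0 if the completion parses as valid Python, else 0.0."""
    try:
        # Syntax check only; nothing is executed.
        compile(completion, "<agent output>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0
```

In practice you would usually go further than a syntax check, e.g. run a test suite, grade tool calls, or score end-to-end task completion, but anything you can reduce to a number works as a training signal.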
Here are examples of what this can be used for:

Coding Agent: We fine-tuned a coding agent that was constantly making syntax errors and failed to handle semantic edge cases properly. By providing a reward function that evaluated code against the compiler, the agent learned not to produce these errors. The fine-tuned model reduced critical bugs by 40% with just 20 training samples.

MCP Tool Specialization: Imagine you have a custom set of internal tools using the MCP protocol, but your agent keeps selecting the wrong tool or passing incompatible parameters. You could fine-tune with a reward function that scores tool selection and parameter matching.

Browser Agent Navigation: If you're building a browser agent that struggles with complex web UIs or specific sites, you could fine-tune it to better understand UI elements and navigation patterns. With a reward function that scores successful task completion (like "find the best price for this product" or "complete this multi-step form"), you could train an agent that better identifies clickable elements, understands form validation errors, and navigates complex SPAs without getting stuck.

VLA Robot Control: If you're using vision-language models to control robotic arms or other hardware, you could fine-tune for your specific actuator setup. With a reward function based on high-level task completion, you could train a Vision-Language-Action (VLA) model that translates natural language commands like "move the red block behind the blue cylinder" into actuator controls for your specific hardware.

As you can see from these examples, the current paradigm is best suited for "verifiable domains", where it is possible to give an explicit function that judges the model's outputs. However, up next we will also support an "alignment mode", where you don't provide a reward function but instead give high-level feedback on your agent's past failure runs. Just tag where things went wrong, and we'll handle the rest. This makes it even easier to improve your agents without needing to write formal reward functions.

Our platform is not itself open source, but it fine-tunes open-source language models. In other words, it's an alternative to OpenAI's reinforcement fine-tuning API, but with Qwen, Llama, DeepSeek, etc., and more customizability on the reward model. We charge users for training and for inference/interaction with the model later on ($0 monthly flat fee + training cost + inference cost).

The platform is self-serve and open to use at https://augento.ai/dashboard. We'll give you $20 in training credits, which should be enough to connect your agent and see some observable improvement on your use case.

We'd love to hear your thoughts and feedback!