Ask HN: What tools are you using for AI evals? Everything feels immature.
We're running LLMs in production for content generation, customer support, and code review assistance. We've been trying to build a proper evaluation pipeline for months, but every tool we've tested has significant limitations.
What we've evaluated:
- OpenAI's Evals framework: Works well for benchmarking but challenging for custom use cases. Configuration through YAML files can get complex, and extending functionality requires diving deep into their codebase. Primarily designed for batch processing rather than real-time monitoring.
- LangSmith: Strong tracing capabilities, but the eval features feel secondary to their observability focus. Pricing starts at $0.50 per 1k traces after the free tier, which adds up quickly at high volume. The UI can be slow with larger datasets.
- Weights & Biases: Powerful platform, but designed primarily for traditional ML experiment tracking. Setup is complex and requires significant ML expertise. Our product team struggles to use it effectively.
- Humanloop: Clean interface focused on prompt versioning, with basic evaluation capabilities. Limited eval types available, and pricing is steep for the feature set.
- Braintrust: Interesting approach to evaluation, but it feels like an early-stage product. Documentation is sparse and integration options are limited.
What we actually need:
- Real-time eval monitoring (not just batch)
- Custom eval functions that don't require PhD-level setup (rough sketch of what we mean after this list)
- Human-in-the-loop workflows for subjective tasks
- Cost tracking per model/prompt
- Integration with our existing observability stack
- Something our product team can actually use
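To make "custom eval functions" and "cost tracking per model/prompt" concrete, here is roughly the amount of ceremony we'd consider reasonable. This is a minimal, vendor-neutral Python sketch; the model name, prompt ID, and per-token prices are illustrative, not any particular tool's API:

    # Hypothetical shape of a "simple" custom eval plus per-model/prompt cost accounting.
    from dataclasses import dataclass

    # Illustrative prices in USD per 1M (input, output) tokens; real prices vary and change.
    PRICE_PER_M_TOKENS = {"small-model": (0.15, 0.60)}

    @dataclass
    class Completion:
        model: str
        prompt_id: str
        output: str
        input_tokens: int
        output_tokens: int

    def offers_human_escalation(c: Completion) -> float:
        """Toy custom eval: 1.0 if a support reply offers to hand off to a human."""
        return 1.0 if "connect you with a human" in c.output.lower() else 0.0

    def cost_usd(c: Completion) -> float:
        """Estimated cost of one call, attributable to (model, prompt_id)."""
        price_in, price_out = PRICE_PER_M_TOKENS[c.model]
        return (c.input_tokens * price_in + c.output_tokens * price_out) / 1_000_000

    c = Completion("small-model", "support-v3", "I can connect you with a human agent.", 820, 45)
    print(offers_human_escalation(c), round(cost_usd(c), 6))

The hard part isn't writing functions like these; it's running them continuously against production traffic and surfacing the results somewhere the product team will actually look.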
Current solution:
Custom scripts + monitoring dashboards for basic metrics. Weekly manual reviews in spreadsheets. It works, but it doesn't scale and we miss edge cases.
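For context, the weekly pass is not much more sophisticated than the following simplified sketch (log format, field names, and checks are illustrative): sample recent completions from a JSONL log, run a few cheap heuristics, and dump a CSV for spreadsheet review.

    # Simplified sketch of a weekly review script (illustrative log format and checks):
    # sample recent completions, run cheap heuristic checks, write a CSV for the spreadsheet.
    import csv, json, random

    def length_ok(output, max_chars=2000):
        return len(output) <= max_chars

    def no_apology_loop(output):
        # Crude heuristic for a repeated-apology failure mode in support replies.
        return output.lower().count("i apologize") < 3

    rows = [json.loads(line) for line in open("completions.jsonl")]
    sample = random.sample(rows, min(200, len(rows)))

    with open("weekly_review.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["model", "prompt_id", "length_ok", "no_apology_loop", "output_preview"])
        for r in sample:
            out = r["output"]
            w.writerow([r["model"], r["prompt_id"], length_ok(out), no_apology_loop(out), out[:300]])

Heuristics like these catch the obvious regressions; the subjective failures are exactly what still needs the manual review.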
Has anyone found tools that handle production LLM evaluation well? Are we expecting too much, or is the tooling genuinely immature? Especially interested in hearing from teams without dedicated ML engineers.