Hacker News (Chinese edition)



So, we process data as well as documents from various sources, then:

 - convert all of their text (using different OCR engines)
 - pass it to LLMs; depending on the customer, it can be a cheaper model, and we do have model fallbacks

How do engineers evaluate such systems?

 1. New models and new libraries are coming out all the time.
 2. Even a third party's deployed model will change over time and might improve or regress our systems.

Any good approach for writing evaluations for these?
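To make the setup concrete, here is a minimal Python sketch of a fallback chain plus a fixed-case check that could be rerun whenever a model or library changes. Everything in it is hypothetical: `run_with_fallbacks`, `accuracy`, the stub models, and the exact-match scoring are illustrative assumptions, not taken from any real system or library.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a "model" is any callable mapping OCR'd text to output text.
Model = Callable[[str], str]


def run_with_fallbacks(text: str, models: Sequence[Model]) -> str:
    """Try each model in order and return the first successful output."""
    last_error = None
    for model in models:
        try:
            return model(text)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError("all models failed") from last_error


def accuracy(models: Sequence[Model], cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (ocr_text, expected) cases whose output matches exactly."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        run_with_fallbacks(text, models).strip() == expected
        for text, expected in cases
    )
    return hits / len(cases)


# Stub models standing in for real LLM calls (assumptions for illustration).
def flaky_primary(text: str) -> str:
    raise TimeoutError("primary unavailable")


def cheap_fallback(text: str) -> str:
    return text.upper()


# A small fixed set of inputs with known-good outputs.
cases = [
    ("invoice 42", "INVOICE 42"),
    ("total due", "TOTAL DUE"),
    ("net 30", "Net 30"),  # deliberately fails under the stub fallback
]
score = accuracy([flaky_primary, cheap_fallback], cases)
```

Exact string matching is the crudest possible scorer and is only used here to keep the sketch short; with non-deterministic outputs one would likely swap in a looser comparison and track the score over time rather than expect a fixed value.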

Ask HN: How do engineers evaluate non-deterministic ML/LLM-based deployments?