Ask HN: How do you programmatically evaluate whether an LLM sounds "too AI"?

1 point by shubhamoriginx 3 days ago | original post
Hi HN,

I'm currently building Aaptics, a tool designed to help founders draft content. The biggest engineering challenge hasn't been the infrastructure, but getting the underlying models to stop sounding like a corporate robot (e.g., stopping them from using words like "delve", "testament", or "in today's fast-paced landscape").

Right now, my pipeline uses a custom RAG setup that ingests a user's past writing, combined with heavy negative prompting and few-shot examples. However, the model still occasionally slips into that recognizable "ChatGPT tone."

For those of you building AI applications, how are you quantitatively evaluating the "humanness" of your outputs?

Are you using LLM-as-a-judge frameworks?

Relying on specific temperature/top_p tweaking?

Or hardcoding penalties for certain n-grams?

I'm aiming to finalize this pipeline before our mid-April launch and would appreciate any insights from folks who have solved this in production. aaptics.in/waitlist
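For what it's worth, the simplest version of the hardcoded n-gram penalty mentioned above can be sketched as a phrase-frequency score: count occurrences of known "AI-sounding" phrases and normalize by word count. This is a hypothetical illustration, not the author's pipeline; the phrase list, function names, and threshold are all made up for the example.

```python
import re

# Illustrative list of phrases commonly flagged as "AI tone";
# a real deployment would curate and expand this.
BANNED_PHRASES = [
    "delve",
    "testament to",
    "in today's fast-paced",
    "it's important to note",
    "game-changer",
]

def ai_tone_score(text: str) -> float:
    """Return banned-phrase hits per 100 words (lower reads more human)."""
    words = re.findall(r"[\w']+", text.lower())
    if not words:
        return 0.0
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in BANNED_PHRASES)
    return 100.0 * hits / len(words)

def sounds_too_ai(text: str, threshold: float = 1.0) -> bool:
    """Flag drafts whose banned-phrase density exceeds the threshold."""
    return ai_tone_score(text) > threshold
```

A score like this is cheap enough to run on every generation and can gate a retry or a rewrite pass; it obviously misses paraphrased slop, which is where an LLM-as-a-judge layer would complement it.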