Ask HN: How do you programmatically evaluate whether an LLM sounds "too AI"?
Hi HN,

I'm currently building Aaptics, a tool designed to help founders draft content. The biggest engineering challenge hasn't been the infrastructure, but getting the underlying models to stop sounding like a corporate robot (e.g., stopping them from using words like "delve", "testament", or "in today's fast-paced landscape").

Right now, my pipeline uses a custom RAG setup that ingests a user's past writing, combined with heavy negative prompting and few-shot examples. However, the model still occasionally slips into that recognizable "ChatGPT tone."
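For concreteness, the prompt-assembly step described above can be sketched roughly like this. This is a simplified illustration, not the actual Aaptics code: the banned-phrase list, function name, and prompt wording are all assumptions.

```python
# Illustrative sketch: combine negative prompting (a banned-phrase list)
# with few-shot examples retrieved from the user's past writing.
# BANNED_PHRASES and build_prompt are hypothetical names for this example.

BANNED_PHRASES = ["delve", "testament", "in today's fast-paced landscape"]

def build_prompt(task: str, retrieved_samples: list[str]) -> str:
    """Assemble a system prompt from a ban list plus retrieved style samples."""
    negative = "Never use these words or phrases: " + ", ".join(BANNED_PHRASES)
    shots = "\n\n".join(
        f"Example of the user's voice:\n{sample}" for sample in retrieved_samples
    )
    return f"{negative}\n\n{shots}\n\nTask: {task}"
```

The weakness of this approach, as noted above, is that a ban list only blocks phrases you have already thought of; the model can still drift into an unmistakably AI cadence using words that never appear on the list.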
For those of you building AI applications, how are you quantitatively evaluating the "humanness" of your outputs?

Are you using LLM-as-a-judge frameworks?

Relying on specific temperature/top_p tweaks?

Or hardcoding penalties for certain n-grams?
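To make the last option concrete, here is a minimal sketch of a hardcoded n-gram penalty: score a draft by counting known "AI tells" and normalizing by word count. The phrase list and the per-100-words weighting are illustrative assumptions, not a vetted lexicon.

```python
import re

# Hypothetical list of phrases that read as "AI tells"; extend as needed.
AI_TELLS = [
    "delve",
    "testament to",
    "in today's fast-paced",
    "it's important to note",
]

def ai_tone_score(text: str) -> float:
    """Return penalty hits per 100 words; lower suggests more human-sounding text."""
    lower = text.lower()
    hits = sum(lower.count(phrase) for phrase in AI_TELLS)
    words = max(len(re.findall(r"\w+", text)), 1)  # avoid division by zero
    return 100.0 * hits / words
```

A metric like this is cheap enough to run on every generation and could gate a retry loop, though it shares the ban list's blind spot: it only measures the tells you enumerate, which is partly why I'm curious whether people pair it with an LLM-as-a-judge pass.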
I'm aiming to finalize this pipeline before our mid-April launch and would appreciate any insights from folks who have solved this in production. aaptics.in/waitlist