Ask HN: How do you programmatically evaluate whether an LLM sounds "too AI"?

1 point by shubhamoriginx 3 days ago | original post
Hi HN,

I'm currently building Aaptics, a tool designed to help founders draft content. The biggest engineering challenge hasn't been the infrastructure, but getting the underlying models to stop sounding like a corporate robot (e.g., stopping them from using words like "delve", "testament", or "in today's fast-paced landscape").

Right now, my pipeline uses a custom RAG setup that ingests a user's past writing, combined with heavy negative prompting and few-shot examples. However, the model still occasionally slips into that recognizable "ChatGPT tone."

For those of you building AI applications, how are you quantitatively evaluating the "humanness" of your outputs?

Are you using LLM-as-a-judge frameworks?

Relying on specific temperature/top_p tweaking?

Or hardcoding penalties for certain n-grams?

I'm aiming to finalize this pipeline before our mid-April launch and would appreciate any insights from folks who have solved this in production. aaptics.in/waitlist
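For what it's worth, the simplest version of the hardcoded n-gram penalty mentioned above can be sketched as a phrase-frequency score: count occurrences of known "AI-sounding" phrases and normalize by word count. This is a hypothetical illustration, not the author's pipeline; the phrase list, function names, and threshold are all made up for the example.

```python
import re

# Illustrative list of phrases commonly flagged as "AI tone";
# a real deployment would curate and expand this.
BANNED_PHRASES = [
    "delve",
    "testament to",
    "in today's fast-paced",
    "it's important to note",
    "game-changer",
]

def ai_tone_score(text: str) -> float:
    """Return banned-phrase hits per 100 words (lower reads more human)."""
    words = re.findall(r"[\w']+", text.lower())
    if not words:
        return 0.0
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in BANNED_PHRASES)
    return 100.0 * hits / len(words)

def sounds_too_ai(text: str, threshold: float = 1.0) -> bool:
    """Flag drafts whose banned-phrase density exceeds the threshold."""
    return ai_tone_score(text) > threshold
```

A score like this is cheap enough to run on every generation and can gate a retry or a rewrite pass; it obviously misses paraphrased slop, which is where an LLM-as-a-judge layer would complement it.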