I got tired of messy LLM skills, so I built my own, with regression tests

3 | Author: iliaov | 25 days ago
I've recently tried skills like Garry Tan's GStack, spent a week with it, and realized it has some flaws (I'll post separately about that).

Here's my problem: how do I know if a skill or prompt is any good (e.g. GStack's /office-hours)?

How do I compare similar skills (e.g. different "deep research" skills)?

Spotting broken software is (relatively) easy: it crashes or prints errors. Broken skills don't. Perfectly polished, confident-sounding skills routinely mislead me and waste my time, to the point where I wish I weren't using an LLM at all.

AI skills are software, and they should come with regression tests.

LLM teams have tons of prompt regression tests. LLM-wrapper SaaS companies have tons of prompt regression tests. But when it comes to open-source skills, the SKILL.md reads as reasonable, yet ships with zero tests (e.g. GStack's /office-hours has none at the time of writing).

Garry Tan, if you hear me: please consider shipping regression tests for your /office-hours, /plan-ceo-review, /plan-eng-review, and so on.

Regression tests should:

1. Prove the skill works correctly
2. Demonstrate correct and incorrect usage
3. Prove the skill's value
4. Come with a scoring rubric to allow skill benchmarking

The last one is the most valuable, because it lets you benchmark similar skills against each other.

So I started doing this myself.

Here's a work-in-progress example: plan-cmo-review, a skill to complement GStack, since GStack is missing a marketing review at the time of writing. I'm not a marketing guy; the point of sharing this skill is to outline its regression setup.

Briefly, here's how my exploration progressed:

- I used GStack on a couple of products and realized the resulting design_document.md was leading me to failure, mainly marketing-wise.
- I dug into the skill's failures manually with Claude Opus 4.8's help and ended up finding the correct solution.
- I asked Claude to build a plan-cmo-review skill, ran it, and it arrived at a flawed solution (similar to GStack's output).
- I gave Claude the correct (manual) solution to analyze and add as a regression fixture with a scoring rubric.
- Claude ran the (blind) regression, and it failed. We iterated several times and found the key problem: Claude was implicitly trusting my prompts as the ultimate truth. Claude believed GStack knew what it was doing. GStack believed I knew what I was doing. But I was doing product/startup research, and by definition, "research" is what you do when you don't know what you're doing. That trust chain is what broke the skills.
- We fixed the trust problem and the regression test passed. We added a few more. They passed.
- I had Claude run the regressions multiple times, and cracks appeared. Claude iterated on the skill. Now they pass.
- This methodology is still flawed. I'd like to try running different LLMs, cross-model judging, and a lot more regression tests.

Skill: github.com/remakeai/plan-cmo-review
Notes: iliaov.substack.com
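For readers who want a concrete picture, here is a minimal sketch of the kind of setup the post describes: a fixture derived from a known-good answer, a weighted scoring rubric, and repeated runs to expose flaky passes. It is not the actual plan-cmo-review harness. Every name in it (`Criterion`, `Fixture`, `score`, `pass_rate`, the two stub skills) is hypothetical, and the keyword checks are a crude stand-in for what would more plausibly be a human or LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a named, weighted check on the skill's output."""
    name: str
    weight: float
    check: Callable[[str], bool]  # True if the output satisfies the criterion


@dataclass(frozen=True)
class Fixture:
    """A regression case: an input brief plus a rubric built from a known-good answer."""
    name: str
    brief: str
    rubric: List[Criterion]
    pass_threshold: float = 0.8


def score(output: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the output satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def pass_rate(skill: Callable[[str], str], fixture: Fixture, runs: int = 5) -> float:
    """Run the skill several times; LLM output varies, so one pass proves little."""
    passes = 0
    for _ in range(runs):
        output = skill(fixture.brief)
        if score(output, fixture.rubric) >= fixture.pass_threshold:
            passes += 1
    return passes / runs


# Hypothetical fixture: the brief contains an unverified claim the skill should challenge.
fixture = Fixture(
    name="marketing-review-basic",
    brief="We are sure developers will pay $50/month. Plan the launch.",
    rubric=[
        Criterion("questions the pricing assumption", 2.0,
                  lambda out: "assumption" in out.lower()),
        Criterion("names a target audience", 1.0,
                  lambda out: "audience" in out.lower()),
        Criterion("proposes a way to validate demand", 1.0,
                  lambda out: "validate" in out.lower()),
    ],
    pass_threshold=0.75,
)


# Stub skills standing in for real LLM calls (assumption: a skill maps a brief to text).
def trusting_skill(brief: str) -> str:
    return "Great plan. Target audience: developers. Launch on day one."


def skeptical_skill(brief: str) -> str:
    return ("The $50/month price is an untested assumption. "
            "Target audience: developers. Validate demand with 10 interviews first.")


print(score(trusting_skill(fixture.brief), fixture.rubric))   # 0.25
print(score(skeptical_skill(fixture.brief), fixture.rubric))  # 1.0
print(pass_rate(trusting_skill, fixture, runs=3))             # 0.0
print(pass_rate(skeptical_skill, fixture, runs=3))            # 1.0
```

Because the score is a number and not a bare pass/fail, the same fixture can be run against two competing skills and the results compared directly, which is the benchmarking use the rubric point is about. The heaviest weight sits on challenging the brief's assumption, mirroring the trust-chain failure described above.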