Ask HN: What is behind the recent surge in LLM coding capabilities?

5 points by orange_puff, about 8 hours ago
It seems like we are in the midst of another AI hype cycle. Many people are calling the current coding models an "inflection point", where capabilities are now so high that future model growth will be explosive. I have heard serious people, like economics writer Noah Smith, make this argument [0].

But it's not just the commentariat. I have seen very serious people in software engineering and tech talk about how drastically their coding habits have changed.

Benchmarks [1] alone don't seem to capture everything, although there have been jumps in the agentic sections, so maybe they actually do.

My question is: what explains these big jumps in capabilities that many serious people seem to be noticing all at once? Is it simply that we have thrown enough data and compute at the models, or are labs instead fine-tuning models to get really good at tool calls, which leads to this new, surprising behavior?

When I explain agents to people, I usually walk them through a manual task one might go through when debugging code. You copy some code into ChatGPT, it asks you for more context, you copy some more code in, it suggests an edit, you make the edit and run the code, there is an error, so you paste that in, and so on. An agent is just an LLM in that loop that can use tools to do those things automatically. It would not shock me if taking a weaker model like Claude Opus 4.0 and making it 10x better at tool calls produced a much stronger and more impressive model. But is that all that is happening, or am I missing something big?

[0] https://substack.com/@noahpinion/p-187818379

[1] https://www.anthropic.com/news/claude-opus-4-6
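The debugging loop described in the post (paste code, get a suggested edit, run it, paste the error back) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any lab or product actually implements agents: `scripted_model` is a hypothetical stand-in for an LLM that follows a fixed script instead of calling a real API, and `run_code` is a hypothetical tool that runs a snippet with `exec` and feeds any error message back into the history, the way a person would paste it in by hand.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str      # which tool the model wants to use
    argument: str  # input for that tool (here: Python source)


@dataclass
class FinalAnswer:
    text: str


def run_code(source: str) -> str:
    """Hypothetical tool: execute a snippet and report 'ok' or the error."""
    try:
        exec(source, {})  # a real agent would sandbox this
        return "ok"
    except Exception as exc:
        return f"error: {type(exc).__name__}: {exc}"


TOOLS = {"run_code": run_code}

BUGGY = "total = sum([1, 2, '3'])"
FIXED = "total = sum([1, 2, 3])"


def scripted_model(history: list[str]):
    """Hypothetical LLM stand-in: picks its next action from the last message."""
    last = history[-1]
    if last.startswith("task:"):
        return ToolCall("run_code", BUGGY)  # first attempt has a bug
    if last.startswith("error:"):
        return ToolCall("run_code", FIXED)  # "suggests an edit" after seeing the error
    return FinalAnswer("The code now runs without errors.")


def agent_loop(task: str, model, max_steps: int = 5):
    """Ask the model for an action, run the tool, append the result, repeat."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(history)
        if isinstance(action, FinalAnswer):
            return action.text, history
        history.append(f"call {action.name}: {action.argument}")
        history.append(TOOLS[action.name](action.argument))
    return "stopped: step limit reached", history


if __name__ == "__main__":
    answer, history = agent_loop("make the snippet run", scripted_model)
    for entry in history:
        print(entry)
    print(answer)
```

Swapping `scripted_model` for a real model changes nothing about the loop itself; everything hinges on how reliably the model picks the right tool call and the right fix at each step, which is the capability the post asks about.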