Ask HN: At ~165,000 tokens, is Claude Opus 4.6 1M better than Opus 4.6 200k?

1 point, by consumer451, 3 days ago
Here is the question for which I cannot find an answer, and cannot yet afford to answer myself:

In Claude Code, I use Opus 4.6 1M, but stay under 250k via careful session management to avoid the known NoLiMa [0] / context rot [1] issues. The question I keep wanting answered, though: at ~165k tokens used, does Opus 1M actually deliver higher quality than Opus 200k? (I used ~165k to account for a token buffer and other overhead, but in theory it may as well be ~195k; the point is, at the limit of the Opus 200k deployment.)

NoLiMa would indicate that with a ~165k request, Opus 200k would perform poorly and Opus 1M would do better (since a lower percentage of its context window is used)... but they are the same model. However, there are practical inference-deployment differences that could change the whole picture, right? I am so confused.

Anthropic says it's the same model [2]. But Claude Code's own source treats them as distinct variants with separate routing [3]. The closest test I found [4] asserts they're identical below 200K, but it never actually runs an A/B test, correct?

Inside Claude Code it's probably not testable, right? According to this issue [5], the CLI is non-deterministic for identical inputs, and agent sessions branch on tool use. It would need a clean API-level test.

*The API-level test is what I really want, for the Claude-based features in my own apps. Is there a real benchmark for this?*

I have reached the limits of my understanding on this problem. If what I am trying to say makes any sense, any help would be greatly appreciated.

If anyone could help me ask the question better, that would also be appreciated.

[0] https://arxiv.org/abs/2502.05167

[1] https://research.trychroma.com/context-rot

[2] https://claude.com/blog/1m-context-ga

[3] https://github.com/anthropics/claude-code/issues/35545

[4] https://www.claudecodecamp.com/p/claude-code-1m-context-window

[5] https://github.com/anthropics/claude-code/issues/3370
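Edit: to make the question concrete, here is a minimal sketch of the kind of API-level A/B test I have in mind: bury one unique fact in ~165k tokens of filler and send the identical prompt to both deployments, differing only in the 1M beta flag. The model ID and the `context-1m-2025-08-07` beta value are my assumptions (that flag is documented for Sonnet's 1M beta; I have not confirmed it applies to Opus 4.6), so verify both against Anthropic's docs before trusting any numbers. The token-count estimate is also crude; a real run should use the SDK's token-counting endpoint.

```python
# Hypothetical needle-in-a-haystack A/B harness. NOT a confirmed methodology or
# confirmed model IDs -- just the shape of the test I think is needed.
import random

FILLER = "The quick brown fox jumps over the lazy dog. "  # very rough ~10 tokens/repeat

def build_prompt(needle: str, approx_tokens: int = 165_000, seed: int = 0) -> str:
    """Bury a unique fact at a random depth in ~approx_tokens of filler text."""
    rng = random.Random(seed)
    n_chunks = approx_tokens // 10  # crude tokens-per-sentence estimate
    chunks = [FILLER] * n_chunks
    chunks.insert(rng.randrange(n_chunks), needle + " ")
    return "".join(chunks) + "\nWhat is the secret code mentioned above? Reply with the code only."

def score(reply: str, expected: str) -> bool:
    """Exact-substring retrieval check."""
    return expected in reply

def run_pair(prompt: str) -> dict:
    # Requires ANTHROPIC_API_KEY. Model ID and beta flag below are guesses.
    import anthropic
    client = anthropic.Anthropic()
    out = {}
    out["200k"] = client.messages.create(
        model="claude-opus-4-6",  # assumed ID -- check the models list
        max_tokens=64,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    out["1m"] = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=64,
        betas=["context-1m-2025-08-07"],  # assumed; documented for Sonnet's 1M beta
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    return out

if __name__ == "__main__":
    needle = "The secret code is AZURE-7731."
    prompt = build_prompt(needle)
    for variant, reply in run_pair(prompt).items():
        print(variant, "PASS" if score(reply, "AZURE-7731") else "FAIL")
```

Repeating this over many seeds (i.e., needle depths) and comparing pass rates between the two variants at the same ~165k prompt is the A/B comparison I cannot find published anywhere.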