Ask HN: At ~165,000 tokens, is Claude Opus 4.6 1M better than Opus 4.6 200k?
Here is the question for which I cannot find an answer, and cannot yet afford to answer myself:

In Claude Code, I use Opus 4.6 1M, but stay under 250k via careful session management to avoid the known NoLiMa [0] / context rot [1] issues. The question I keep wanting answered, though: at ~165k tokens used, does Opus 1M actually deliver higher quality than Opus 200k? (I used ~165k to account for token buffers and other overhead, but in theory it may as well be ~195k; the point is: at the limit of the Opus 200k deployment.)

NoLiMa would suggest that with a ~165k request, Opus 200k would perform poorly and Opus 1M would do better (since a lower percentage of the context window is used)... but they are the same model. However, practical inference-deployment differences could change the whole picture, right? I am so confused.

Anthropic says they are the same model [2]. But Claude Code's own source treats them as distinct variants with separate routing [3]. The closest test I found [4] asserts they are identical below 200K, but it never actually runs an A/B test, correct?

Inside Claude Code it is probably not testable, right? According to this issue [5], the CLI is non-deterministic for identical inputs, and agent sessions branch on tool use. It would need a clean API-level test.

*The API-level answer is what I really want, for the Claude-based features in my own apps. Is there a real benchmark for this?*

I have reached the limits of my understanding on this problem. If what I am trying to say makes any sense, any help would be greatly appreciated.

If anyone could help me ask the question better, that would also be appreciated.

[0] https://arxiv.org/abs/2502.05167

[1] https://research.trychroma.com/context-rot

[2] https://claude.com/blog/1m-context-ga

[3] https://github.com/anthropics/claude-code/issues/35545

[4] https://www.claudecodecamp.com/p/claude-code-1m-context-window

[5] https://github.com/anthropics/claude-code/issues/3370
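For what it's worth, here is a minimal sketch of what such a clean API-level A/B test could look like: build a needle-in-a-haystack prompt near the 200k boundary and send the identical prompt through both deployments. The filler-based prompt builder below is deterministic and testable; the actual API call is left as a comment because the exact Opus model ID, the 1M-context beta flag, and the 4-characters-per-token sizing heuristic are all assumptions to verify against the current Anthropic docs, not confirmed values.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "  # ~11 tokens of padding

def build_needle_prompt(target_tokens: int, needle: str, depth: float = 0.5) -> str:
    """Build a haystack of roughly target_tokens (using a crude 4-chars-per-token
    heuristic) with `needle` inserted at fractional `depth`, followed by a
    retrieval question. Deterministic, so both deployments see identical input."""
    n_fillers = (target_tokens * 4) // len(FILLER)
    chunks = [FILLER] * n_fillers
    chunks.insert(int(len(chunks) * depth), needle + " ")
    return ("".join(chunks)
            + "\n\nWhat is the secret code mentioned above? "
              "Answer with the code only.")

def score(answer: str, secret: str) -> bool:
    """Did the model retrieve the needle?"""
    return secret in answer

# Hypothetical A/B loop (model ID and beta flag are assumptions; check the
# current Anthropic API docs before running, and repeat N times per arm since
# single samples are noisy):
#
# import anthropic
# client = anthropic.Anthropic()
# prompt = build_needle_prompt(165_000, "The secret code is ZETA-7734.")
# for label, extra in [("200k", {}), ("1M", {"betas": ["context-1m-2025-08-07"]})]:
#     resp = client.beta.messages.create(
#         model="claude-opus-4-6",  # assumed ID
#         max_tokens=64,
#         messages=[{"role": "user", "content": prompt}],
#         **extra,
#     )
#     print(label, score(resp.content[0].text, "ZETA-7734"))
```

Varying `depth` across runs (0.1, 0.5, 0.9) would also surface the positional effects that NoLiMa and the context-rot work describe, rather than testing a single needle position.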