Ask HN: At ~165,000 tokens, is Claude Opus 4.6 1M better than Opus 4.6 200k?
Here is the question for which I cannot find an answer, and cannot yet afford to answer myself:

In Claude Code, I use Opus 4.6 1M, but stay under 250k via careful session management to avoid the known NoLiMa [0] / context rot [1] issues. The question I keep wanting answered, though: at ~165k tokens used, does Opus 1M actually deliver higher quality than Opus 200k? (I used ~165k to account for token buffers and other overhead, but in theory it may as well be ~195k; the point is: at the limit of the Opus 200k deployment.)

NoLiMa would suggest that with a ~165k request, Opus 200k would perform poorly and Opus 1M would do better (since a lower percentage of the context window is used)... but they are the same model. However, practical inference-deployment differences could change the whole picture, right? I am so confused.

Anthropic says they are the same model [2]. But Claude Code's own source treats them as distinct variants with separate routing [3]. The closest test I found [4] asserts they are identical below 200K, but it never actually runs an A/B test, correct?

Inside Claude Code it is probably not testable, right? According to this issue [5], the CLI is non-deterministic for identical inputs, and agent sessions branch on tool use. It would need a clean API-level test.

*The API-level answer is what I really want, for the Claude-based features in my own apps. Is there a real benchmark for this?*

I have reached the limits of my understanding on this problem. If what I am trying to say makes any sense, any help would be greatly appreciated.

If anyone could help me ask the question better, that would also be appreciated.

[0] https://arxiv.org/abs/2502.05167

[1] https://research.trychroma.com/context-rot

[2] https://claude.com/blog/1m-context-ga

[3] https://github.com/anthropics/claude-code/issues/35545

[4] https://www.claudecodecamp.com/p/claude-code-1m-context-window

[5] https://github.com/anthropics/claude-code/issues/3370
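For what it's worth, here is a minimal sketch of what such a clean API-level A/B test could look like: build a needle-in-a-haystack prompt near the 200k boundary and send the identical prompt through both deployments. The filler-based prompt builder below is deterministic and testable; the actual API call is left as a comment because the exact Opus model ID, the 1M-context beta flag, and the 4-characters-per-token sizing heuristic are all assumptions to verify against the current Anthropic docs, not confirmed values.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "  # ~11 tokens of padding

def build_needle_prompt(target_tokens: int, needle: str, depth: float = 0.5) -> str:
    """Build a haystack of roughly target_tokens (using a crude 4-chars-per-token
    heuristic) with `needle` inserted at fractional `depth`, followed by a
    retrieval question. Deterministic, so both deployments see identical input."""
    n_fillers = (target_tokens * 4) // len(FILLER)
    chunks = [FILLER] * n_fillers
    chunks.insert(int(len(chunks) * depth), needle + " ")
    return ("".join(chunks)
            + "\n\nWhat is the secret code mentioned above? "
              "Answer with the code only.")

def score(answer: str, secret: str) -> bool:
    """Did the model retrieve the needle?"""
    return secret in answer

# Hypothetical A/B loop (model ID and beta flag are assumptions; check the
# current Anthropic API docs before running, and repeat N times per arm since
# single samples are noisy):
#
# import anthropic
# client = anthropic.Anthropic()
# prompt = build_needle_prompt(165_000, "The secret code is ZETA-7734.")
# for label, extra in [("200k", {}), ("1M", {"betas": ["context-1m-2025-08-07"]})]:
#     resp = client.beta.messages.create(
#         model="claude-opus-4-6",  # assumed ID
#         max_tokens=64,
#         messages=[{"role": "user", "content": prompt}],
#         **extra,
#     )
#     print(label, score(resp.content[0].text, "ZETA-7734"))
```

Varying `depth` across runs (0.1, 0.5, 0.9) would also surface the positional effects that NoLiMa and the context-rot work describe, rather than testing a single needle position.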