Why don't large language models (LLMs) train on their own chain of thought?
I've noticed that without reasoning tokens, a model performs very poorly: it can't handle simple arithmetic or simple logic, and it hallucinates a bit. But if it's allowed to think for a while and then answer, the result is much better and far more trustworthy.

This suggests a clean RL environment, or at least a nice dataset: prompt the model twice, once with thinking disallowed and once with it allowed. If the no-thinking result contradicts the answer obtained with thinking, penalise it.
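The two-prompt penalty idea could be sketched roughly like this. Everything here is an assumption: `generate` stands in for whatever sampling API you use, the prompt wording is illustrative, and real setups would compare answers more robustly than exact string matching (e.g. with a verifier or an LLM judge).

```python
def consistency_reward(generate, question: str) -> float:
    """Reward the model's direct (no-thinking) answer for agreeing
    with its own reasoned answer.

    `generate(prompt)` is a hypothetical stand-in that returns the
    model's answer string for a given prompt.
    """
    # Prompt once with thinking disallowed...
    direct = generate(f"Answer directly, no reasoning:\n{question}")
    # ...and once with thinking allowed.
    reasoned = generate(
        f"Think step by step, then give only a final answer:\n{question}"
    )
    # Treat the reasoned answer as the reference; penalise contradiction.
    # (Exact match is a crude proxy; a judge model would be more robust.)
    return 1.0 if direct.strip() == reasoned.strip() else -1.0
```

This reward could then feed a standard policy-gradient loop, or simply be used to filter a distillation dataset down to questions the model can already answer reliably when it reasons.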