Why don't large language models (LLMs) train on their own chain of thought?
I've noticed that without reasoning tokens, a model performs very poorly: it can't handle simple arithmetic or simple logic, and it hallucinates a bit. But if it's allowed to think for a while and then answer, the result is much better and far more trustworthy.

This suggests a clean RL environment, or at least a nice dataset: prompt the model twice, once with thinking disallowed and once with it allowed. If the no-thinking result contradicts the answer obtained with thinking, penalise it.
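The two-prompt penalty idea could be sketched roughly like this. Everything here is an assumption: `generate` stands in for whatever sampling API you use, the prompt wording is illustrative, and real setups would compare answers more robustly than exact string matching (e.g. with a verifier or an LLM judge).

```python
def consistency_reward(generate, question: str) -> float:
    """Reward the model's direct (no-thinking) answer for agreeing
    with its own reasoned answer.

    `generate(prompt)` is a hypothetical stand-in that returns the
    model's answer string for a given prompt.
    """
    # Prompt once with thinking disallowed...
    direct = generate(f"Answer directly, no reasoning:\n{question}")
    # ...and once with thinking allowed.
    reasoned = generate(
        f"Think step by step, then give only a final answer:\n{question}"
    )
    # Treat the reasoned answer as the reference; penalise contradiction.
    # (Exact match is a crude proxy; a judge model would be more robust.)
    return 1.0 if direct.strip() == reasoned.strip() else -1.0
```

This reward could then feed a standard policy-gradient loop, or simply be used to filter a distillation dataset down to questions the model can already answer reliably when it reasons.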