Ask HN: Why don't we use "semantic tokenizers"/codebooks for text?

Author: gavinray · 2 months ago
BPE tokenizes subwords efficiently, but it has zero awareness of semantic structure -- it's purely optimizing the vocabulary-size/sequence-length tradeoff.

I read LanDiff [0], where they train a "semantic tokenizer" with codebooks that compresses 3D visual features into a 1D discrete token stream, then train an LM over those semantic tokens (~14,000× compression vs. raw visual features). The results beat Sora and models 3× its size.

So why can't we do the analogous thing for text? Learn a discrete semantic codebook over spans/phrases, reason over that compressed sequence, and decode back to natural language.

Is it that:

- text is already a high-density symbolic representation, so gains are marginal
- "semantic fidelity" is too hard to define for a lossy text codec
- scaling raw tokens keeps working, so nobody's motivated
- some combination of the above

I think the recent "neural codec" research (Meta BLT, DeepMind 2024) is somewhat similar to this, just applied to raw codec/signal data?

[0] https://arxiv.org/pdf/2503.04606
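To make the "discrete semantic codebook over spans" idea concrete, here's a minimal sketch of the generic VQ-VAE-style quantization step I have in mind -- this is not LanDiff's actual tokenizer, and the names (SpanQuantizer, codebook_size, the stand-in span encoder) are illustrative assumptions, not anything from the paper:

```python
# Minimal VQ-VAE-style codebook sketch over text-span embeddings (assumed names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=512, beta=0.25):
        super().__init__()
        # Learnable codebook: each row is one discrete "semantic token".
        self.codebook = nn.Embedding(codebook_size, dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment-loss weight, as in VQ-VAE

    def forward(self, span_emb):
        # span_emb: (batch, n_spans, dim) continuous embeddings of spans/phrases
        flat = span_emb.reshape(-1, span_emb.size(-1))
        # Nearest codebook entry for each span (L2 distance).
        dists = torch.cdist(flat, self.codebook.weight)
        ids = dists.argmin(dim=-1)                      # discrete token ids
        quantized = self.codebook(ids).view_as(span_emb)
        # VQ-VAE losses: pull codebook toward encoder outputs, and vice versa.
        loss = F.mse_loss(quantized, span_emb.detach()) \
             + self.beta * F.mse_loss(span_emb, quantized.detach())
        # Straight-through estimator so gradients still reach the span encoder.
        quantized = span_emb + (quantized - span_emb).detach()
        return quantized, ids.view(span_emb.shape[:-1]), loss

# Toy usage: random tensors stand in for a real span encoder's outputs.
# The LM would then be trained over `ids`, and a separate decoder would
# map codes back to natural-language text.
enc = torch.randn(2, 10, 512)
quantized, ids, vq_loss = SpanQuantizer()(enc)
print(ids.shape, vq_loss.item())
```

The hard part, as far as I can tell, isn't this quantization step (which is standard VQ-VAE machinery) but the lossy decode back to text and deciding what "close enough" means there -- which is basically my second bullet above.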