Ask HN: Why don't we use "semantic tokenizers"/codebooks for text?

Author: gavinray · 2 months ago
BPE tokenizes subwords efficiently, but it has zero awareness of semantic structure -- it's purely optimizing the vocabulary-size/sequence-length tradeoff.

I read LanDiff [0], where they train a "semantic tokenizer" with codebooks that compresses 3D visual features into a 1D discrete token stream, then train an LM over those semantic tokens (~14,000× compression vs. raw visual features). The results beat Sora and models 3× its size.

So why can't we do the analogous thing for text? Learn a discrete semantic codebook over spans/phrases, reason over that compressed sequence, and decode back to natural language.

Is it that:

- text is already a high-density symbolic representation, so gains are marginal
- "semantic fidelity" is too hard to define for a lossy text codec
- scaling raw tokens keeps working, so nobody's motivated
- some combination of the above

I think the recent "neural codec" research (Meta BLT, DeepMind 2024) is somewhat similar to this, just applied to raw codec/signal data?

[0] https://arxiv.org/pdf/2503.04606
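To make the "discrete semantic codebook over spans" idea concrete, here's a minimal sketch of the generic VQ-VAE-style quantization step I have in mind -- this is not LanDiff's actual tokenizer, and the names (SpanQuantizer, codebook_size, the stand-in span encoder) are illustrative assumptions, not anything from the paper:

```python
# Minimal VQ-VAE-style codebook sketch over text-span embeddings (assumed names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=512, beta=0.25):
        super().__init__()
        # Learnable codebook: each row is one discrete "semantic token".
        self.codebook = nn.Embedding(codebook_size, dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment-loss weight, as in VQ-VAE

    def forward(self, span_emb):
        # span_emb: (batch, n_spans, dim) continuous embeddings of spans/phrases
        flat = span_emb.reshape(-1, span_emb.size(-1))
        # Nearest codebook entry for each span (L2 distance).
        dists = torch.cdist(flat, self.codebook.weight)
        ids = dists.argmin(dim=-1)                      # discrete token ids
        quantized = self.codebook(ids).view_as(span_emb)
        # VQ-VAE losses: pull codebook toward encoder outputs, and vice versa.
        loss = F.mse_loss(quantized, span_emb.detach()) \
             + self.beta * F.mse_loss(span_emb, quantized.detach())
        # Straight-through estimator so gradients still reach the span encoder.
        quantized = span_emb + (quantized - span_emb).detach()
        return quantized, ids.view(span_emb.shape[:-1]), loss

# Toy usage: random tensors stand in for a real span encoder's outputs.
# The LM would then be trained over `ids`, and a separate decoder would
# map codes back to natural-language text.
enc = torch.randn(2, 10, 512)
quantized, ids, vq_loss = SpanQuantizer()(enc)
print(ids.shape, vq_loss.item())
```

The hard part, as far as I can tell, isn't this quantization step (which is standard VQ-VAE machinery) but the lossy decode back to text and deciding what "close enough" means there -- which is basically my second bullet above.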