Diffusion LLMs could make much of the AI engineering stack obsolete

Author: victorpiles99, about 2 months ago
I've been deep-diving into diffusion language models this week, and I think this is the most underrated direction in AI right now.

The core issue with autoregressive LLMs:

Every major model today (GPT, Claude, Gemini) generates one token at a time, left to right. Each token depends on the previous ones. This single architectural constraint has shaped the entire AI industry:

- Models can't revise what they already wrote → we build chain-of-thought, reflection, and multi-pass reasoning to force them to "think before committing"
- One forward pass per token → we invest heavily in speculative decoding, KV caches, and quantization to make generation tolerable
- Can't edit mid-output → we build agent frameworks with retry loops, tool calls, and planning layers to work around it
- Can't generate in parallel → we build orchestration systems that chain multiple slow calls together

Most of what we call "AI engineering" today is patching around one thing: the model can't look back.

Diffusion LMs flip the paradigm. Start with a canvas of masked tokens and iteratively refine the entire output in parallel. Every position is updated simultaneously, and the model sees and edits all of its output at every step. It's the same principle as image diffusion (Stable Diffusion, DALL-E), applied to text.

Why I think the theory actually holds:

1. Parallelism is real, not theoretical. Inception Labs' Mercury 2 (closed-source, diffusion-based) already hits ~1000 tok/s with quality competitive with GPT-4o mini on MMLU, HumanEval, and MATH. That's not a benchmark trick; it's a direct consequence of not being bottlenecked by sequential generation.

2. The complexity reduction is massive. If a model can see and edit its entire output at once, you don't need half the scaffolding we've built: reflection prompting becomes native (the model already iterates on its own output), retry loops become unnecessary (it edits in place), and planning agents get simpler (the model can restructure, not just append). The whole stack flattens.

3. The conversion path exists. You can take an existing pretrained AR model and convert it to a diffusion model via fine-tuning alone; no pretraining from scratch is needed. This means the billions already invested in AR pretraining aren't wasted. It's an upgrade path, not a restart.

The main limitation today: fixed output length. You must pre-allocate the canvas size before generation starts. Block Diffusion (generating in sequential chunks and diffusing within each chunk) is one workaround. Hierarchical generation (outline first, then expand sections in parallel) is another. Ironically, orchestrating that requires an agent, so diffusion doesn't kill agents; it changes what they do.

Honest take: open diffusion LMs still trail top AR models on knowledge and reasoning at comparable scale. But Mercury 2 shows the ceiling is high, the conversion results are surprisingly good, and the architecture eliminates entire categories of engineering complexity. I think within a year we'll see diffusion models competitive with frontier AR models, and when that happens, a lot of the current tooling (agent frameworks, prompt-engineering techniques, inference-optimization stacks) gets dramatically simpler or unnecessary.

While researching all this I found dLLM, an open-source library that unifies training, inference, and evaluation for diffusion LMs. It has recipes for LLaDA, Dream, and Block Diffusion, and for converting any AR model to diffusion.
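To make the mechanics concrete, here is a toy sketch of the masked-canvas refinement loop and the Block Diffusion workaround described above. Everything here is hypothetical illustration: `fake_model` stands in for a real diffusion LM's forward pass, the token names and confidence scores are made up, and the commit schedule is a simplified version of the low-confidence remasking used by models like LLaDA, not any specific library's API.

```python
import random

MASK = "<mask>"

def fake_model(canvas):
    # Stand-in for the denoiser: propose a (token, confidence) pair for
    # every masked slot. A real diffusion LM would run one parallel
    # forward pass over the whole canvas here, attending to every
    # position at once — which is where the speedup comes from.
    return {i: (f"tok{i}", random.random())
            for i, t in enumerate(canvas) if t == MASK}

def diffuse(canvas, steps):
    # Iterative refinement: each step commits only the highest-confidence
    # fraction of proposals and leaves the rest masked, so later steps
    # can revise them in the context of what was already committed.
    for step in range(steps):
        proposals = fake_model(canvas)
        if not proposals:
            break
        k = max(1, len(proposals) // (steps - step))
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _conf) in ranked[:k]:
            canvas[i] = tok
    for i, (tok, _conf) in fake_model(canvas).items():
        canvas[i] = tok  # fill any positions still masked at the end
    return canvas

def block_diffusion_decode(num_blocks=3, block_len=4, steps=4, seed=0):
    # Block Diffusion workaround for the fixed-canvas limit: allocate one
    # fixed-size block at a time (sequential across blocks), but diffuse
    # all positions in parallel within each block.
    random.seed(seed)
    out = []
    for _ in range(num_blocks):
        out += diffuse([MASK] * block_len, steps)
    return out
```

The key contrast with AR decoding is inside `diffuse`: the whole canvas is re-scored every step, so an early "commit" is not final until the last step, whereas an AR model can only ever append.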
Good starting point if you want to experiment.

Paper: https://arxiv.org/abs/2602.22661
Code: https://github.com/ZHZisZZ/dllm
Models: https://huggingface.co/dllm-hub