Speculative Decoding | vLLM Study Notes

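A small model (the draft model) quickly proposes candidate tokens, and the large model (the target model) verifies them in parallel, speeding up inference without any loss in output quality.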

Why Speculative Decoding Is Needed

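Autoregressive decoding produces only one token per step, so GPU utilization is very low (decoding is memory-bound). Speculative decoding has the draft model generate K candidate tokens at a time, and the target model verifies all of them in a single forward pass. Candidates that pass verification are accepted; the first rejected candidate and everything after it are discarded. When the average acceptance rate is high, each target step effectively yields several tokens, which significantly reduces latency.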
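To make "several tokens per step" concrete, here is a rough estimate of the expected number of tokens emitted per target forward pass. It is a sketch under a simplifying assumption that is not part of the notes above: each candidate is accepted independently with the same probability `alpha`. The function name and parameters are illustrative only.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha. The target always contributes one extra token
    (a correction on rejection, or a bonus token if all k are accepted),
    so the expectation is 1 + alpha + alpha^2 + ... + alpha^k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Closed form of the geometric sum above.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 3))
```

Under this model the gain saturates as K grows when the acceptance rate is modest, which is one reason K is usually kept small.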

Core Principles

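Draft-then-Verify: the draft model autoregressively generates K tokens, then the target model runs one forward pass over K+1 positions to obtain the probability distribution at each position.
Rejection sampling: a candidate token x is accepted with probability min(1, p(x)/q(x)), where p is the target model's probability and q is the draft model's. It is therefore always accepted when p(x) >= q(x), and accepted with probability p(x)/q(x) otherwise. On rejection, a replacement token is sampled from the residual distribution proportional to max(0, p - q), and the remaining candidates are discarded.
Lossless guarantee: it can be proven mathematically that the output distribution is exactly the same as when using the target model alone.
Multiple draft sources: a small LLM, Medusa heads, EAGLE, n-gram guessing, and others can all serve as the draft source.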
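The accept/reject rule can be sketched in a few lines of plain Python. This is a minimal illustration of the standard speculative sampling rule, not vLLM's implementation; the function names `sample` and `verify` and the list-of-lists representation of distributions are assumptions made for the example.

```python
import random
from typing import List, Sequence


def sample(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one token id from an (unnormalized) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def verify(
    draft_tokens: Sequence[int],
    q_dists: Sequence[Sequence[float]],  # K draft distributions
    p_dists: Sequence[Sequence[float]],  # K+1 target distributions
    rng: random.Random,
) -> List[int]:
    """Return the tokens emitted for one draft-then-verify step."""
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # All K candidates accepted: take a bonus token from position K+1.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
q = [0.6, 0.3, 0.1]
p = [0.2, 0.5, 0.3]
draft = [sample(q, rng)]
print(verify(draft, [q], [p, p], rng))
```

Resampling from the residual rather than from p itself is what keeps the emitted tokens distributed exactly according to the target model.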

Implementation in the Source Code

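vllm/spec_decode/: the core implementation directory for speculative decoding.
vllm/spec_decode/multi_step_worker.py: the multi-step worker drives the draft model through multiple generation steps.
vllm/spec_decode/spec_decode_worker.py: the main logic by which the target worker verifies candidate tokens.
vllm/spec_decode/batch_expansion.py: expands the K candidate tokens into a batch so they can be verified in parallel.
vllm/config.py: SpeculativeConfig defines the draft model, num_speculative_tokens, and other parameters.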
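The idea behind batch expansion can be illustrated as follows. This is a conceptual sketch only and does not reproduce the code in batch_expansion.py; the function name `expand_for_scoring` and the use of plain token-id lists are assumptions made for the example.

```python
from typing import List


def expand_for_scoring(context: List[int], draft_tokens: List[int]) -> List[List[int]]:
    """Turn one sequence with K draft tokens into K+1 scoring sequences.

    Row i holds the context followed by the first i draft tokens. Scoring
    the last position of row i yields the target distribution used to
    verify draft token i; the final row yields the bonus-token distribution.
    """
    return [context + draft_tokens[:i] for i in range(len(draft_tokens) + 1)]


for row in expand_for_scoring([1, 2], [7, 8, 9]):
    print(row)
```

The K+1 rows are independent of one another, so the target model can score them together as one batch.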
