Speculative Decoding: How Draft-and-Verify Speeds Up LLM Inference in 2026
- Mark Chomiczewski
- 6 May 2026
Have you ever noticed how an AI assistant pauses for a second before finishing your sentence? That delay isn't just thinking time; it's the bottleneck of sequential token generation. In traditional autoregressive models, every single word is generated one after another. The model predicts the next token, outputs it, feeds it back into itself, and repeats. This linear process creates a massive memory-bound constraint that slows down response times, especially as context windows grow longer.
In 2026, this latency is no longer acceptable for production-grade applications. Users expect instant responses. Developers need higher throughput without burning through GPU budgets. Enter speculative decoding, a technique that uses a draft-and-verify pipeline to accelerate large language model inference without sacrificing output quality. By leveraging a smaller, faster model to propose multiple tokens at once, and then having the larger target model verify them in parallel, we can achieve speedups of 2-3× or more. It’s not magic; it’s clever engineering rooted in computer architecture principles.
The Core Problem: Memory-Bound Inference
To understand why speculative decoding matters, you first have to look at what limits Large Language Model (LLM) performance today. It’s not compute power. Modern GPUs like the NVIDIA A100 or H100 are incredibly fast at matrix multiplication. The real bottleneck is memory bandwidth. Every time the model generates a token, it must load its entire parameter set from high-speed VRAM into the processing units. For a model with 70 billion parameters, this happens sequentially for every single word.
This creates a strict ceiling on throughput. If your model takes 50 milliseconds to generate one token, you’re stuck with 20 tokens per second, regardless of how powerful your hardware is. Traditional optimization techniques like quantization or kernel fusion help, but they hit diminishing returns quickly. Speculative decoding attacks the problem differently. Instead of trying to make the big model faster, it reduces the number of times the big model needs to run at all.
How Draft-and-Verify Works
Speculative decoding operates on a simple loop with three distinct phases: draft generation, parallel verification, and rejection sampling. Think of it like a junior intern drafting an email while the senior manager reviews it. The intern writes quickly but might make mistakes. The manager reads the whole draft at once, approves the correct parts, and fixes the errors.
- Draft Generation: A smaller, lighter "draft" model (like Gemma-2B or Llama-3-8B) predicts the next K tokens (typically between 3 and 10) in rapid succession. Because this model is small, it runs much faster than the target model.
- Parallel Verification: The larger "target" model (like Llama-3-70B or Mixtral-8x7B) receives the input sequence plus all K draft tokens. Crucially, transformer architectures allow the model to compute probability distributions for the next token at *every* position in the sequence simultaneously. So, in one forward pass, the target model evaluates all K proposed tokens.
- Rejection Sampling: The system compares the probabilities assigned by the draft model against those assigned by the target model. If the target model agrees or is more confident, the token is accepted. If the target model disagrees significantly, the token is rejected, and the speculation stops there. The target model then generates the correct next token, and the cycle restarts.
This mechanism guarantees that the final output distribution matches exactly what the target model would have produced if it had generated each token individually. You get the speed of the small model with the accuracy of the large one.
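To make the loop concrete, here is a minimal Python sketch of one draft-and-verify iteration. It is framework-agnostic: `draft_model`, `target_model`, and their `next_token_probs` method are hypothetical stand-ins for your inference stack, and a real implementation would score all K draft positions in a single batched forward pass rather than in a loop.

```python
import random

def speculative_decode_step(prefix, draft_model, target_model, k=5):
    """One draft-and-verify iteration; returns the tokens emitted this step.

    `draft_model` and `target_model` are hypothetical objects exposing
    `next_token_probs(tokens) -> dict[token, prob]`. A real system would
    score all k draft positions in one batched forward pass.
    """
    # Phase 1: the small draft model proposes k tokens autoregressively.
    draft_tokens, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(k):
        probs = draft_model.next_token_probs(ctx)
        token = max(probs, key=probs.get)  # greedy drafting for simplicity
        draft_tokens.append(token)
        draft_probs.append(probs[token])
        ctx.append(token)

    # Phase 2: the target model scores every draft position
    # (written as a loop here, but computed in parallel in practice).
    target_dists = [
        target_model.next_token_probs(list(prefix) + draft_tokens[:i])
        for i in range(k + 1)
    ]

    # Phase 3: accept or reject each draft token, left to right.
    accepted = []
    for i, token in enumerate(draft_tokens):
        p_target = target_dists[i].get(token, 0.0)
        p_draft = draft_probs[i]
        if p_target >= p_draft or random.random() < p_target / p_draft:
            accepted.append(token)
        else:
            break  # first rejection discards this and every later draft token

    # The target's distribution at the next position yields either the "bonus"
    # token (all drafts accepted) or the replacement token after a rejection.
    # (Simplified: the exact algorithm samples the replacement from an
    # adjusted residual distribution rather than taking a plain argmax.)
    next_dist = target_dists[len(accepted)]
    accepted.append(max(next_dist, key=next_dist.get))
    return accepted
```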
The Math Behind Rejection Sampling
You don’t need to be a mathematician to implement this, but understanding the logic helps when tuning your pipeline. The core decision rule relies on comparing two probabilities: $P_{draft}$ (the probability the draft model assigned to its chosen token) and $P_{target}$ (the probability the target model assigns to that same token).
Here is how the acceptance works:
- If $P_{target} \ge P_{draft}$, the token is always accepted. The target model either agrees or likes the choice even better than the draft model did.
- If $P_{target} < P_{draft}$, the token is accepted with probability $P_{target} / P_{draft}$. If it’s rejected, that token and all subsequent draft tokens are discarded, and the target model generates the replacement token itself; in the full algorithm, that replacement is sampled from an adjusted distribution proportional to $\max(0, P_{target} - P_{draft})$, which is what keeps the overall output distribution exact. The speculation loop then resets.
Let’s look at a concrete example. Suppose the draft model proposes the sequence "discovered a breakthrough."
- Token 1: "discovered" ($P_{draft}=0.6$, $P_{target}=0.8$). Accepted because $0.8 \ge 0.6$.
- Token 2: "a" ($P_{draft}=0.7$, $P_{target}=0.75$). Accepted because $0.75 \ge 0.7$.
- Token 3: "breakthrough" ($P_{draft}=0.5$, $P_{target}=0.2$). Since $0.2 < 0.5$, this token survives only with probability $0.2 / 0.5 = 0.4$; suppose the coin flip fails, so it is rejected. The system discards "breakthrough" and any further drafts, and the target model generates "new" instead.
The final output becomes "discovered a new...". Notice that the user sees no difference in quality, but the system processed three tokens in the time it usually takes to process one.
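The decision rule itself takes only a few lines of Python. The token strings and probabilities below are just the illustrative numbers from the walkthrough, not real model outputs:

```python
import random

def accept(p_draft: float, p_target: float) -> bool:
    """Accept the draft token outright if p_target >= p_draft,
    otherwise keep it with probability p_target / p_draft."""
    return p_target >= p_draft or random.random() < p_target / p_draft

# Probability pairs taken from the walkthrough above.
drafts = [("discovered", 0.6, 0.8), ("a", 0.7, 0.75), ("breakthrough", 0.5, 0.2)]

for token, p_draft, p_target in drafts:
    if accept(p_draft, p_target):
        print(f"accepted {token!r}")
    else:
        print(f"rejected {token!r} (survival probability was {p_target / p_draft:.2f})")
        break  # everything after the first rejection is discarded
```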
Performance Gains and Best-Case Scenarios
So, how much faster does this actually make your application? The answer depends on the acceptance rate: the percentage of draft tokens the target model approves. In the best-case scenario, where the target model accepts all K draft tokens, you generate $K+1$ tokens in a single iteration. Why $K+1$? Because while verifying the K drafts, the target model also computes the probability for the $(K+1)^{th}$ token, which is then sampled and added to the sequence.
If you set $K=5$ and achieve 100% acceptance, you produce six tokens per target forward pass. Compared to standard generation, which produces one token per pass, this is a theoretical 6× speedup. In practice, acceptance rates rarely hit 100%, but they often sit between 60% and 90% depending on the complexity of the task. Production systems using frameworks like vLLM have demonstrated consistent 2-3× improvements in both throughput and latency on clusters of 8 A100 GPUs.
| Acceptance Rate | Tokens Generated per Iteration (K = 5) | Theoretical Speedup |
|---|---|---|
| 100% | 6 | 6.0× |
| 80% | ~4.8 | 4.8× |
| 60% | ~3.6 | 3.6× |
| 40% | ~2.4 | 2.4× |
These gains are most pronounced in long-context scenarios where the overhead of loading parameters dominates the computation time. For short queries, the benefit is less dramatic but still measurable.
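The table's figures follow a simple back-of-the-envelope rule, tokens per iteration ≈ acceptance rate × (K + 1), which ignores the draft model's own cost. The snippet below just reproduces that arithmetic so you can plug in your own K and measured acceptance rate:

```python
def tokens_per_iteration(acceptance_rate: float, k: int = 5) -> float:
    """Rough estimate used in the table above: acceptance_rate * (k + 1),
    where the +1 is the bonus token computed during verification."""
    return acceptance_rate * (k + 1)

for rate in (1.0, 0.8, 0.6, 0.4):
    tokens = tokens_per_iteration(rate)
    # Standard decoding yields 1 token per target forward pass, so
    # tokens-per-iteration doubles as the theoretical speedup factor.
    print(f"{rate:.0%} acceptance -> {tokens:.1f} tokens/iteration (~{tokens:.1f}x)")
```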
Medusa Architecture: An Evolution
While traditional speculative decoding uses two separate models (a draft model and a target model), researchers have proposed an alternative called Medusa: an architecture that adds multiple prediction heads directly onto the base LLM to predict future tokens in a single pass. Introduced in a 2024 ICML paper, Medusa eliminates the need for a separate speculator model entirely.
Instead of running a small model sequentially, Medusa attaches several lightweight feed-forward layers (prediction heads) to the last hidden layer of the base LLM. When the model processes the current context, these heads simultaneously predict the next K tokens, creating a tree of possible continuations. This approach reduces the computational overhead of switching between models and allows for tighter integration within existing serving infrastructure.
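As a rough illustration of the idea (not the actual Medusa implementation), each extra head can be thought of as a small feed-forward block that maps the base model's final hidden state to logits for a token a fixed number of positions ahead. The `hidden_size`, `vocab_size`, and head count below are placeholders:

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """Toy sketch: K extra heads that each predict the token
    (i + 1) positions ahead from the base model's last hidden state."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        )

    def forward(self, last_hidden_state: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden_state: (batch, hidden_size) for the current position.
        # Each head returns logits over the vocabulary for one future position.
        return [head(last_hidden_state) for head in self.heads]
```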
However, Medusa requires modifying the model weights during training or fine-tuning, whereas traditional speculative decoding can be applied to any pre-trained model pair without retraining. This makes traditional speculative decoding more flexible for organizations using proprietary or closed-source models where access to internal weights is restricted.
Implementation in Production Frameworks
You don’t need to build this from scratch. Major inference frameworks have integrated speculative decoding support out of the box.
vLLM offers highly optimized speculative decoding specifically designed to reduce inter-token latency under medium-to-low query-per-second (QPS) workloads. It handles the complex scheduling and memory management required to run the draft and target models efficiently on shared GPU resources. vLLM supports various draft models, including distilled versions of the target model or completely different architectures.
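Exact argument names vary between vLLM releases, so treat the following as a sketch rather than copy-paste configuration: older versions exposed `speculative_model` and `num_speculative_tokens` directly on the `LLM` constructor, while newer releases bundle the same settings into a `speculative_config` dict, so check the documentation for your installed version. The model pair and parallelism settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Sketch only: argument names differ between vLLM versions
# (newer releases take a `speculative_config` dict instead).
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",              # target model
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",   # draft model
    num_speculative_tokens=5,                                  # K draft tokens per step
    tensor_parallel_size=8,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
result = llm.generate(["Explain speculative decoding in two sentences."], params)
print(result[0].outputs[0].text)
```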
Hugging Face Transformers implements this feature under the name "assisted generation." It allows developers to pass a smaller `assistant_model` alongside their main model. The library manages the draft-and-verify loop internally, making it accessible via standard `generate()` calls. This is particularly useful for experimentation and prototyping.
A common production setup involves pairing a smaller model such as Gemma-2-2B-it (a lightweight instruction-tuned model suited to draft generation) with a larger model such as Gemma-2-9B-it (a more capable model used for verification and final output). This combination provides a sweet spot between speed and quality for many chatbot and summarization tasks.
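With Transformers, the pairing described above reduces to passing the smaller checkpoint as `assistant_model`. A minimal sketch, assuming you have access to both Gemma 2 checkpoints and enough GPU memory to hold them side by side:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
target = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Summarize speculative decoding in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# Passing assistant_model switches generate() into assisted (draft-and-verify) mode.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```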
Edge Devices and On-Device Applications
Speculative decoding isn’t just for cloud data centers. It has significant implications for edge computing and mobile devices. Running full-sized LLMs on phones or laptops is resource-intensive. With speculative decoding, most of the token-by-token work shifts to a small local draft model, while a larger model (either a slightly larger local one or a remote one) steps in only to verify batches of draft tokens.
On-device applications like language translators, coding assistants, and interactive games benefit from reduced memory requirements. The draft model fits comfortably in RAM, while the verification step can be handled by a more powerful component or optimized via quantization. This hybrid approach enables near-real-time interactions on hardware that previously couldn’t support such workloads.
When Not to Use Speculative Decoding
Despite its benefits, speculative decoding isn’t a silver bullet. There are scenarios where it adds unnecessary complexity or overhead.
- Short Contexts: If your inputs are very short (e.g., few-shot classification), the overhead of managing two models may outweigh the benefits of parallel token generation.
- Highly Creative Tasks: Tasks requiring high perplexity or unpredictable outputs often result in lower acceptance rates. If the draft model frequently guesses wrong, the verification cost cancels out the speed gains.
- Hardware Constraints: If you don’t have enough VRAM to hold both the draft and target models simultaneously, speculative decoding won’t work unless you use aggressive quantization or paging strategies.
Always benchmark your specific workload. Measure the baseline latency without speculative decoding, then test with a draft model that is roughly 1/4 to 1/10 the size of your target model. Monitor the acceptance rate closely; if it drops below 40%, consider switching to a better-aligned draft model or reducing the draft length K.
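A minimal way to run that before/after comparison with the Transformers setup from the earlier sketch: `target`, `draft`, `tokenizer`, and `inputs` are the hypothetical objects defined there, and a real benchmark should average over many prompts and include warm-up runs.

```python
import time

def time_generation(model, inputs, assistant=None, max_new_tokens=128):
    """Return tokens/second for one generation call (single run, no warm-up)."""
    start = time.perf_counter()
    out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

baseline = time_generation(target, inputs)                       # target model alone
speculative = time_generation(target, inputs, assistant=draft)   # draft-and-verify
print(f"baseline: {baseline:.1f} tok/s, speculative: {speculative:.1f} tok/s "
      f"({speculative / baseline:.2f}x)")
```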
Does speculative decoding change the output quality of the LLM?
No. Speculative decoding uses rejection sampling to ensure that the final output distribution is identical to what the target model would produce if it generated tokens sequentially. The only difference is speed, not content quality.
What is the best draft model to use?
The ideal draft model is smaller than the target model but trained on similar data. Distilled versions of the target model often work best because they mimic its behavior closely, leading to higher acceptance rates. Models like Gemma-2B or Llama-3-8B are popular choices for larger targets.
Can I use speculative decoding with any LLM framework?
Most modern frameworks support it. vLLM, Hugging Face Transformers (as assisted generation), and TensorRT-LLM all offer implementations. Check your framework’s documentation for specific configuration flags related to speculative decoding or draft models.
How does Medusa differ from traditional speculative decoding?
Traditional speculative decoding uses two separate models (draft and target). Medusa adds prediction heads directly to the base model, allowing it to predict multiple future tokens in a single pass without a separate speculator. Medusa requires model modification, while traditional methods do not.
What is a good acceptance rate for speculative decoding?
An acceptance rate above 60% is generally considered effective. Rates between 80% and 90% indicate excellent alignment between the draft and target models. Below 40%, the overhead of verification may negate the speed benefits.