This project explores speculative decoding with an extra layer of verifier guidance to speed up autoregressive generation while keeping outputs faithful to the target model.
Here’s the big picture:
- A drafter proposes multiple next tokens cheaply.
- A target model remains the source of truth.
- A verifier (lightweight logic + optional scoring head) guides which drafted tokens are worth sending to the target for confirmation and how we schedule verification.
- Two structural policies—Tree and Cascade—shape how drafts are organized and checked.
The repo includes an executable notebook, raw benchmark CSV, and a summary figure of results.
Speculative decoding gives real speedups by letting a smaller model “draft” ahead. The catch is deciding which drafts are likely to be accepted by the target and how to verify them efficiently. That decision is what this project focuses on.
Verifier guidance is the idea that you can improve acceptance and scheduling by using:
- inexpensive signals (probabilities, entropy, agreement between models),
- structures (Tree vs Cascade) that control how drafts flow to the target,
- and policies that adapt based on uncertainty.
At each decoding step we:
- Use the drafter `D` to propose a block of `k` tokens and their probabilities.
- Compute cheap verifier signals `s` for each proposed token (examples below).
- Filter, order, or regroup proposals using a Tree or Cascade policy.
- Ask the target model `T` to verify the ordered set until a mismatch.
- Accept the verified prefix and continue.
The verifier isn’t a separate heavy model; it’s a light decision layer that runs before we spend target compute.
Typical verifier signals:
- drafter probability `p_D(y_i)` and entropy for the drafted token `y_i`
- agreement: `p_D(y_i)` vs a cached estimate of `p_T(y_i)` from the previous step
- local constraints: repetition penalties, n-gram bans, top-p/top-k gates
- learned score (optional): a tiny head trained to predict acceptance from drafter features
The target still makes the final call. The verifier simply prioritizes what to check first.
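The entropy and agreement signals above can be sketched in a few lines. This is a minimal per-token illustration; the function name, signature, and the SD-style ratio used for agreement are assumptions for exposition, not the repo's actual API (the notebook's `compute_signals` operates on whole draft blocks).

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def token_signals(drafted_idx, drafter_probs, cached_target_probs):
    """Cheap verifier signals for one drafted token.

    drafted_idx         -- vocab index of the drafted token
    drafter_probs       -- drafter's next-token distribution
    cached_target_probs -- target distribution cached from the previous step
    """
    p_d = drafter_probs[drafted_idx]
    p_t = cached_target_probs[drafted_idx]
    return {
        "p_d": p_d,
        "entropy": entropy(drafter_probs),       # low = confident draft
        # SD-style acceptance ratio min(p_T/p_D, 1) as an agreement proxy
        "agreement": min(p_t / p_d, 1.0) if p_d > 0 else 0.0,
    }
```

A low-entropy, high-agreement token is a good candidate to send to the target first.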
Think of the drafted block as a small branching tree:
- Start with the most promising token (lowest entropy / highest score).
- If it is verified, expand to the next node; if not, backtrack to a sibling.
- This concentrates target calls on the most likely path first.
When it helps: the drafter is often “close” to the target; a small number of high-confidence branches cover most acceptance.
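The walk described above can be sketched as follows. The node layout (dicts with `token`, `score`, `children`) and the `target_accepts` callback are illustrative assumptions, not the repo's data structures.

```python
def verify_tree(root_children, target_accepts, prefix):
    """Walk a draft tree best-child-first; on rejection, try the next
    sibling; stop when no sibling at the current depth is accepted.

    root_children  -- list of nodes: {"token", "score", "children"}
    target_accepts -- callback (prefix, token) -> bool (target's decision)
    Returns the accepted token path (possibly empty).
    """
    accepted = []
    children = root_children
    while children:
        advanced = False
        # Most promising sibling first (highest score / lowest entropy)
        for node in sorted(children, key=lambda n: -n["score"]):
            if target_accepts(prefix + accepted, node["token"]):
                accepted.append(node["token"])
                children = node.get("children", [])
                advanced = True
                break
        if not advanced:
            break  # every sibling rejected: stop here
    return accepted
```

Because the best-scoring branch is checked first, target calls concentrate on the path most likely to be accepted.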
A multi-stage filter:
- Stage 1 (cheap): pure drafter heuristics (entropy, top-k gate, repetition).
- Stage 2 (still cheap): agreement checks and local constraints.
- Stage 3 (less cheap): optional small learned score (single forward of a tiny head).
- Target verification: only for candidates that clear earlier stages.
When it helps: you want predictable cost by trimming drafts aggressively before any target work.
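A minimal sketch of the first two stages of such a cascade; the threshold values and signal-dict shape here are illustrative assumptions, not tuned numbers from the repo.

```python
def cascade_filter(drafts, signals, max_entropy=2.5, min_agreement=0.3):
    """Multi-stage cascade: cheap gates first, survivors move on.

    drafts  -- drafted tokens in draft order
    signals -- per-token dicts with "entropy" and "agreement" keys
    """
    # Stage 1 (cheapest): drop high-entropy (low-confidence) drafts
    survivors = [(t, s) for t, s in zip(drafts, signals)
                 if s["entropy"] <= max_entropy]
    # Stage 2: drop drafts the cached target estimate disagrees with
    survivors = [(t, s) for t, s in survivors
                 if s["agreement"] >= min_agreement]
    # Stage 3 (optional learned score) would go here.
    return [t for t, _ in survivors]
```

Only tokens that clear every stage reach target verification, which keeps target cost predictable.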
```python
def verifier_guided_sd_step(prefix, k, policy):
    # 1) draft tokens with the small model
    drafts, p_d = drafter(prefix, k)  # tokens and probabilities

    # 2) compute fast verifier signals
    signals = compute_signals(prefix, drafts, p_d)  # entropy, agreement, constraints, etc.

    # 3) order/filter/regroup based on policy = {"tree" | "cascade" | "full"}
    candidates = policy_schedule(drafts, signals, policy)

    # 4) verify with target until mismatch
    accepted = []
    for tok in candidates:
        if target_accepts(prefix, tok):  # standard SD acceptance check
            accepted.append(tok)
            prefix = prefix + [tok]
        else:
            break

    # 5) return accepted prefix (possibly empty)
    return accepted
```

Implementation details live in the notebook:
- Drafting: batching + KV-cache reuse
- Signal computation: entropy, agreement, constraints
- Tree/Cascade schedulers: define how drafts are organized before verification
- Acceptance: always target-based, SD-compatible
Benchmarks are reproducible from:
- `benchmark_analysis.csv` — raw results
- `Verifier Guided SD (4).ipynb` — end-to-end runs + plots
- `Verifier Guided SD.png` — summary figure
Metrics we track:
- Throughput (tokens/sec): wall-clock speed while generating full sequences
- Relative performance: normalized vs a fixed reference
- Draft acceptance: mean accepted tokens per verification block
- Model-call efficiency: drafter/target calls per generated sequence
- Stability: std-dev of throughput across runs
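As a toy illustration of how these aggregates can be computed from per-run logs (helper name and input shapes are assumptions, not code from the notebook):

```python
from statistics import mean, stdev

def summarize_runs(accepted_per_block, tokens_per_sec):
    """Aggregate the tracked metrics across runs.

    accepted_per_block -- accepted tokens in each verification block
    tokens_per_sec     -- wall-clock throughput of each run
    """
    return {
        "draft_acceptance": mean(accepted_per_block),   # mean accepted/block
        "throughput_mean": mean(tokens_per_sec),
        # stability: run-to-run spread of throughput
        "throughput_std": stdev(tokens_per_sec) if len(tokens_per_sec) > 1 else 0.0,
    }
```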
I focus on how policies change acceptance and scheduling, not on declaring any baseline “good” or “bad.” The figure below summarizes the current prototype’s behavior across variants.
- `Verifier Guided SD (4).ipynb` — main notebook with implementation, experiments, plots
- `benchmark_analysis.csv` — raw benchmark data used in the figures
- `Verifier Guided SD.png` — combined visualization of results
This is a prototype aimed at understanding policy design. A few intentional trade-offs and known gaps:
- **Scheduling is not fully optimized.** The balance between drafter steps and target verifications is heuristic and not yet tuned per-sequence.
- **Verifier signals are simple.** Entropy/agreement/constraints are cheap and fast, but a learned score could do better.
- **Single-GPU, Python-level orchestration.** There's room for lower-level optimizations (CUDA streams, fused ops, more aggressive KV-cache management).
- **Chunking strategy is basic.** Draft verification is mostly linear; more advanced chunk acceptance (verify N at once) is on the roadmap.
- **Limited model/dataset sweep.** The goal so far was policy behavior, not exhaustive cross-model generalization.
- **Adaptive scheduling**
  - Adjust draft length `k` and policy choice on the fly using uncertainty signals.
  - Early-exit rules when confidence is high to avoid unnecessary target calls.
- **Learned verifier head**
  - A tiny classifier trained to predict acceptance from drafter features, calibrated with lightweight distillation from the target.
- **Better chunk verification**
  - Verify multiple tokens as a block when signals agree; fall back to token-wise checks when they don't.
- **System optimizations**
  - Overlap drafter and target with CUDA streams.
  - Tighter KV-cache reuse and prefetch.
  - Batched verifications across sequences for higher hardware utilization.
- **Broader evaluation**
  - Sweep model sizes, prompts, and decoding temperatures.
  - Include latency-critical settings and streaming output.
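As a concrete sketch of the adaptive-scheduling idea (a roadmap item, not implemented in the repo; the rule and thresholds below are assumptions for illustration):

```python
def adapt_draft_length(k, block_entropy, accepted, lo=1, hi=8,
                       entropy_gate=2.0):
    """One possible adaptive-k rule: grow the draft block after a fully
    accepted, low-uncertainty step; shrink it after heavy rejection or
    high uncertainty; otherwise keep it.

    k             -- current draft length
    block_entropy -- mean drafter entropy over the last block
    accepted      -- tokens the target accepted from the last block
    """
    if accepted == k and block_entropy < entropy_gate:
        return min(k + 1, hi)   # drafter is reliable here: draft more
    if accepted < k // 2 or block_entropy >= entropy_gate:
        return max(k - 1, lo)   # misses or uncertainty: draft less
    return k
```

Running this between steps lets `k` track how well the drafter is matching the target on the current sequence.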
```bash
# 1) clone
git clone https://github.com/adityakamat24/verifier-guided-sd.git
cd verifier-guided-sd

# 2) (optional) create a fresh env
python -m venv .venv && source .venv/bin/activate  # on Windows: .venv\Scripts\activate

# 3) install typical deps
pip install torch torchvision torchaudio \
    transformers \
    numpy pandas matplotlib jupyter

# 4) run the notebook
jupyter notebook "Verifier Guided SD (4).ipynb"
```

Is the verifier a second big model?
No. It’s a light policy layer (plus an optional tiny head) that ranks/filters drafts before we ask the target.
Does the target still decide correctness?
Yes. Final acceptance always uses the target in a standard SD-compatible way.
Can this work with my favorite LM?
If it supports next-token probabilities and KV-cache, it should be straightforward. The logic is model-agnostic.
