AI Research · 70% · 1 min read · Jun 25, 2026, 9:55 PM

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS


30-second summary

JetSpec introduces a novel speculative decoding method using parallel tree drafting to achieve up to 9.64x lossless speedup in LLM inference, reaching over 1000 TPS on a single B200 GPU.


Key takeaways

›JetSpec achieves up to 9.64x lossless speedup in LLM inference using parallel tree drafting for speculative decoding.
›Performance gains demonstrated on MATH-500 (9.64x) and open-ended chat (4.58x) benchmarks.
›Throughput exceeds 1000 TPS on a single B200 GPU with CUDA optimizations.
›Method maintains lossless generation while optimizing drafting cost and quality.
›Builds on prior speculative decoding work but introduces parallel tree drafting for efficiency.
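
To see why drafting cost and drafting quality both matter, a rough back-of-the-envelope model helps. The sketch below uses the textbook speculative decoding estimate, not JetSpec's own cost model: it assumes each drafted token is accepted independently with probability `alpha`, and that drafting costs some fraction of a target-model forward pass. The names `alpha`, `k`, and `draft_cost`, and the numbers in the loop, are illustrative assumptions rather than figures from the source.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (a simplification), and that the target model always
    contributes one token of its own at the end of the accepted run.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain one-token-at-a-time decoding.

    draft_cost is the total drafting time per step, measured in units of one
    target-model forward pass. The extra 1.0 is the verification pass itself.
    """
    return expected_tokens_per_step(alpha, k) / (1.0 + draft_cost)


if __name__ == "__main__":
    k = 8  # hypothetical draft length
    for alpha in (0.6, 0.8, 0.9):
        # Sequential drafting: k drafter calls, each assumed to cost 5% of a target pass.
        sequential = estimated_speedup(alpha, k, draft_cost=k * 0.05)
        # One-shot drafting: all k tokens from a single drafter call (same 5% assumption).
        one_shot = estimated_speedup(alpha, k, draft_cost=0.05)
        print(f"alpha={alpha}: sequential ~{sequential:.2f}x, one-shot ~{one_shot:.2f}x")
```

Under these assumptions the estimate rises with the acceptance rate (draft quality) and falls as drafting gets more expensive, which is the trade-off the takeaways describe.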

Full story

Researchers have developed JetSpec, a speculative decoding framework that optimizes both drafting cost and drafting quality through causal parallel tree drafting. The method is reported to deliver lossless inference speedups of up to 9.64x on MATH-500 and 4.58x on open-ended chat benchmarks. With CUDA graph and kernel optimizations, JetSpec exceeds 1000 tokens per second (TPS) on a single NVIDIA B200 GPU. The approach targets a known limitation of speculative decoding, the cost of producing good drafts, while keeping the generated output unchanged.
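
For readers new to the idea, the draft-and-verify loop behind speculative decoding can be sketched in a few lines. This is a minimal illustration of generic speculative decoding under greedy decoding, not JetSpec's implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for a large model and a cheap drafter, and the verification loop calls the target once per position where a real system would score all positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (sum(ctx) * 31 + 7) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the cheap drafter: agrees with the target except
    when the context length is 3 modulo 4."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 3 else tok


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of verify steps)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(out) < goal:
        # 1) Draft k tokens with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify. Each verify step stands for one target forward pass.
        steps += 1
        for i in range(k + 1):
            expected = target(out + proposal[:i])
            if i < k and proposal[i] == expected:
                continue  # drafted token matches the target's choice: accept it
            # First mismatch (or end of draft): keep the accepted prefix and
            # emit the target's own token, so every step yields at least one.
            out.extend(proposal[:i])
            out.append(expected)
            break
    return out[:goal], steps
```

Because every emitted token is the one the target model would have chosen for that prefix, the output is identical to plain greedy decoding with the target; only the number of target passes changes. That is the sense in which the speedup is lossless in this sketch.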


Why this matters

Developers

Provides a practical, high-performance speculative decoding method for faster LLM inference without accuracy loss, with open-source potential.

Businesses

Enables cost-effective scaling of LLM deployments by reducing inference latency and increasing throughput per GPU.

Investors

Highlights innovation in LLM optimization, which could drive demand for hardware and software supporting such techniques.

Students

Demonstrates advanced techniques in speculative decoding and GPU optimization for AI inference.

Everyone

Showcases progress in making AI models faster and more efficient, a key step toward broader accessibility.

Glossary

Speculative Decoding: A technique that speeds up LLM inference by having a cheap drafter propose several tokens ahead, which the target model then verifies in a single forward pass.
Parallel Tree Drafting: A drafting scheme in which candidate continuations are organized as a tree of tokens and produced in parallel rather than one token at a time, so several alternatives can be verified together.
Lossless Inference: Generation whose output matches what the target model would produce with standard decoding (the same output distribution), so speed is gained without changing results.
TPS (Tokens Per Second): A throughput metric indicating how many tokens an LLM generates per second.
CUDA Graph: An NVIDIA CUDA feature that captures a sequence of GPU operations once and replays it as a single launch, reducing per-kernel launch overhead.
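
As a rough illustration of the tree idea from the glossary, the sketch below shows two generic building blocks used by tree-based speculative decoding in general: an ancestor mask that lets each drafted node attend only to its own branch, and a greedy walk that picks the accepted path through a tree of drafted tokens. It is an assumption-laden toy, not JetSpec's algorithm: `toy_target`, the nested-dict tree layout, and the `parents` array are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, List

Tree = Dict[int, "Tree"]  # drafted token -> subtree of continuations


def ancestor_mask(parents: List[int]) -> List[List[bool]]:
    """Build a tree attention mask from a flat parent array.

    parents[i] is the index of node i's parent, or -1 for a top-level node.
    mask[i][j] is True when node j is node i itself or one of its ancestors,
    so each drafted token only sees the tokens on its own branch.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def toy_target(ctx: List[int]) -> int:
    """Toy stand-in for the target model's greedy next-token choice."""
    return (sum(ctx) * 31 + 7) % 50


def verify_tree(
    target: Callable[[List[int]], int], prefix: List[int], tree: Tree
) -> List[int]:
    """Walk the draft tree, following the branch the target agrees with.

    Returns the tokens to emit: every matched draft token plus one final
    token chosen by the target. A real system would obtain all of these
    target predictions from one forward pass using the ancestor mask.
    """
    emitted: List[int] = []
    node = tree
    while True:
        expected = target(prefix + emitted)
        emitted.append(expected)
        if expected not in node:
            return emitted  # no drafted branch matches: stop here
        node = node[expected]  # matched a drafted token: descend


# Hypothetical draft tree: two first-token candidates, one with two children.
draft_tree: Tree = {43: {26: {}, 10: {}}, 5: {}}
accepted = verify_tree(toy_target, [1, 2, 3], draft_tree)
```

Offering several branches raises the chance that one of them matches the target's choice at each depth, which is why a tree can yield more accepted tokens per verification pass than a single drafted sequence.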

AI bias estimate: Neutral technical reporting with no evident bias; source is a research-focused Reddit post. (Automated estimate, not a definitive judgement.)

Sources · 1

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

Summary and analysis generated by AI (mistral). Always verify against the original sources.
