AI Tools · 69% · 1 min read · Jun 28, 2026, 6:35 PM

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

30-second summary

A follow-up to the Ornith-1.0-35B GGUF release grafts a native MTP draft head onto the model for self-speculative decoding in llama.cpp. The reported result is a 1.3-1.35x single-stream decode speedup (172.6 -> 233.8 tok/s), with a next-token distribution identical to target-only decoding (KLD 0.0).

Full story

Follow-up to my previous Ornith-1.0-35B Q3_K_M post.

I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp:

1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s).

Next-token distribution is byte-identical to target-only (KLD 0.0, 32/32).

BF16 KLD 0.073, slightly better than Q4_K_M.

Issue: not bit-exact to target-only over lon
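
The speedup and KLD figures above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not the author's measurement method or llama.cpp's tooling: the function names and the 4-token example distributions are hypothetical, and it assumes KLD means KL(target || candidate) in nats for a single next-token distribution. The post does not say how its KLD values were averaged or what "32/32" counts.

```python
import math


def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """Ratio of speculative-decode throughput to target-only throughput."""
    return speculative_tps / baseline_tps


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability vectors over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps so a zero
    in q does not cause a division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Throughput figures quoted in the post (tok/s).
ratio = speedup(172.6, 233.8)
print(f"speedup: {ratio:.2f}x")

# Hypothetical toy distributions, for illustration only.
target = [0.70, 0.20, 0.05, 0.05]
identical = list(target)               # same distribution as the target
shifted = [0.60, 0.25, 0.10, 0.05]     # slightly perturbed distribution

print(kl_divergence(target, identical))  # 0.0 for identical distributions
print(kl_divergence(target, shifted))    # strictly positive
```

A KLD of exactly 0.0 only occurs when the two distributions match term for term, which is why it is a useful check that the draft head does not alter what the target model would emit at a given step.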

Source: Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1). Read the full piece at the source.


Summary and analysis generated by AI (mistral). Always verify against the original sources.

© 2026 TickrWire. Summaries and analysis are AI-generated and may contain errors.