AI Tools · 1 min read · Jun 28, 2026, 6:35 PM

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)


30-second summary

A follow-up update to the Ornith-1.0-35B GGUF model grafts a native MTP draft head onto the model for self-speculative decoding, reaching a 1.3-1.35x single-stream decode speedup while keeping the next-token distribution identical to target-only decoding.


Full story

Follow-up to my previous Ornith-1.0-35B Q3_K_M post.

I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp:

1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s).

Next-token distribution is byte-identical to target-only (KLD 0.0, 32/32).

KLD vs. BF16: 0.073, slightly better than Q4_K_M.

Issue: not bit-exact to target-only over lon
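As a rough illustration of the two figures quoted above, here is a minimal Python sketch of how a decode speedup ratio and a KL divergence between two next-token distributions can be computed. The function names and the toy distributions are hypothetical, and the post does not say how its KLD numbers were actually measured, so treat this as a sketch of the metrics, not the author's method.

```python
import math

def speedup(baseline_tps: float, spec_tps: float) -> float:
    """Ratio of speculative-decode throughput to target-only throughput."""
    return spec_tps / baseline_tps

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Throughput figures quoted in the post (tok/s).
print(round(speedup(172.6, 233.8), 3))

# Toy 3-token distributions (made up for illustration only).
target = [0.7, 0.2, 0.1]
print(kl_divergence(target, target))           # identical distributions
print(kl_divergence(target, [0.6, 0.3, 0.1]))  # slightly different distributions
```

A KLD of exactly 0.0 means the two distributions match at every token, which is what the post reports for the speculative path versus target-only decoding at the next-token level.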

Source: Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1). Read the full piece at the source.


Summary and analysis generated by AI (mistral). Always verify against the original sources.
