Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)
Ornith-1.0-35B GGUF Performance and Speculative-Decode Updates
A follow-up update to the Ornith-1.0-35B GGUF model introduces native MTP speculative-decode grafting, achieving 1.3-1.35x single-stream decode speed with identical token distribution to the target model.

Follow-up to my previous Ornith-1.0-35B Q3_K_M post.
I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp:
1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s).
Next-token distribution is byte-identical to target-only (KLD 0.0, 32/32).
KLD against BF16 is 0.073, slightly better than Q4_K_M.
Issue: not bit-exact to target-only over lon
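
For readers unfamiliar with the technique, here is a toy sketch of greedy speculative decoding and of the two figures quoted above (the speedup ratio and a zero KLD). It is not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the model body and the MTP draft head, and the token rule inside them is invented purely for illustration. The point it shows is that, under greedy decoding, every emitted token is one the target would have chosen, so the output matches target-only decoding while using fewer target passes.

```python
import math

VOCAB = 50  # hypothetical toy vocabulary size


def target_next(ctx):
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx):
    """Hypothetical draft head: agrees with the target except at some positions."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % VOCAB


def target_only(prompt, n_new):
    """Plain greedy decode: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative(prompt, n_new, k=3):
    """Greedy speculative decode: draft k tokens, then verify them with the target."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft head proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Verification. A real engine would score all k+1 positions in one
        #    batched target pass; here the target is called per position but
        #    counted as a single pass.
        target_passes += 1
        accepted = []
        for tok in draft:
            want = target_next(out + accepted)
            if tok != want:
                accepted.append(want)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # All drafts accepted: the same pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_new], target_passes


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


prompt = [1, 2, 3]
baseline = target_only(prompt, 32)
spec_tokens, passes = speculative(prompt, 32)
speedup = 233.8 / 172.6  # decode rates quoted in the post, tok/s
```

In this toy, `spec_tokens == baseline` holds by construction, and `passes` is smaller than the 32 passes the baseline needs. `kl_divergence(p, p)` is 0.0 for any distribution `p`, which is what a reported KLD of 0.0 means for next-token distributions. With sampling rather than greedy decoding, real implementations typically need an accept/reject rule to preserve the target distribution, which this sketch does not cover.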
Source: Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1). Read the full piece at the source.
Summary and analysis generated by AI (mistral). Always verify against the original sources.

