Top-N-Sigma: Remove unconditional softmax+sort by TimNN · Pull Request #22645 · ggml-org/llama.cpp
A PR in llama.cpp removes a redundant softmax+sort in the Top-N-Sigma sampler, boosting inference speed by 50% on Gemma-4-E4B-Q8_0.

- PR #22645 in llama.cpp removes a redundant softmax+sort in the Top-N-Sigma sampler when it is followed by Dist.
- Benchmarks show a 50% speedup (30t/s → 45t/s) on Gemma-4-E4B-Q8_0 on an M3 Max MacBook Pro.
- Token latency is reduced by ~10ms per token due to the optimization.
- Targets local inference efficiency for quantized models.
- No change to model output; purely a performance optimization.
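The latency figure can be cross-checked against the throughput numbers. The following is a minimal sketch that assumes the rounded 30t/s and 45t/s values quoted above; the helper name `ms_per_token` is illustrative and does not come from the PR.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/s into an average latency in ms/token."""
    return 1000.0 / tokens_per_second


# Rounded figures from the summary (assumed, not re-measured here).
before_ms = ms_per_token(30.0)   # roughly 33.3 ms per token
after_ms = ms_per_token(45.0)    # roughly 22.2 ms per token

saved_ms = before_ms - after_ms  # roughly 11.1 ms per token
speedup = 45.0 / 30.0 - 1.0      # 0.5, i.e. a 50% throughput increase

print(f"saved {saved_ms:.1f} ms/token, speedup {speedup:.0%}")
```

With these rounded inputs the saving comes out at about 11 ms per token, which is in line with the "~10ms" quoted in the summary.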
Pull request #22645 in the ggml-org/llama.cpp repository optimizes the Top-N-Sigma sampler by eliminating an unconditional softmax followed by a sort. That work is unnecessary when the sampler is followed by a distribution sampler (Dist), because the values it computes are discarded. Benchmarks on an M3 Max MacBook Pro show a 50% increase in tokens per second (from ~30t/s to ~45t/s) for the google_gemma-4-E4B-it-Q8_0 model, which corresponds to roughly 10ms less latency per token. The change targets efficiency in local inference scenarios, particularly for quantized models.
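To illustrate why the extra softmax+sort can be dropped, here is a minimal sketch of a top-n-sigma style filter. It is not the PR's code: it assumes the filter keeps logits within `n` standard deviations of the maximum (using the population standard deviation) and masks the rest to negative infinity, and the function names are hypothetical. The point is that the filter needs only linear passes over the logits, and that a later softmax yields the same probabilities whether or not the candidates were sorted first.

```python
import math


def top_n_sigma_mask(logits, n):
    """Mask logits below (max - n * stddev) to -inf; no sort, no softmax."""
    finite = [x for x in logits if x != -math.inf]
    max_logit = max(finite)
    mean = sum(finite) / len(finite)
    std = math.sqrt(sum((x - mean) ** 2 for x in finite) / len(finite))
    threshold = max_logit - n * std
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 5.0, -2.0, 4.5, 0.0]
masked = top_n_sigma_mask(logits, n=1.0)

# Eager variant: sort, then softmax inside the filter (the redundant work).
eager = softmax(sorted(masked, reverse=True))
# Lazy variant: leave the masked logits alone; a later sampler normalizes them.
lazy = softmax(masked)

print(masked)
print(sorted(lazy, reverse=True))
print(eager)
```

Both variants assign the same probabilities to the same surviving tokens, so a downstream sampler that normalizes the logits itself loses nothing when the filter skips this step.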
Source: Top-N-Sigma: Remove unconditional softmax+sort by TimNN · Pull Request #22645 · ggml-org/llama.cpp.
Developers using llama.cpp for local inference can leverage this optimization for faster token generation without altering model behavior.
Businesses deploying local AI models benefit from reduced latency and improved throughput, enhancing user experience.
The change indirectly supports the efficiency narrative in local AI inference, which may influence hardware and software investment trends.
The PR demonstrates a practical optimization technique in AI inference pipelines, which makes it a useful case study for learning.
It also highlights ongoing improvements in open-source AI tooling, making local AI more accessible and performant.
- Top-N-Sigma: A sampling method that keeps only the tokens whose logits lie within N standard deviations of the highest logit and masks out the rest.
- softmax: A function that converts a vector of values into a probability distribution.
- Dist: A distribution sampler in inference pipelines that selects tokens based on a probability distribution.
- tokens per second (t/s): A metric measuring the speed of token generation during inference.
- quantized model (Q8_0): A model whose weights are stored at 8-bit precision to reduce memory usage and improve inference speed.
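To make the softmax and Dist entries concrete, the sketch below draws a token index from a list of logits by normalizing them and walking the cumulative weights. It is a simplified illustration under stated assumptions, not llama.cpp's implementation; `dist_sample` is a hypothetical name, and masked tokens are represented as negative infinity.

```python
import math
import random


def dist_sample(logits, rng):
    """Draw one index with probability proportional to exp(logit)."""
    m = max(logits)
    # exp(-inf) is 0.0, so masked tokens can never be selected.
    weights = [math.exp(x - m) for x in logits]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    # Guard against floating-point rounding: fall back to the last live token.
    return max(i for i, w in enumerate(weights) if w > 0.0)


rng = random.Random(0)
logits = [-math.inf, 5.0, -math.inf, 4.5, -math.inf]
draws = [dist_sample(logits, rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Because the sampler normalizes the logits itself, it does not need its input to be sorted or already converted to probabilities.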
AI bias estimate: Neutral; focuses on technical improvements with clear benchmarks. (Automated estimate, not a definitive judgement.)
Summary and analysis generated by AI (mistral). Always verify against the original sources.

