AI Tools · 71% · 1 min read · Jun 22, 2026, 5:18 PM

Top-N-Sigma: Remove unconditional softmax+sort by TimNN · Pull Request #22645 · ggml-org/llama.cpp

30-second summary

A llama.cpp pull request removes a redundant softmax+sort step from the Top-N-Sigma sampler, raising token generation speed by about 50% on Gemma-4-E4B-Q8_0.

Key takeaways
  • PR #22645 in llama.cpp removes a softmax+sort that the Top-N-Sigma sampler ran unconditionally; the work is redundant when the sampler is followed by Dist.
  • Benchmarks show a 50% speedup (~30 t/s → ~45 t/s) on Gemma-4-E4B-Q8_0 on an M3 Max MacBook Pro.
  • Per-token latency drops by roughly 10 ms (about 33 ms → 22 ms per token at those rates).
  • Targets local inference efficiency for quantized models.
  • No change to model output; purely a performance optimization.
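The throughput and latency figures above are two views of the same measurement. A minimal sketch of the conversion, using only the rounded 30 t/s and 45 t/s figures quoted in the summary (the function name is illustrative, not taken from the PR):

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert throughput (tokens/second) to average latency (ms/token)."""
    return 1000.0 / tokens_per_second


before = ms_per_token(30.0)      # latency before the change
after = ms_per_token(45.0)       # latency after the change
saved = before - after           # time saved on every generated token
speedup = 45.0 / 30.0 - 1.0      # relative throughput gain

print(f"{before:.1f} ms -> {after:.1f} ms per token, "
      f"saved {saved:.1f} ms, speedup {speedup:.0%}")
# prints: 33.3 ms -> 22.2 ms per token, saved 11.1 ms, speedup 50%
```

Because the quoted rates are approximate, the result of about 11 ms is consistent with the "~10 ms" figure.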
Full story

Pull request #22645 in the ggml-org/llama.cpp repository optimizes the Top-N-Sigma sampler by eliminating a softmax followed by a sort that the sampler performed unconditionally. This work is unnecessary when the sampler is followed by a distribution sampler (Dist), as the computed values are discarded. Benchmarks on an M3 Max MacBook Pro show a 50% increase in tokens per second (from ~30 t/s to ~45 t/s) for the google_gemma-4-E4B-it-Q8_0 model, which corresponds to roughly 10 ms less latency per token. The change targets efficiency in local inference scenarios, particularly for quantized models.
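To illustrate why the extra step is avoidable, here is a minimal Python sketch of a Top-N-Sigma filter that only masks logits, followed by a Dist-style stage that applies softmax once and draws a token. It is not the llama.cpp code: the function names, the example logits, and the choice of a population standard deviation over finite logits are assumptions made for illustration.

```python
import math
import random


def top_n_sigma_mask(logits, n):
    """Mask logits below (max - n * std) to -inf. No softmax, no sort."""
    finite = [x for x in logits if x != -math.inf]
    mean = sum(finite) / len(finite)
    # Population standard deviation over finite logits (an assumption here).
    std = math.sqrt(sum((x - mean) ** 2 for x in finite) / len(finite))
    threshold = max(finite) - n * std
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dist_sample(logits, rng):
    """Dist-style final stage: softmax once, then draw one token index."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 9.5, 0.1, 9.0, -3.0, 4.0]   # hypothetical values
masked = top_n_sigma_mask(logits, n=1.0)   # only indices 1 and 3 survive
token = dist_sample(masked, random.Random(0))
print(masked, token)
```

The final stage needs only the masked logits: it computes probabilities itself, and drawing from a distribution does not depend on the order of the candidates. Any softmax or sort done inside the filter is therefore repeated or unused work in this arrangement.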

Source: Top-N-Sigma: Remove unconditional softmax+sort by TimNN · Pull Request #22645 · ggml-org/llama.cpp. Read the full piece at the source.

Why this matters
Developers

Developers using llama.cpp for local inference can leverage this optimization for faster token generation without altering model behavior.

Businesses

Businesses deploying local AI models benefit from reduced latency and improved throughput, enhancing user experience.

Investors

The change indirectly supports the efficiency narrative in local AI inference, which may influence hardware and software investment trends.

Students

The PR demonstrates a practical optimization technique for AI inference pipelines: finding and removing work whose result is never used.

Everyone

The change highlights ongoing improvements in open-source AI tooling, making local AI more accessible and performant.

Glossary
Top-N-Sigma
A sampling method that keeps only the tokens whose logits lie within n standard deviations of the highest logit and discards the rest before a token is drawn.
softmax
A function that converts a vector of values into a probability distribution.
Dist
A distribution sampler, typically the last stage of a sampling pipeline, that draws one token at random according to the probabilities of the remaining candidates.
tokens per second (t/s)
A metric measuring the speed of token generation during inference.
quantized model (Q8_0)
A model whose weights are stored at 8-bit precision (here in llama.cpp's Q8_0 format) to reduce memory usage and improve inference speed.
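As a rough illustration of the last glossary entry, the sketch below quantizes one block of 32 weights to signed 8-bit integers with a single shared scale, which is the general idea behind Q8_0. It is a simplification and not the ggml implementation: the function names are hypothetical, the scale is kept as a Python float rather than a 16-bit float, and Python's `round` handles exact ties differently from C.

```python
BLOCK_SIZE = 32  # weights per block in this sketch


def quantize_block(values):
    """Quantize one block to (scale, int8-range quants) with a shared scale."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0                      # largest value maps to +/-127
    inv = 1.0 / scale if scale else 0.0       # all-zero block stays zero
    quants = [round(v * inv) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate weights from the scale and the 8-bit quants."""
    return [scale * q for q in quants]


weights = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]   # hypothetical weights
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(scale, max_error)
```

With a 16-bit scale per block of 32 eight-bit values, the storage cost works out to (16 + 32 × 8) / 32 = 8.5 bits per weight.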

AI bias estimate: Neutral; focuses on technical improvements with clear benchmarks. (Automated estimate, not a definitive judgement.)

Summary and analysis generated by AI (mistral). Always verify against the original sources.
