LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels
A 230M-parameter LFM2.5 model runs locally in-browser at 1,400 tokens/sec using custom WebGPU kernels, leveraging prior work from Fable 5 and Opus 4.8.

- LiquidAI/LFM2.5-230M runs locally in-browser at 1,400 tokens/sec using custom WebGPU kernels.
- Kernels were adapted from Fable 5 (shut down) and Opus 4.8, enabling efficient on-device inference.
- Demo available on Hugging Face Spaces for public testing.
- Performance achieved on an M4 Max Mac, demonstrating feasibility on consumer hardware.
- Showcases WebGPU as a viable path for high-performance, client-side AI without dedicated GPUs.
A developer demonstrated the LiquidAI/LFM2.5-230M model running entirely in a web browser via custom WebGPU kernels, achieving 1,400 tokens per second on an M4 Max Mac. The implementation builds on kernels originally developed for Fable 5 (before its shutdown) and Opus 4.8, showcasing efficient on-device inference. A Hugging Face Space provides a live demo for testing. The breakthrough highlights the potential of WebGPU for high-performance, client-side AI workloads without requiring dedicated hardware.
Source: LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels.
Demonstrates practical WebGPU-based inference for LLMs, reducing dependency on server-side hardware and enabling edge AI applications.
Opens opportunities for privacy-focused, low-latency AI products that run entirely in-browser, reducing cloud costs.
Highlights advancements in on-device AI, potentially disrupting cloud-based inference markets with more efficient alternatives.
Provides a tangible example of WebGPU's capabilities for AI workloads, useful for learning and experimentation.
Shows that advanced AI can run locally on consumer devices, enhancing privacy and accessibility.
- WebGPU: A modern graphics and compute API for web browsers, enabling GPU acceleration for JavaScript applications.
- Tokens/sec: A metric measuring the speed of a language model's inference, indicating how many tokens it can process per second.
- GGUF: A file format for quantized large language models, optimized for efficient inference on consumer hardware.
- Kernel: A low-level function that performs a specific computation, often optimized for hardware acceleration.
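To make the tokens/sec metric concrete, here is a minimal TypeScript sketch of the arithmetic behind it. The helper names `tokensPerSecond` and `msPerToken` are hypothetical and are not taken from the demo's source; the 700-token, 500 ms figures are illustrative inputs chosen to reproduce the reported 1,400 tok/s, not measurements.

```typescript
// Hypothetical helpers for illustration only; not part of the demo's code.

/** Throughput in tokens per second, from a token count and elapsed milliseconds. */
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (elapsedMs <= 0) {
    throw new RangeError("elapsedMs must be positive");
  }
  return tokenCount / (elapsedMs / 1000);
}

/** Average time spent per token, in milliseconds, at a given throughput. */
function msPerToken(tokensPerSec: number): number {
  if (tokensPerSec <= 0) {
    throw new RangeError("tokensPerSec must be positive");
  }
  return 1000 / tokensPerSec;
}

// Illustrative inputs: 700 tokens in 500 ms corresponds to 1,400 tok/s.
const rate = tokensPerSecond(700, 500);

// At that rate, each token takes roughly 0.71 ms on average.
const budget = msPerToken(rate);
```

In a browser, the elapsed time would typically come from two `performance.now()` readings taken around the generation loop, so the result reflects everything that happens between them, not GPU compute alone.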
AI bias estimate: Neutral technical demonstration; no overt bias detected. (Automated estimate, not a definitive judgement.)
Summary and analysis generated by AI (mistral). Always verify against the original sources.

