AI Research | 1 min read | Jul 3, 2026, 9:18 PM

H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch [P]

30-second summary

A researcher has built H64LM, a 249M-parameter Transformer model from scratch in PyTorch, featuring a mixture-of-experts architecture. The model includes grouped query attention and sparse routing.

Key takeaways

H64LM is a 249M-parameter Transformer model built from scratch in PyTorch
The model features a mixture-of-experts architecture with 8 experts and Top-2 sparse routing
The researcher implemented core components, including attention and normalization, from scratch

Full story

The H64LM model is a research project aimed at understanding the inner workings of modern large language models.

It uses a mixture-of-experts architecture: a router sends each token to 2 of the model's 8 experts (Top-2 sparse routing), so only a fraction of the parameters is active for any given token. The model also uses grouped query attention, in which groups of query heads share key and value heads, reducing the memory and compute spent on keys and values.
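To make the routing idea concrete, here is a minimal sketch of a Top-2 layer over 8 experts in PyTorch. Only the expert count and the Top-2 choice come from the summary above; the class name `Top2MoE`, the layer sizes, the GELU feed-forward experts, and the softmax over the two selected logits are assumptions for illustration and may differ from what H64LM actually does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    """Illustrative sparse MoE layer (hypothetical, not H64LM's code)."""

    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network (assumed shape).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        b, s, d = x.shape
        tokens = x.reshape(-1, d)                      # (n_tokens, d_model)
        logits = self.router(tokens)                   # (n_tokens, num_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        # Renormalize over the selected experts only (an assumed choice).
        weights = F.softmax(top_vals, dim=-1)          # (n_tokens, top_k)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that picked expert e, and in which of their top_k slots.
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert received no tokens in this batch
            gate = weights[token_ids, slot].unsqueeze(-1)
            out[token_ids] += gate * expert(tokens[token_ids])
        return out.reshape(b, s, d), top_idx.reshape(b, s, self.top_k)


layer = Top2MoE()
y, chosen = layer(torch.randn(2, 5, 64))
print(y.shape, chosen.shape)
```

Each token runs through only two expert networks, so per-token compute stays close to that of a dense layer of the same expert size even though the layer holds eight experts' worth of parameters. Real implementations usually add a load-balancing loss so the router does not collapse onto a few experts; that is omitted here.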

The researcher implemented the core components of the model from scratch, including attention, MoE routing, normalization, and the training loop. This approach allows for a deeper understanding of the model's behavior and performance.
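As an illustration of what writing the attention component by hand involves, the following is a minimal sketch of causal grouped query attention in PyTorch. The class name `GroupedQueryAttention`, the head counts (8 query heads sharing 2 key/value heads), the model width, and the absence of positional encoding are all assumptions for the example; the summary does not state H64LM's actual configuration.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    """Illustrative causal GQA block (hypothetical, not H64LM's code)."""

    def __init__(self, d_model=64, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        # Keys and values get fewer heads than queries: this is the saving.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # Each K/V head serves a group of consecutive query heads.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)          # (b, n_heads, s, head_dim)
        v = v.repeat_interleave(group, dim=1)

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Causal mask: position i may not attend to positions after i.
        mask = torch.triu(
            torch.ones(s, s, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.o_proj(out)


attn = GroupedQueryAttention()
print(attn(torch.randn(2, 6, 64)).shape)
```

With these assumed sizes the key and value projections are a quarter the width of the query projection, which is where grouped query attention saves parameters and, at inference time, key/value cache memory.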

The release of H64LM gives researchers and developers a compact, from-scratch model to study and experiment with when exploring how modern language models are put together.

Source: H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch [P]. Read the full piece at the source.

Why this matters

Developers

Provides a from-scratch model to experiment with and improve.

Everyone

Adds a small, from-scratch example to the study of language models.