AI Research · 1 min read · Jul 3, 2026, 11:33 PM

Training transformers where every layer W = V·Uᵀ from initialization reveals a corpus-determined optimal rank - looking for arXiv endorser (cs.LG) [D]

30-second summary

A new experiment called Native Factorized Weights (NFW) trains transformers whose linear layers are factorized as W = V·Uᵀ from initialization. The author reports that the optimal rank is determined by the training corpus. Because the factorization is built into the model, no post-hoc SVD compression or LoRA adapter is needed.

Key takeaways
  • The Native Factorized Weights (NFW) experiment trains transformers with factorized weights from initialization
  • The approach eliminates the need for post-hoc SVD or LoRA adapters
  • Building the factorization into the model removes the separate compression step and could reduce parameter counts
  • The author reports that the optimal rank is determined by the training corpus
Full story

The Native Factorized Weights (NFW) experiment introduces a different way to train transformers. Every linear layer is parameterized as a product of two factors, W = V·Uᵀ, from initialization, so the low-rank structure is learned directly from the data rather than imposed afterwards. The approach, also referred to as 'Sliver layers', could simplify the training pipeline and reduce model size.
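A factorized layer of this kind can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's implementation: the class name `FactorizedLinear`, the shapes of `U` and `V`, the initialization scale, and the example sizes are all assumptions made for this sketch.

```python
import random


class FactorizedLinear:
    """Hypothetical linear layer computing y = V (U^T x), i.e. W = V U^T.

    Assumed shapes: U is d_in x rank, V is d_out x rank.
    The dense matrix W is never stored during the forward pass.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / (d_in ** 0.5)  # assumed init scale, not from the source
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.U = [[rng.gauss(0.0, scale) for _ in range(rank)] for _ in range(d_in)]
        self.V = [[rng.gauss(0.0, scale) for _ in range(rank)] for _ in range(d_out)]

    def forward(self, x):
        # Project the input down to `rank` dimensions: z = U^T x
        z = [sum(self.U[i][k] * x[i] for i in range(self.d_in))
             for k in range(self.rank)]
        # Project back up to d_out dimensions: y = V z
        return [sum(self.V[j][k] * z[k] for k in range(self.rank))
                for j in range(self.d_out)]

    def dense_weight(self):
        # Materialize W = V U^T (d_out x d_in); used here only as a check.
        return [[sum(self.V[j][k] * self.U[i][k] for k in range(self.rank))
                 for i in range(self.d_in)]
                for j in range(self.d_out)]

    def num_parameters(self):
        # rank * (d_in + d_out), versus d_in * d_out for a dense layer
        return self.rank * (self.d_in + self.d_out)


layer = FactorizedLinear(d_in=8, d_out=6, rank=2)
x = [1.0] * 8
y = layer.forward(x)
```

With these example sizes the factorized layer stores 2 × (8 + 6) = 28 parameters, compared with 8 × 6 = 48 for a dense layer of the same shape. The rank of W can never exceed the chosen `rank`, which is why the choice of rank matters.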

The post's headline finding is that the optimal rank appears to be determined by the training corpus. If that holds up, it could inform how the rank of more efficient transformer models is chosen.

The NFW method differs from the usual workflows, in which a standard transformer is trained with dense weights and then either compressed with a truncated SVD or fine-tuned through low-rank LoRA adapters. With NFW, the factorization is part of the model from the first training step, so no dense weight matrix needs to be compressed or adapted afterwards.
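The three setups can be written side by side. This is a sketch in standard notation rather than anything taken from the post: the shapes, the rank r, and the symbols A, B, Σ, W₀, P and Q are assumptions introduced here for illustration.

```latex
% Assumed shapes: W is d_out x d_in, with rank r <= min(d_in, d_out).
\begin{align*}
\text{Post-hoc SVD:}\quad & W \approx A_r \Sigma_r B_r^{\top}
  && \text{(dense $W$ trained first, then truncated to rank $r$)} \\
\text{LoRA:}\quad & W = W_0 + P Q
  && \text{($W_0$ pre-trained and frozen; $P \in \mathbb{R}^{d_{\text{out}} \times r}$,
      $Q \in \mathbb{R}^{r \times d_{\text{in}}}$ trained)} \\
\text{NFW:}\quad & W = V U^{\top}
  && \text{($V \in \mathbb{R}^{d_{\text{out}} \times r}$,
      $U \in \mathbb{R}^{d_{\text{in}} \times r}$ trained from initialization)}
\end{align*}
% Parameters per layer: d_out * d_in (dense) versus r (d_in + d_out) (factorized),
% so the factorized form is smaller whenever r < d_in d_out / (d_in + d_out).
```

In the first two cases a dense weight matrix exists at some point in the pipeline; in the factorized case only the two thin factors are ever stored.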

The author is still looking for an arXiv endorser in cs.LG, and further experiments are needed to establish how well the NFW method holds up across models, corpora, and other areas of machine learning.

Source: Training transformers where every layer W = V·Uᵀ from initialization reveals a corpus-determined optimal rank - looking for arXiv endorser (cs.LG) [D]. Read the full piece at the source.

Why this matters
Developers

NFW offers a way to train transformers that are low-rank from the start, potentially yielding more efficient models without a separate compression step.

Everyone

More efficient transformers could benefit natural language processing and other areas of machine learning.

Glossary
SVD
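Singular Value Decomposition, a factorization of a matrix into two orthogonal factors and a diagonal matrix of singular values; keeping only the largest singular values gives a low-rank approximation of the matrix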
LoRA
Low-Rank Adaptation, a method for adapting pre-trained models to new tasks by training small low-rank update matrices while the original weights stay frozen
TickrWire

AI news intelligence. We aggregate, verify, summarise and explain the latest artificial intelligence news from open, legal sources.

© 2026 TickrWire. Summaries and analysis are AI-generated and may contain errors.