AI Research · 75% · 1 min read · Jul 3, 2026, 11:33 PM

Training transformers where every layer W = V·Uᵀ from initialization reveals a corpus-determined optimal rank - looking for arXiv endorser (cs.LG) [D]

30-second summary

An experiment called Native Factorized Weights (NFW) trains transformers whose weights are factorized from initialization. Every standard linear layer is replaced with W = V·Uᵀ, so no post-hoc SVD compression or LoRA adapter is needed to obtain a low-rank model. The post reports that the optimal rank is determined by the training corpus.


Key takeaways

The Native Factorized Weights (NFW) experiment trains transformers with factorized weights, W = V·Uᵀ, from initialization
The approach removes the need for post-hoc SVD or LoRA adapters
The method could simplify training and improve model efficiency
The post reports an optimal rank that is determined by the training corpus

Full story

The Native Factorized Weights (NFW) experiment introduces a different way to train transformers. Every linear layer is initialized and trained in factorized form, W = V·Uᵀ, so the low-rank structure is learned directly from the data rather than imposed after training. The post also refers to these as 'Sliver layers'. The approach has the potential to simplify training and improve model efficiency.
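To make the W = V·Uᵀ idea concrete, here is a minimal NumPy sketch of a factorized linear layer. It is an illustration only, not the author's implementation: the class name `FactorizedLinear`, the layer sizes, the rank, and the initialization scale are all assumptions chosen for the example.

```python
import numpy as np


class FactorizedLinear:
    """Linear map stored only as the factors V and U, with W = V @ U.T (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed init: scale the factors so entries of W = V @ U.T have variance about 1 / d_in.
        self.U = rng.standard_normal((d_in, rank)) / np.sqrt(d_in)
        self.V = rng.standard_normal((d_out, rank)) / np.sqrt(rank)

    def __call__(self, x):
        # x has shape (batch, d_in). Project down to `rank` dims, then up to d_out.
        # The full d_out x d_in matrix W is never materialized.
        return (x @ self.U) @ self.V.T

    def num_params(self):
        return self.U.size + self.V.size


layer = FactorizedLinear(d_in=512, d_out=512, rank=32)
x = np.ones((4, 512))
y = layer(x)                      # shape (4, 512)

dense_params = 512 * 512          # 262144 parameters for a dense 512 x 512 layer
factored_params = layer.num_params()  # 32 * (512 + 512) = 32768
```

With these example sizes the factorized layer stores 8 times fewer parameters than the dense one. In general the count is r·(d_in + d_out) versus d_in·d_out, so the saving depends entirely on the rank r.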

The post reports that training this way reveals an optimal rank determined by the training corpus. If the finding holds, it could guide the choice of rank when designing more efficient transformer models.

The NFW method differs from traditional approaches, which typically train a standard dense transformer and then either compress it with SVD or adapt it with LoRA adapters. By building factorization in from the outset, the model learns its weights under the low-rank constraint throughout training.
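For contrast, the post-hoc route can be sketched as a truncated SVD applied to an already trained dense matrix. This is a generic illustration, not taken from the source: the random matrix `W` merely stands in for a trained weight, and the shapes and rank are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # stand-in for a trained dense weight (d_out x d_in)
rank = 8                            # assumed target rank for the example

# Post-hoc route: decompose the finished matrix and keep the top `rank` singular directions.
left, s, right_t = np.linalg.svd(W, full_matrices=False)
V = left[:, :rank] * s[:rank]       # (d_out, rank), singular values folded into V
U = right_t[:rank, :].T             # (d_in, rank)
W_lowrank = V @ U.T                 # same W = V @ U.T form, but obtained after training

# Relative Frobenius error introduced by discarding the remaining singular values.
rel_error = np.linalg.norm(W - W_lowrank) / np.linalg.norm(W)
```

Both routes end with a weight of the form V·Uᵀ. The difference is that here the factors are derived from a dense matrix after the fact, at the cost of some approximation error, whereas a natively factorized layer trains V and U directly and never has a dense matrix to approximate.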

Further research and experimentation are needed to fully explore the potential of the NFW method and its applications in various areas of machine learning.

Source: Training transformers where every layer W = V·Uᵀ from initialization reveals a corpus-determined optimal rank - looking for arXiv endorser (cs.LG) [D].

Why this matters

Developers

Offers a way to train transformers with low-rank weights from the start, which could yield more efficient models.

Everyone

Could eventually contribute to more efficient models in natural language processing and other areas of machine learning, though the work is still seeking an arXiv endorser.

Glossary

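SVD: Singular Value Decomposition, a factorization of a matrix into orthogonal factors and singular values; keeping only the largest singular values gives the best low-rank approximation of the matrix
LoRA: Low-Rank Adaptation, a method that adapts a pre-trained model to new tasks by training small low-rank update matrices while the original weights stay frozen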

Sources · 1

Training transformers where every layer W = V·Uᵀ from initialization reveals a corpus-determined optimal rank - looking for arXiv endorser (cs.LG) [D]
