AI Research · 84% · 1 min read · Jul 2, 2026, 5:58 PM

DemoPSD: Disagreement-Modulated Policy Self-Distillation

30-second summary

Researchers propose DemoPSD, a method that improves on-policy self-distillation (OPSD) for LLMs by modulating the disagreement between teacher and student, with the goal of reducing overfitting and privileged information leakage.

Full story

On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason. In OPSD, a single model acts as both the teacher and the student, with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can cause overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization. It also introduces a more fundamental issue, *privileged information leakage*, in which the student encodes answer-dependent shortcuts that …
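For readers new to OPSD, the dense token-level supervision described above can be sketched roughly as follows. This is a generic illustration under stated assumptions, not DemoPSD's method; the summary does not say how DemoPSD modulates disagreement. The assumptions are: the student role sees only the prompt, the teacher role sees the prompt plus privileged information (for example, a reference answer), both roles score the same student-sampled tokens, and the per-token loss is the reverse KL divergence KL(student || teacher). The function names are hypothetical, and treating the per-token divergence as a measure of teacher-student "disagreement" is likewise an assumption.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_reverse_kl(student_logits: Sequence[float],
                     teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at one position of the student's own rollout."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def per_token_divergence(student_seq: Sequence[Sequence[float]],
                         teacher_seq: Sequence[Sequence[float]]) -> List[float]:
    """Per-position divergence; assumed here as one way to quantify disagreement.

    student_seq[t]: logits from the model given only the prompt (student role).
    teacher_seq[t]: logits from the same model given the prompt plus privileged
    information (teacher role), scored on the same student-generated prefix.
    """
    return [token_reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]


def opsd_loss(student_seq: Sequence[Sequence[float]],
              teacher_seq: Sequence[Sequence[float]]) -> float:
    """Mean per-token divergence: dense supervision at every generated token."""
    per_token = per_token_divergence(student_seq, teacher_seq)
    return sum(per_token) / len(per_token)
```

With identical logits, the loss is zero. Positions where the privileged context shifts the teacher's prediction sharply yield large per-token values. Under these assumptions, those are the positions where answer-dependent signal is being pushed into the student.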

Source: DemoPSD: Disagreement-Modulated Policy Self-Distillation. Read the full piece at the source.

Sources · 1

DemoPSD: Disagreement-Modulated Policy Self-Distillation ↗

Summary and analysis generated by AI (mistral). Always verify against the original sources.

TickrWire
