DemoPSD: Disagreement-Modulated Policy Self-Distillation
Researchers propose DemoPSD, a method that aims to improve on-policy self-distillation for LLMs by modulating the disagreement between the model's teacher and student roles, with the goal of reducing overfitting and privileged information leakage.
On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason: a single model acts as both the teacher and the student, with the two roles given different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization. It also introduces a more fundamental issue, *privileged information leakage*, where the student encodes answer-dependent shortcuts that
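To make "dense token-level supervision" concrete, here is a minimal sketch of a generic OPSD-style loss. It is not DemoPSD's method, since the summary does not describe how the disagreement modulation works. The function names (`softmax`, `token_kl`, `opsd_loss`), the toy logits, and the choice of forward KL, KL(teacher || student), are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert one position's logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) at one token position (direction is an assumption)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

def opsd_loss(teacher_logits, student_logits):
    """Mean per-token KL over the positions of one student-sampled response.

    teacher_logits: the same model scored WITH privileged context (hypothetical input)
    student_logits: the same model scored WITHOUT it (hypothetical input)
    """
    kls = [
        token_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(kls) / len(kls)

# Toy example: 2 token positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = opsd_loss(teacher, student)
```

Every position contributes a term to the loss, which is why the supervision is called dense. Where the teacher's privileged context sharpens its distribution (the first position above), the student is pulled toward that distribution even though it never sees the privileged information itself.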
Source: DemoPSD: Disagreement-Modulated Policy Self-Distillation. Read the full piece at the source.
Summary and analysis generated by AI (mistral). Always verify against the original sources.
