by Haytham ElFadeel - [email protected], Stan Peshterliev
2021
Research done while @ Meta Inc.
Knowledge distillation (KD) transfers information from a teacher model (or an ensemble) to a student model. In addition to compressing large teachers into smaller deployable students, KD is frequently used in a same-capacity regime to improve a single model by distilling “dark knowledge” from an ensemble.
This paper formalizes same-size logit distillation with temperature scaling, discusses why naïve same-size KD is often teacher-bounded, and introduces Reliability-Weighted Knowledge Distillation (WKD): a simple per-example weighting scheme that downweights teachers that are incorrect on a given example while renormalizing the correct teachers to preserve logit scale. The method is evaluated in an internal experiment using a Hinton-style distillation objective and shows improved SQuAD v2.0 F1/EM relative to standard KD in this setting.
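As a rough illustration of the weighting idea (a sketch, not the paper's exact formulation), the PyTorch snippet below assigns per-example teacher weights: incorrect teachers receive weight zero and the correct teachers split a total weight of one, so a weighted combination of teacher outputs keeps its scale. The tensor shapes and the uniform fallback when no teacher is correct are assumptions.

```python
import torch

def reliability_weights(teacher_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-example teacher weights: teachers that predict the wrong class get
    weight 0; correct teachers share a total weight of 1 (i.e., 1 / #correct),
    so the weighted combination of teacher outputs preserves its scale.

    teacher_logits: [M, batch, K] logits from M teachers.
    labels:         [batch] gold class indices.
    Returns:        [M, batch] weights. Assumption: fall back to uniform 1/M
                    weights on examples where every teacher is wrong.
    """
    preds = teacher_logits.argmax(dim=-1)                  # [M, batch]
    correct = (preds == labels.unsqueeze(0)).float()       # [M, batch]
    n_correct = correct.sum(dim=0, keepdim=True)           # [1, batch]
    uniform = torch.full_like(correct, 1.0 / correct.shape[0])
    return torch.where(n_correct > 0, correct / n_correct.clamp(min=1.0), uniform)
```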
Supervised classification typically trains with one-hot labels. While effective, one-hot targets provide no graded similarity structure among incorrect classes. KD addresses this by training the student to match a teacher’s full predictive distribution, which implicitly encodes class similarity and uncertainty (“dark knowledge”).
KD is commonly used in two regimes:
(1) compression, where a large teacher (or an ensemble) is distilled into a smaller, deployable student; and
(2) same-size distillation, where the student has the same capacity as the teacher and KD is used to improve accuracy rather than to shrink the model.
This paper focuses on (2): same-size distillation from an ensemble.
Let the student produce logits $z^{(s)} \in \mathbb{R}^K$ over $K$ classes and teachers produce logits $z^{(t,m)}$ for teacher $m \in \{1,\dots,M\}$.
Given logits $z$, the temperature-softened softmax is:
$\displaystyle q_i(T) \;=\; \frac{\exp(z_i / T)}{\sum_{j=1}^{K}\exp(z_j / T)}$.
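For concreteness, a minimal PyTorch sketch of the softened distribution, together with a standard Hinton-style KD term against a single teacher (the $T^2$ factor keeps gradient magnitudes comparable across temperatures); the tensor shapes and the default temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def softened_probs(logits: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    """q_i(T) = exp(z_i / T) / sum_j exp(z_j / T) for logits of shape [batch, K].
    T > 1 flattens the distribution, exposing more of the non-target structure."""
    return F.softmax(logits / T, dim=-1)

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    """KL(softened teacher || softened student), scaled by T^2 as in standard logit KD."""
    log_q_student = F.log_softmax(student_logits / T, dim=-1)
    q_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_q_student, q_teacher, reduction="batchmean") * (T ** 2)
```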