by Haytham ElFadeel - [email protected], Stan Peshterliev

2021

Research done while at Meta Inc.

Abstract

Knowledge distillation (KD) transfers information from a teacher model (or an ensemble of teachers) to a student model. In addition to compressing large teachers into smaller deployable students, KD is frequently used in a same-capacity regime to improve a single model by distilling “dark knowledge” from an ensemble.

This paper formalizes same-size logit distillation with temperature scaling, discusses why naïve same-size KD is often teacher-bounded, and introduces Reliability-Weighted Knowledge Distillation (WKD): a simple per-example weighting scheme that downweights teachers whose predictions are incorrect on a given example and renormalizes the remaining correct teachers to preserve the overall logit scale. The method is evaluated in an internal experiment using a Hinton-style distillation objective and shows improved SQuAD v2.0 F1/EM relative to standard KD in this setting.

1. Introduction

Supervised classifiers are typically trained with one-hot labels. While effective, one-hot targets provide no graded similarity structure among incorrect classes. KD addresses this by training the student to match a teacher’s full predictive distribution, which implicitly encodes class similarity and uncertainty (“dark knowledge”).

1.1 Common KD regimes

KD is commonly used in two regimes:

  1. Compression KD (large → small): improve a smaller student by transferring teacher knowledge.
  2. Same-size KD (same capacity): improve a student of comparable size by distilling from a strong single teacher or an ensemble.

This paper focuses on (2): same-size distillation from an ensemble.


2. Preliminaries: Temperature scaling and logit distillation

Let the student produce logits $z^{(s)} \in \mathbb{R}^K$ over $K$ classes, and let teacher $m \in \{1,\dots,M\}$ produce logits $z^{(t,m)} \in \mathbb{R}^K$.

2.1 Temperature-softened distributions

Given logits $z$ and a temperature $T > 0$, the temperature-softened softmax is:

$\displaystyle q_i(T) \;=\; \frac{\exp(z_i / T)}{\sum_{j=1}^{K}\exp(z_j / T)}$.
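
To make the effect of $T$ concrete, the following minimal sketch computes $q_i(T)$ with NumPy. The logit values and the `softened_softmax` helper are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def softened_softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-softened softmax: q_i(T) = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = logits / T
    z = z - z.max()              # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical teacher logits over K = 3 classes.
z_teacher = np.array([6.0, 2.0, 1.0])
print(softened_softmax(z_teacher, T=1.0))  # sharply peaked, roughly [0.98, 0.02, 0.01]
print(softened_softmax(z_teacher, T=4.0))  # much softer, roughly [0.60, 0.22, 0.17]
```

Higher temperatures flatten the distribution and expose the relative ordering of the non-argmax classes, which is precisely the "dark knowledge" the student is trained to match.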