By Haytham ElFadeel - [email protected]
2018
Label smoothing (LS) is a regularization technique that replaces one-hot training labels with soft targets, typically by mixing the hard label with a uniform distribution. The practical effect is to reduce pathological over-confidence (very low-entropy predictive distributions), which often improves generalization and calibration.
However, uniform smoothing implicitly assumes that all incorrect classes are equally plausible. In sequence problems—especially tasks where the label corresponds to a token position (e.g., extractive QA start/end indices) or a boundary—this assumption is often too coarse: near-miss predictions (off by ±1–2 tokens) are typically more plausible than far-away ones.
Gaussian Label Smoothing (GLS) replaces the uniform “noise” distribution with a localized Gaussian kernel centered at the gold token position, allocating more probability mass to nearby positions than to distant ones.
Consider a classification problem with $K$ classes. Let $y\in\{1,\dots,K\}$ be the gold label and $p_\theta(k\mid x)$ the model distribution.
One-hot target:
$q(k)=\mathbf{1}[k=y]$
Label smoothing with $\varepsilon\in[0,1]$ (uniform prior):
$q_{\text{LS}}(k) = (1-\varepsilon)\mathbf{1}[k=y] + \varepsilon\cdot \frac{1}{K}$
Training minimizes cross-entropy $H(q_{\text{LS}}, p_\theta)$.
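As a concrete illustration (not part of the original note), here is a minimal NumPy sketch of the uniform-smoothed target and the cross-entropy loss; the function names, the default $\varepsilon=0.1$, and the example probabilities are my own assumptions.

```python
import numpy as np

def uniform_ls_target(y, num_classes, eps=0.1):
    """Soft target q_LS(k) = (1 - eps) * 1[k = y] + eps / K."""
    q = np.full(num_classes, eps / num_classes)
    q[y] += 1.0 - eps
    return q

def cross_entropy(q, log_p):
    """H(q, p_theta) given the target q and the model's log-probabilities."""
    return -np.sum(q * log_p)

# Example: K = 5 classes, gold label y = 2, eps = 0.1 (illustrative values)
log_p = np.log(np.array([0.05, 0.10, 0.70, 0.10, 0.05]))
loss = cross_entropy(uniform_ls_target(y=2, num_classes=5), log_p)
```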
A useful view is that label smoothing adds a penalty term to the training objective that discourages extremely peaked predictive distributions, which often improves generalization.
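To make that term explicit (a standard decomposition, not spelled out above): with $u(k)=1/K$ the uniform distribution,

$\displaystyle H(q_{\text{LS}}, p_\theta) = (1-\varepsilon)\,H(\mathbf{1}[k=y], p_\theta) + \varepsilon\,H(u, p_\theta)$

The second term, $-\frac{\varepsilon}{K}\sum_{k=1}^{K}\log p_\theta(k\mid x)$, grows whenever any class receives very low probability, so it pushes the model away from extremely peaked outputs.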
In extractive QA and related tasks, models often predict a categorical distribution over token indices. For a sequence of length $N$, the model outputs $p_\theta(i\mid x)$ for each position $i\in\{1,\dots,N\}$.
Let $i^*$ be the gold index (start or end). Define a discrete Gaussian kernel over positions:
$\displaystyle \large g_\sigma(i\mid i^*) = \frac{\exp\left(-\frac{(i-i^*)^2}{2\sigma^2}\right)}{\sum_{j=1}^{N}\exp\left(-\frac{(j-i^*)^2}{2\sigma^2}\right)}$
Then GLS defines the smoothed target:
$\displaystyle q_{\text{GLS}}(i) = (1-\varepsilon)\mathbf{1}[i=i^*] + \varepsilon\cdot g_\sigma(i\mid i^*)$
The loss is standard cross-entropy:
$\displaystyle \mathcal{L}_{\text{GLS}} = -\sum_{i=1}^{N} q_{\text{GLS}}(i)\log p_\theta(i\mid x)$
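A minimal NumPy sketch of the GLS target and loss under the definitions above; the function names, the defaults $\varepsilon=0.1$ and $\sigma=1.0$, and the example values are illustrative assumptions, not prescriptions from the note.

```python
import numpy as np

def gaussian_kernel(gold_idx, seq_len, sigma=1.0):
    """Discrete Gaussian g_sigma(i | i*) over positions 0..N-1, normalized to sum to 1."""
    positions = np.arange(seq_len)
    weights = np.exp(-((positions - gold_idx) ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()

def gls_target(gold_idx, seq_len, eps=0.1, sigma=1.0):
    """Smoothed target q_GLS = (1 - eps) * one-hot + eps * Gaussian kernel."""
    q = eps * gaussian_kernel(gold_idx, seq_len, sigma)
    q[gold_idx] += 1.0 - eps
    return q

def gls_loss(log_p, gold_idx, eps=0.1, sigma=1.0):
    """Cross-entropy between q_GLS and the model's log-probabilities over positions."""
    q = gls_target(gold_idx, log_p.shape[-1], eps, sigma)
    return -np.sum(q * log_p)

# Example: N = 8 positions, gold index i* = 3, uniform model distribution (illustrative)
log_p = np.log(np.full(8, 1.0 / 8))
loss = gls_loss(log_p, gold_idx=3, eps=0.1, sigma=1.0)
```

In this sketch, larger $\sigma$ spreads the smoothing mass over more neighboring positions, while $\sigma \to 0$ recovers the one-hot target.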