By Haytham ElFadeel - [email protected]
2018
Label smoothing (LS) is a regularization technique that replaces one-hot training labels with soft targets, typically by mixing the hard label with a uniform distribution. The practical effect is to reduce pathological over-confidence (very low-entropy predictive distributions), which often improves generalization and calibration.
However, uniform smoothing implicitly assumes that all incorrect classes are equally plausible. In sequence problems—especially tasks where the label corresponds to a token position (e.g., extractive QA start/end indices) or a boundary—this assumption is often too coarse: near-miss predictions (off by ±1–2 tokens) are typically more plausible than far-away ones.
Gaussian Label Smoothing (GLS) replaces the uniform “noise” distribution with a localized Gaussian kernel centered at the gold token position, allocating more probability mass to nearby positions than to distant ones.
Consider a classification problem with $K$ classes. Let $y\in\{1,\dots,K\}$ be the gold label and $p_\theta(k\mid x)$ the model distribution.
One-hot target:
$q(k)=\mathbf{1}[k=y]$
Label smoothing with $\varepsilon\in[0,1]$ (uniform prior):
$q_{\text{LS}}(k) = (1-\varepsilon)\mathbf{1}[k=y] + \varepsilon\cdot \frac{1}{K}$
Training minimizes cross-entropy $H(q_{\text{LS}}, p_\theta)$.
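As a concrete illustration (not part of the original note), here is a minimal NumPy sketch of the uniform-smoothed target and the cross-entropy loss; the function names, the default $\varepsilon=0.1$, and the example probabilities are my own assumptions.

```python
import numpy as np

def uniform_ls_target(y, num_classes, eps=0.1):
    """Soft target q_LS(k) = (1 - eps) * 1[k = y] + eps / K."""
    q = np.full(num_classes, eps / num_classes)
    q[y] += 1.0 - eps
    return q

def cross_entropy(q, log_p):
    """H(q, p_theta) given the target q and the model's log-probabilities."""
    return -np.sum(q * log_p)

# Example: K = 5 classes, gold label y = 2, eps = 0.1 (illustrative values)
log_p = np.log(np.array([0.05, 0.10, 0.70, 0.10, 0.05]))
loss = cross_entropy(uniform_ls_target(y=2, num_classes=5), log_p)
```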
A useful view is that label smoothing adds a penalty term to the training objective that discourages extremely peaked predictive distributions, which often improves generalization.
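To make that term explicit (a standard decomposition, not spelled out above): with $u(k)=1/K$ the uniform distribution,

$\displaystyle H(q_{\text{LS}}, p_\theta) = (1-\varepsilon)\,H(\mathbf{1}[k=y], p_\theta) + \varepsilon\,H(u, p_\theta)$

The second term, $-\frac{\varepsilon}{K}\sum_{k=1}^{K}\log p_\theta(k\mid x)$, grows whenever any class receives very low probability, so it pushes the model away from extremely peaked outputs.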
In extractive QA and related tasks, models often predict a categorical distribution over token indices. For a sequence of length $N$, the model outputs $p_\theta(i\mid x)$ for each position $i\in\{1,\dots,N\}$.
Let $i^*$ be the gold index (start or end). Define a discrete Gaussian kernel over positions:
$\displaystyle \large g_\sigma(i\mid i^*) = \frac{\exp\left(-\frac{(i-i^*)^2}{2\sigma^2}\right)}{\sum_{j=1}^{N}\exp\left(-\frac{(j-i^*)^2}{2\sigma^2}\right)}$
Then GLS defines the smoothed target:
$\displaystyle q_{\text{GLS}}(i) = (1-\varepsilon)\mathbf{1}[i=i^*] + \varepsilon\cdot g_\sigma(i\mid i^*)$
The loss is standard cross-entropy:
$\displaystyle \mathcal{L}_{\text{GLS}} = -\sum_{i=1}^{N} q_{\text{GLS}}(i)\log p_\theta(i\mid x)$
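A minimal NumPy sketch of the GLS target and loss under the definitions above; the function names, the defaults $\varepsilon=0.1$ and $\sigma=1.0$, and the example values are illustrative assumptions, not prescriptions from the note.

```python
import numpy as np

def gaussian_kernel(gold_idx, seq_len, sigma=1.0):
    """Discrete Gaussian g_sigma(i | i*) over positions 0..N-1, normalized to sum to 1."""
    positions = np.arange(seq_len)
    weights = np.exp(-((positions - gold_idx) ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()

def gls_target(gold_idx, seq_len, eps=0.1, sigma=1.0):
    """Smoothed target q_GLS = (1 - eps) * one-hot + eps * Gaussian kernel."""
    q = eps * gaussian_kernel(gold_idx, seq_len, sigma)
    q[gold_idx] += 1.0 - eps
    return q

def gls_loss(log_p, gold_idx, eps=0.1, sigma=1.0):
    """Cross-entropy between q_GLS and the model's log-probabilities over positions."""
    q = gls_target(gold_idx, log_p.shape[-1], eps, sigma)
    return -np.sum(q * log_p)

# Example: N = 8 positions, gold index i* = 3, uniform model distribution (illustrative)
log_p = np.log(np.full(8, 1.0 / 8))
loss = gls_loss(log_p, gold_idx=3, eps=0.1, sigma=1.0)
```

In this sketch, larger $\sigma$ spreads the smoothing mass over more neighboring positions, while $\sigma \to 0$ recovers the one-hot target.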