Reward is central to reinforcement learning, but it is often hard to define, hard to engineer, and easy to hack.
by Haytham ElFadeel - [email protected]
2025
Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in a proxy reward function to achieve high rewards without genuinely learning or completing the intended task. The core reason reward hacking occurs is the unobservability of the true reward function, which forces us to replace it with a simplified, underdefined proxy.
Mathematically, let $u(x, y)$ be the true (unobservable) utility of a response $y$ to prompt $x$. RL replaces $u$ with a learned proxy $r_\phi(x, y)$. Policy optimization then solves something like:
$\max_\pi \; \mathbb{E}_{y \sim \pi(\cdot|x)}\left[r_\phi(x, y)\right] - \beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$
As soon as optimization pressure becomes strong, Goodhart’s law kicks in: the policy actively searches for regions where $r_\phi$ is high but poorly aligned with $u$. KL regularization slows this down but does not eliminate it.
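In practice, the KL term is usually folded into the reward the policy is trained on. Here is a minimal sketch, assuming per-token log-probabilities from the policy and a frozen reference model are available; the function and argument names are illustrative, not a specific library API:

```python
import torch

def kl_regularized_reward(proxy_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Shape the proxy reward with a KL penalty toward the reference policy.

    proxy_reward:    (batch,) scalar reward r_phi(x, y) per response
    policy_logprobs: (batch, seq) log pi(y_t | x, y_<t) for the sampled tokens
    ref_logprobs:    (batch, seq) log pi_ref(y_t | x, y_<t) for the same tokens
    beta:            KL coefficient; larger values keep pi closer to pi_ref
    """
    # Monte Carlo estimate of KL(pi || pi_ref) for the sampled sequence,
    # summed over tokens.
    kl_per_seq = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Quantity the policy maximizes: E[r_phi] - beta * KL(pi || pi_ref).
    return proxy_reward - beta * kl_per_seq
```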
Reward models are trained on a static dataset of human comparisons. Policy optimization changes the output distribution. This creates a distribution shift problem: the reward model is asked to score samples far outside its training support.
This shift can be adversarial. The policy is not passively drifting; it is actively hunting for higher reward and probing the reward model's blind spots. From the reward model’s perspective, this looks like systematic out-of-distribution exploitation.
This is why reward hacking often appears suddenly rather than gradually: once the policy finds a direction where reward extrapolates incorrectly, it will push hard.
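One common mitigation is to score responses with a small ensemble of reward models and treat disagreement as a sign that the policy has left the training support. The sketch below is only illustrative: the `reward_models` callables and the lower-confidence-bound weighting `k` are assumptions, not a specific published recipe.

```python
import torch

def pessimistic_reward(reward_models, prompt, response, k=2.0):
    """Score a response with an ensemble and penalize disagreement.

    High variance across reward models is treated as a signal that the
    sample lies outside the comparison data they were trained on, which
    is exactly where extrapolation (and therefore hacking) is most likely.
    """
    # Each reward model is assumed to return a scalar score for (prompt, response).
    scores = torch.stack([rm(prompt, response) for rm in reward_models])
    mean, std = scores.mean(), scores.std()
    # Lower-confidence-bound reward: be pessimistic where the models disagree.
    return mean - k * std
```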
In robotics and LLMs alike, more capable agents exploit misspecification more effectively. In simulated environments, researchers have observed phase transitions: as the agent becomes sufficiently competent, true reward collapses while proxy reward continues to rise.
LLMs add an extra twist: the reward interface is often language itself (rubrics, instructions, tool specs). This makes semantic loopholes exploitable in ways that resemble adversarial examples.
A surprisingly effective idea is to bound the reward signal. If reward grows without limit, critics become unstable and policies are incentivized to chase extreme values that often correspond to exploitation.
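A minimal sketch of such bounding, assuming the raw proxy score is an unbounded scalar; the `scale` parameter and the tanh squashing are illustrative choices, and clipping would work similarly:

```python
import torch

def bounded_reward(raw_reward, scale=5.0):
    """Squash an unbounded proxy reward into (-1, 1).

    `scale` sets the region where the shaping stays roughly linear; tanh then
    saturates extreme scores, so the policy gains little from pushing the
    proxy far beyond the range seen during reward-model training, and the
    critic sees targets with a fixed, stable range.
    """
    return torch.tanh(raw_reward / scale)
```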