by Haytham ElFadeel - [email protected]
2025
On Offline RL:
In Parts 1 and 2, we established what offline RL is, its potential benefits, and why it remains largely unused despite these benefits. The challenges are both practical (engineering complexity, reward annotation burden, brittle O2O transitions) and fundamental (quadratic error accumulation over the horizon).
In this final part, we examine recent research that directly addresses some of these obstacles.
The $H^2$ error accumulation in TD learning is perhaps the most fundamental barrier to scaling offline RL. Two recent approaches attack this problem from different angles.
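Where does the quadratic come from? A standard informal argument from approximate dynamic programming (stated loosely here; exact constants differ across analyses) runs as follows: if every Bellman backup incurs at most $\epsilon$ sup-norm error, those errors compound through the TD recursion, and acting greedily on the resulting value estimate costs another horizon factor:

$$\|\hat{Q} - Q^*\|_\infty \;\lesssim\; \frac{\epsilon}{1-\gamma}, \qquad J(\pi^*) - J(\hat{\pi}) \;\lesssim\; \frac{2\gamma\,\epsilon}{(1-\gamma)^2} \;\approx\; 2\,\epsilon H^2,$$

where $\hat{\pi}$ is the policy greedy with respect to the learned $\hat{Q}$ and $H \approx 1/(1-\gamma)$ is the effective horizon: one factor of $H$ comes from error accumulating across the recursive backups, the other from following a policy extracted from an inaccurate value function.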
Park et al. (2025) conducted a systematic study of offline RL scaling with datasets of up to 1B transitions (1000× larger than typical offline RL benchmarks). Their findings are striking: standard offline RL methods (IQL, CRL, SAC+BC) completely fail on complex, long-horizon tasks even with massive datasets. Performance saturates far below optimal, regardless of model size or hyperparameter tuning. Their diagnosis is the horizon itself, and their remedy is to shorten it on two fronts:
Value horizon reduction: Use n-step returns instead of 1-step TD. This reduces the number of recursive updates by a factor of n, directly attacking the $H^2$ term (see the n-step target sketch below).
Policy horizon reduction: Use hierarchical policies that decompose long-horizon goals into shorter subgoal-reaching problems. A high-level policy $\pi^h(w|s,g)$ outputs subgoals; a low-level policy $\pi^\ell(a|s,w)$ executes them. This mirrors the hierarchical designs that have become standard in robotics (see the toy rollout sketch below).
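To make the value-horizon idea concrete, here is a minimal sketch of an n-step TD target. It is not tied to any particular offline RL codebase; the function name, array layout, and the numbers in the usage line are illustrative.

```python
import numpy as np

def n_step_td_target(rewards, bootstrap_value, gamma=0.99, n=5):
    """n-step TD target: sum_{i<n} gamma^i * r_{t+i}  +  gamma^n * Q(s_{t+n}, a_{t+n}).

    rewards         : the n observed rewards r_t, ..., r_{t+n-1}
    bootstrap_value : current value estimate at the state reached after n steps
    """
    rewards = np.asarray(rewards[:n], dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards + gamma ** len(rewards) * bootstrap_value)

# n = 1 recovers the usual 1-step TD target r_t + gamma * Q(s_{t+1}, a_{t+1}).
# With larger n, a length-H trajectory needs roughly H/n bootstrapped updates
# instead of H, which is exactly the "fewer recursive updates" argument above.
target = n_step_td_target([0.0, 0.0, 1.0, 0.0, 0.0], bootstrap_value=0.7)
```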
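And a toy sketch of the policy-horizon idea. The two policies below are placeholder controllers, not learned networks, and the point-mass environment is invented purely to make the control flow of the $\pi^h$ / $\pi^\ell$ decomposition runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def high_level_policy(state, goal):
    # Stand-in for pi^h(w | s, g): propose a subgoal w partway toward the goal.
    return state + 0.2 * (goal - state) + 0.01 * rng.normal(size=state.shape)

def low_level_policy(state, subgoal):
    # Stand-in for pi^l(a | s, w): a crude proportional controller toward the subgoal.
    return np.clip(subgoal - state, -1.0, 1.0)

def hierarchical_rollout(state, goal, env_step, subgoal_every=10, max_steps=200):
    """The high-level policy is queried once every `subgoal_every` steps, so each
    level only ever reasons over a horizon much shorter than the full task."""
    subgoal = state
    for t in range(max_steps):
        if t % subgoal_every == 0:
            subgoal = high_level_policy(state, goal)   # new subgoal every k steps
        action = low_level_policy(state, subgoal)      # short-horizon control toward it
        state = env_step(state, action)
    return state

# Toy usage: a 2-D point mass whose "dynamics" just add a scaled action.
final_state = hierarchical_rollout(
    state=np.zeros(2),
    goal=np.full(2, 5.0),
    env_step=lambda s, a: s + 0.1 * a,
)
```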
Another idea is SHARSA (Scalable Horizon-Aware RSA), which combines: