by Haytham ElFadeel - [email protected]
2025
On Offline RL:
In Parts 1 and 2, we established what offline RL is, its potential benefits, and why it remains largely unused despite these benefits. The challenges are both practical (engineering complexity, reward annotation burden, brittle O2O transitions) and fundamental (quadratic error accumulation over the horizon).
In this final part, we examine recent research that directly addresses some of these obstacles.
The $H^2$ error accumulation in TD learning is perhaps the most fundamental barrier to scaling offline RL. Two recent approaches attack this problem from different angles.
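Where does the quadratic come from? A standard informal argument from approximate dynamic programming (stated loosely here; exact constants differ across analyses) runs as follows: if every Bellman backup incurs at most $\epsilon$ sup-norm error, those errors compound through the TD recursion, and acting greedily on the resulting value estimate costs another horizon factor:

$$\|\hat{Q} - Q^*\|_\infty \;\lesssim\; \frac{\epsilon}{1-\gamma}, \qquad J(\pi^*) - J(\hat{\pi}) \;\lesssim\; \frac{2\gamma\,\epsilon}{(1-\gamma)^2} \;\approx\; 2\,\epsilon H^2,$$

where $\hat{\pi}$ is the policy greedy with respect to the learned $\hat{Q}$ and $H \approx 1/(1-\gamma)$ is the effective horizon: one factor of $H$ comes from error accumulating across the recursive backups, the other from following a policy extracted from an inaccurate value function.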
Park et al. (2025) conducted a systematic study of offline RL scaling with datasets of up to 1B transitions (1000× larger than typical offline RL benchmarks). Their findings are striking: standard offline RL methods (IQL, CRL, SAC+BC) completely fail on complex, long-horizon tasks even with massive datasets. Performance saturates far below optimal, regardless of model size or hyperparameter tuning. Their diagnosis is the horizon itself, and their remedy is to shorten it on two fronts:
Value horizon reduction: Use n-step returns instead of 1-step TD. This reduces the number of recursive updates by a factor of n, directly attacking the $H^2$ term (see the n-step target sketch below).
Policy horizon reduction: Use hierarchical policies that decompose long-horizon goals into shorter subgoal-reaching problems. A high-level policy $\pi^h(w|s,g)$ outputs subgoals; a low-level policy $\pi^\ell(a|s,w)$ executes them. This mirrors the hierarchical designs that have become standard in robotics (see the toy rollout sketch below).
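To make the value-horizon idea concrete, here is a minimal sketch of an n-step TD target. It is not tied to any particular offline RL codebase; the function name, array layout, and the numbers in the usage line are illustrative.

```python
import numpy as np

def n_step_td_target(rewards, bootstrap_value, gamma=0.99, n=5):
    """n-step TD target: sum_{i<n} gamma^i * r_{t+i}  +  gamma^n * Q(s_{t+n}, a_{t+n}).

    rewards         : the n observed rewards r_t, ..., r_{t+n-1}
    bootstrap_value : current value estimate at the state reached after n steps
    """
    rewards = np.asarray(rewards[:n], dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards + gamma ** len(rewards) * bootstrap_value)

# n = 1 recovers the usual 1-step TD target r_t + gamma * Q(s_{t+1}, a_{t+1}).
# With larger n, a length-H trajectory needs roughly H/n bootstrapped updates
# instead of H, which is exactly the "fewer recursive updates" argument above.
target = n_step_td_target([0.0, 0.0, 1.0, 0.0, 0.0], bootstrap_value=0.7)
```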
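And a toy sketch of the policy-horizon idea. The two policies below are placeholder controllers, not learned networks, and the point-mass environment is invented purely to make the control flow of the $\pi^h$ / $\pi^\ell$ decomposition runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def high_level_policy(state, goal):
    # Stand-in for pi^h(w | s, g): propose a subgoal w partway toward the goal.
    return state + 0.2 * (goal - state) + 0.01 * rng.normal(size=state.shape)

def low_level_policy(state, subgoal):
    # Stand-in for pi^l(a | s, w): a crude proportional controller toward the subgoal.
    return np.clip(subgoal - state, -1.0, 1.0)

def hierarchical_rollout(state, goal, env_step, subgoal_every=10, max_steps=200):
    """The high-level policy is queried once every `subgoal_every` steps, so each
    level only ever reasons over a horizon much shorter than the full task."""
    subgoal = state
    for t in range(max_steps):
        if t % subgoal_every == 0:
            subgoal = high_level_policy(state, goal)   # new subgoal every k steps
        action = low_level_policy(state, subgoal)      # short-horizon control toward it
        state = env_step(state, action)
    return state

# Toy usage: a 2-D point mass whose "dynamics" just add a scaled action.
final_state = hierarchical_rollout(
    state=np.zeros(2),
    goal=np.full(2, 5.0),
    env_step=lambda s, a: s + 0.1 * a,
)
```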
Another idea is SHARSA (Scalable Horizon-Aware RSA), which combines: