by Haytham ElFadeel
2025
On Offline RL:
In part 1, we established the benefits of offline RL. But given all those benefits and results, why is offline RL not widely used yet?
Based on my experience in autonomous vehicles and LLMs, and my more limited exposure to robotics, I can argue that offline RL is practically not used: autonomous vehicle companies mostly use BC, GAIL, and online RL; LLMs use BC and online RL; board games (AlphaGo, AlphaZero, MuZero) use model-based RL and MCTS; and robotics (OpenVLA, Pi policies) uses BC and online RL.
In fact, we can argue that most of the current real-world successes of RL have been achieved with variants of on-policy algorithms (e.g., REINFORCE, PPO, GRPO), which require fresh rollouts sampled from the current policy (or a slightly older one) and cannot reuse old data. This is not a problem in settings like board games and LLMs, where we can cheaply generate as many rollouts as we want, but it is a significant limitation in most real-world problems. In autonomous vehicles and robotics, for example, it takes many months of real-world operation to generate the amount of samples used to post-train a language model with RL. Yet offline RL still hasn't been adopted. Why?
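To make the data-requirement difference concrete, below is a minimal, runnable toy sketch. It is not an implementation of any of the methods above: the problem is a hypothetical 1-D Gaussian-policy task, and the "offline" step is a crude filtered-imitation stand-in for what real offline RL methods do. The point is only the data flow: the on-policy loop must sample fresh actions from the current policy every iteration and discard them after each update, while the offline loop only reads a fixed, pre-collected dataset.

```python
# Toy contrast of data flow: on-policy RL vs. training from a fixed dataset.
# Hypothetical 1-D problem; numbers and the "offline" update rule are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
TARGET = 2.0  # reward is higher the closer an action is to this value

def reward(action):
    return -(action - TARGET) ** 2

# --- On-policy (REINFORCE-style): fresh rollouts every iteration ---
mean = 0.0  # mean of a unit-variance Gaussian policy
for _ in range(200):
    actions = mean + rng.normal(0.0, 1.0, size=64)   # sampled from the *current* policy
    rewards = reward(actions)
    advantages = rewards - rewards.mean()
    grad = np.mean(advantages * (actions - mean))    # REINFORCE gradient for the Gaussian mean
    mean += 0.05 * grad
    # `actions` are discarded here; they are stale once `mean` changes
print("on-policy policy mean:", round(mean, 2))

# --- Offline: train only from a fixed logged dataset, no environment interaction ---
logged_actions = rng.normal(1.0, 1.5, size=5000)     # collected once by some behavior policy
logged_rewards = reward(logged_actions)
# Crude stand-in for offline policy improvement: imitate only the best-rewarded logged actions.
top = logged_actions[np.argsort(logged_rewards)[-500:]]
offline_mean = top.mean()
print("offline policy mean:  ", round(offline_mean, 2))
```

Both loops end up with a policy mean near the target, but only the first one needed 200 rounds of fresh interaction; the second reused the same logged data throughout.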
We can group the challenges with offline RL into three main categories:
First, offline RL training algorithms require a significant amount of memory and the simultaneous training of multiple models (e.g., multiple critics, target critics, policies, behavior models/generators, and/or ensembles), and they need at least 2x the wall-clock time compared to BC (a rough sketch of this model inventory follows this list). While this works for small policy models, it makes offline RL hard to adopt for large models. Part of the appeal of GRPO and RLOO is that they forgo the need for a value model, which makes LLM RL training feasible and faster.
Second, offline RL methods often depend on implementation-level details (architecture tweaks, state normalization, entropy removal, action sampling strategy) that are not always part of the algorithm's mathematical core and require more effort to tweak and tune.
Third, RL in general is less well understood and less popular than behavior cloning, and given that over the last 15 years most of the attention has gone to scaling, self-supervised learning, and representation learning, engineers tend to reach for techniques that are better understood.
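The sketch below makes the first point concrete. It assumes a TD3+BC/CQL-style actor-critic setup with hypothetical network sizes; the names and dimensions are illustrative, not taken from any specific codebase. It simply counts the networks a BC trainer keeps in memory versus a typical offline RL trainer: one policy versus an actor, twin critics, their frozen target copies, and (for some methods) a behavior model.

```python
# Sketch of the model inventory: BC vs. a typical offline actor-critic method.
# Hypothetical dimensions; the point is the number of networks held in memory.
import copy
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim = 64, 8  # hypothetical observation/action sizes

# Behavior cloning: a single policy network.
bc_models = {"policy": mlp(obs_dim, act_dim)}

# Offline actor-critic (TD3+BC / CQL-style): actor, twin critics,
# frozen target critics, and (for some methods) a behavior/generator model.
actor = mlp(obs_dim, act_dim)
critic_1 = mlp(obs_dim + act_dim, 1)
critic_2 = mlp(obs_dim + act_dim, 1)
offline_rl_models = {
    "actor": actor,
    "critic_1": critic_1,
    "critic_2": critic_2,
    "target_critic_1": copy.deepcopy(critic_1),  # updated by Polyak averaging, not backprop
    "target_critic_2": copy.deepcopy(critic_2),
    "behavior_model": mlp(obs_dim, act_dim),     # e.g., a generative model of the dataset policy
}

def param_count(models):
    return sum(p.numel() for m in models.values() for p in m.parameters())

print(f"BC parameters in memory:         {param_count(bc_models):,}")
print(f"Offline RL parameters in memory: {param_count(offline_rl_models):,}")
```

For small MLP policies the extra networks are cheap; the overhead only becomes prohibitive when each of those boxes is a multi-billion-parameter model.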
These issues may seem superficial, but they should not be underestimated: teams and companies prefer methods that are fast, stable, and easy to tweak so they can iterate quicker.