by Haytham ElFadeel - [email protected]

2024

On Offline RL:

1. Introduction

Reinforcement learning (RL) is a subset of machine learning that deals with sequential decision-making. The key difference from supervised learning lies in the data assumptions: supervised learning assumes data is i.i.d. (independent and identically distributed), meaning samples are independent of one another. In sequential decision-making, however, each action influences subsequent samples—creating temporal dependencies and non-stationary data distributions that require fundamentally different algorithmic approaches.

The RL framework formalizes this as an agent interacting with an environment, receiving observations and rewards, and learning a policy that maximizes cumulative reward over time. This formulation naturally captures problems where actions have long-term consequences: robotics control, game playing, resource allocation, and recommendation systems.

The RL landscape encompasses numerous paradigms, categories, and algorithms, shaped by factors like data availability, simulation cost, and whether a dynamics model or reward signal is accessible. While most RL successes to date have been online and on-policy (more on that in part 2), this blog series focuses on offline RL—an interesting paradigm when real-world interaction is expensive, dangerous, or unavailable (e.g. autonomous vehicles).

2. What is Offline RL?

In standard (online) RL, an agent learns by interacting with an environment: it takes actions, observes rewards, and updates its policy through this feedback loop. Formally, at each timestep the agent is in state $s_t$, takes action $a_t$, receives reward $r_t$, and transitions to state $s_{t+1}$. The objective is to maximize expected discounted return:

$\displaystyle J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$
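
To make the objective concrete, here is a minimal sketch of the online interaction loop and the discounted return it optimizes. The Gym-style `reset()`/`step()` interface and the `policy` callable are illustrative assumptions, not part of any specific library setup.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one trajectory -- the quantity J(pi) averages."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rollout(env, policy, gamma=0.99, max_steps=1000):
    """Collect one episode with `policy` and return its discounted return.

    `env` is assumed to expose a Gym-style reset()/step(), and `policy` maps
    a state to an action. Both are placeholders for illustration.
    """
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break
    return discounted_return(rewards, gamma)
```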

This loop works well, but in real-world applications we often can't afford the luxury of extensive exploration: you can't let a robot randomly experiment with surgical procedures, or allow a self-driving car to learn collision avoidance through actual collisions.

The natural alternative is simulation—but simulation has its own drawbacks: it can be computationally expensive, struggles to capture complex real-world dynamics, and still leaves sim-to-real domain adaptation as an open problem.

So practitioners often turn to behavior cloning (BC), treating the decision-making problem as supervised learning: given state $s$, predict the action $a$ that an expert would take. This is simple and scales well—which is why most robotics and autonomous driving companies relied on it for years. But BC ignores the sequential nature of the problem: errors compound over time as the agent drifts into states the expert never visited, a phenomenon known as distribution shift or covariate shift.
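
For reference, BC in code is just supervised regression from states to expert actions. Here is a sketch in PyTorch; the network shape, dimensions, and MSE loss are illustrative choices (assuming continuous actions), not a prescription.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; a real dataset would define these.
STATE_DIM, ACTION_DIM = 8, 2

# BC reduces control to supervised learning: regress expert actions from states.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)


def bc_update(states, expert_actions):
    """One supervised step: minimize MSE between predicted and expert actions."""
    loss = nn.functional.mse_loss(policy(states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training reduces to drawing minibatches of logged (state, action) pairs and calling `bc_update`; rewards and transitions never enter the loss, which is exactly why compounding errors go unchecked.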

Offline RL offers a middle ground. Like BC, it learns entirely from a fixed dataset of previously collected experience—no environment interaction required. But unlike BC, it retains the RL objective: maximize cumulative reward rather than simply mimicking demonstrations. This allows offline RL to potentially improve upon the behavior policy that generated the data, rather than just imitating it.

Note: The behavior policy $\pi_\beta$ is whatever policy (or mix of policies) generated the dataset. It could be a human expert, a scripted controller, an earlier learned policy, or a combination of all three.
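
To see mechanically what "the RL objective on a fixed dataset" means, here is a naive sketch of an off-policy TD update trained purely on logged transitions (roughly Q-learning with a target network, for discrete actions). The dimensions, tensor names, and batch format are assumptions for illustration; the point is that nothing in this loop ever calls the environment.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 8, 4, 0.99  # illustrative sizes

# Q-network trained purely from logged transitions; no env.step() anywhere.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, ACTION_DIM))
target_q = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, ACTION_DIM))
target_q.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)


def offline_td_update(batch):
    """One TD step on a batch sampled from the fixed dataset.

    `batch` is a dict of tensors: states, actions (long), rewards,
    next_states, dones -- all collected by the behavior policy.
    """
    q = q_net(batch["states"]).gather(1, batch["actions"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q(batch["next_states"]).max(dim=1).values
        target = batch["rewards"] + GAMMA * (1.0 - batch["dones"]) * next_q
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The contrast with BC is the loss: a Bellman error aimed at maximizing return rather than an imitation loss, which is what lets the learned policy in principle exceed the behavior policy that logged the data.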

3. Why Offline RL and why not BC

Behavior cloning is simple and often effective when the data is high-quality and near-expert. But real-world data is rarely so clean. Offline RL offers two main advantages: