by Haytham ElFadeel - [email protected]

2025

Abstract

Group Relative Policy Optimization (GRPO) and REINFORCE Leave-One-Out (RLOO) have emerged as effective and memory-efficient alternatives to Proximal Policy Optimization (PPO) for reinforcement learning in large language models. By eliminating the need for a learned value function and instead computing advantages relative to a group of sampled responses, GRPO significantly reduces computational overhead. However, we identify four systematic issues in GRPO's formulation that can degrade training stability and generalization: (1) a response-level length bias introduced by per-response token normalization, (2) a question-level difficulty bias arising from per-prompt standard deviation normalization, (3) a difficulty-dependent sampling bias inherent to group-relative advantage estimation under finite rollout budgets, and (4) instability from unbounded advantage scaling in sparse-reward settings. To address these issues, we introduce Adaptive Policy Optimization (APO), an unbiased and adaptive optimization method that corrects all four. Empirical results on mathematical reasoning benchmarks indicate that APO improves training stability and token efficiency, and raises average Pass@1 over GRPO by up to 8%.

1. Introduction

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for training reasoning-oriented LLMs. Earlier approaches relied on Proximal Policy Optimization (PPO; Schulman et al., 2017), which requires learning a separate value function to estimate per-token advantages—a significant computational and memory burden at the scale of modern LLMs. Group Relative Policy Optimization (GRPO; Shao et al., 2024) addresses this by sampling a group of G responses per prompt and computing advantages relative to the group's mean reward, entirely removing the critic network. This design has proven effective in practice, particularly for mathematical reasoning tasks such as those targeted by the DeepSeek series of models.
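
As a concrete reference point, the short sketch below (our own NumPy illustration, not code from the paper) computes GRPO-style group-relative advantages for a single prompt: each response's reward is centered by the group mean and scaled by the group standard deviation. The function name and the small epsilon are ours.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt (illustrative sketch).

    rewards: scalar rewards of the G sampled responses o_1, ..., o_G.
    Each advantage is the reward centered by the group mean and scaled
    by the group standard deviation, as in GRPO's normalization.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 4 rollouts with binary (verifiable) rewards.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1., -1., -1.,  1.]
```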

Despite its empirical success, we argue that GRPO's formulation introduces several systematic biases that become increasingly problematic as models and training regimes scale. These biases manifest as observable training artifacts including length distortion, uneven learning across difficulty levels, and training instability in sparse-reward settings.

We identify four distinct issues. First, GRPO normalizes each response's token-level losses by the response's own length $|o_i|$, creating a coupling between sequence length and effective gradient magnitude. This favors brevity for correct responses and inadvertently reduces penalties for verbose incorrect responses. Second, GRPO normalizes advantages by the within-prompt reward standard deviation, which causes prompts with low reward variance (very easy or very hard questions) to receive disproportionately large update weights. Third, even without standard deviation normalization, group-relative advantage estimation under finite sampling is inherently biased as a function of prompt difficulty: the algorithm preferentially learns from mid-difficulty prompts while systematically under-learning from hard prompts. We formalize this in Proposition 1, showing that the expected positive advantage signal scales as $O(p)$ for small success probability $p$. Fourth, the linear dependence of policy gradients on advantage magnitude, combined with sparse or binary rewards and per-prompt reweighting, leads to heavy-tailed gradient distributions that can cause abrupt policy shifts.
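
The finite-sampling effect behind the third issue can be checked directly. The simulation below is our own illustration (the group size G, trial count, and function name are ours): it draws Bernoulli(p) rewards for a group of G rollouts and measures the expected sum of positive mean-baseline advantages per prompt, with no standard deviation normalization so the effect is separate from Issue 2. For small p this positive learning signal shrinks roughly linearly in p, matching the O(p) scaling stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_advantage_mass(p, G=8, trials=200_000):
    """Average summed positive advantage per prompt.

    Rewards are Bernoulli(p); advantages are mean-baseline
    (r_i - group mean), i.e. no std normalization, so the effect
    shown here is independent of Issue 2.
    """
    rewards = rng.binomial(1, p, size=(trials, G)).astype(np.float64)
    adv = rewards - rewards.mean(axis=1, keepdims=True)
    return adv.clip(min=0.0).sum(axis=1).mean()

for p in [0.5, 0.2, 0.05, 0.01]:
    print(f"p={p:4.2f}  positive advantage mass ≈ {positive_advantage_mass(p):.3f}")
# In this Bernoulli setting the measured mass tracks (G - 1) * p * (1 - p),
# so hard prompts (small p) contribute vanishingly little positive signal.
```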

Concurrent work by Liu et al. (2025) identifies the length and standard deviation normalization issues (our Issues 1 and 2) and proposes Dr. GRPO, which removes these normalizations. However, Issues 3 and 4 remain unaddressed. As we argue, these issues are not independent: the finite-sampling difficulty bias (Issue 3) persists even after removing variance normalization, and the instability from unbounded advantages (Issue 4) can be exacerbated by difficulty corrections applied to address Issue 3.

To address all four issues in a unified framework, we propose Adaptive Policy Optimization (APO), a method built on three principles: (1) length-invariant gradient aggregation, which replaces per-response normalization with a fixed token budget constant; (2) a history-aware capability anchor that tracks the model's evolving success rate and applies signed, difficulty-aware corrections to advantage weights; and (3) a bounded, asymmetric soft gate that replaces hard ratio clipping with a smooth sigmoid-based trust region, providing robust gradient attenuation for off-policy samples without hard discontinuities.
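
The precise definitions of APO's components are given later in the paper; the PyTorch fragment below is only our schematic reading of principles (1) and (3). All constants (`TOKEN_BUDGET`, `EPS_LOW`, `EPS_HIGH`, `TAU`) and the particular sigmoid gate are illustrative assumptions, not APO's actual formulation, and principle (2), the history-aware capability anchor, is not shown.

```python
import torch

# Illustrative constants; APO's actual values and functional forms may differ.
TOKEN_BUDGET = 1024           # fixed normalization constant (principle 1)
EPS_LOW, EPS_HIGH = 0.2, 0.3  # asymmetric soft trust-region bounds (principle 3)
TAU = 0.05                    # smoothness of the sigmoid gate

def apo_style_loss(token_logps, old_token_logps, advantages):
    """Schematic per-prompt policy loss (our sketch, not APO's definition).

    token_logps / old_token_logps: lists of 1-D tensors, one per response,
    with current- and behaviour-policy log-probs of each generated token.
    advantages: tensor of shape (G,), one corrected advantage per response.
    """
    total = token_logps[0].new_zeros(())
    for logp, old_logp, adv in zip(token_logps, old_token_logps, advantages):
        ratio = (logp - old_logp).exp()
        # Smooth, asymmetric gate: decays toward 0 as the ratio leaves
        # [1 - EPS_LOW, 1 + EPS_HIGH]; detached so it acts purely as an
        # attenuation weight on off-policy tokens, with no hard cut-off.
        gate = (torch.sigmoid((ratio - (1.0 - EPS_LOW)) / TAU)
                * torch.sigmoid(((1.0 + EPS_HIGH) - ratio) / TAU)).detach()
        total = total + (gate * ratio * adv).sum()
    # Divide by a fixed token budget rather than each response's own length,
    # so the gradient scale does not depend on |o_i| (principle 1).
    return -total / TOKEN_BUDGET
```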

APO retains the core computational advantages of GRPO—no learned value function, group-based advantage estimation—while correcting the identified biases and improving training stability. Table 1 summarizes the issues addressed by each method.

Table 1. Comparison of issues addressed by PPO, GRPO, Dr. GRPO, and APO.

Issue                                      PPO   GRPO   Dr. GRPO   APO
No critic required                         ✗     ✓      ✓          ✓
Length bias (§3.1)                         ✓     ✗      ✓          ✓
Std normalization bias (§3.2)              N/A   ✗      ✓          ✓
Finite-sample difficulty bias (§3.3)       ✓     ✗      ✗          ✓
Unbounded advantage instability (§3.4)     ✗     ✗      ✗          ✓
Smooth trust region                        ✗     ✗      ✗          ✓

2. Background

2.1 Prior Work

Foundations. Policy gradient methods for reinforcement learning trace back to REINFORCE (Williams, 1992), which provides unbiased but high-variance gradient estimates. Trust Region Policy Optimization (TRPO; Schulman et al., 2015) introduced constrained optimization to stabilize updates, and Proximal Policy Optimization (PPO; Schulman et al., 2017) simplified this via a clipped surrogate objective that has become the dominant approach for RLHF in large language models. The effectiveness of RL-based post-training with verifiable rewards has been demonstrated at scale by DeepSeek-R1 (Guo et al., 2025a), which achieved significant reasoning improvements across mathematics, coding, and question-answering benchmarks.

Efficient policy optimization without value models. A key limitation of PPO is the requirement for a learned value function, which introduces substantial memory and compute overhead at LLM scale. GRPO (Shao et al., 2024; Guo et al., 2025a) addresses this by estimating advantages relative to a group of sampled responses, eliminating the critic entirely while maintaining strong performance. GPG (Chu et al., 2025) further simplifies the optimization pipeline by removing surrogate losses, critics, and KL constraints. RLOO (Ahmadian et al., 2024) takes a related approach, using leave-one-out baselines within a REINFORCE framework to reduce variance without a value function.
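
For reference, RLOO's leave-one-out baseline is simple to state: each response's reward is baselined by the mean reward of the other G-1 responses in its group. The snippet below is our own minimal illustration (the function name and example values are ours).

```python
import numpy as np

def rloo_advantages(rewards):
    """Leave-one-out advantages: A_i = r_i - mean of the other G-1 rewards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    G = rewards.size
    return rewards - (rewards.sum() - rewards) / (G - 1)

# Same binary-reward example as before, G = 4.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 0.67, -0.67, -0.67,  0.67]
```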

Bias and variance corrections. As group-based and critic-free methods have gained adoption, several works have identified specific failure modes. Dr. GRPO (Liu et al., 2025) identifies and mitigates length bias. DAPO (Yu et al., 2025) employs dynamic sampling strategies to improve rollout quality. OPO (Hao et al., 2025) derives an optimal baseline for group-based estimators to reduce gradient variance. Ahmadian et al. (2024) argue that variance is not a significant concern in LLM RL; we revisit this claim and show that variance remains problematic under limited rollout budgets and long-horizon generation (Section 3.4).

Despite this rapid progress, the systematic biases arising from the core design choices in group-based estimators—including length-dependent normalization, difficulty-dependent signal attenuation under finite sampling, and unbounded advantage scaling—remain largely uncharacterized. Existing corrections address individual symptoms in isolation. In this work, we provide a unified analysis of these interacting failure modes and propose a method that addresses them jointly.

2.2 Preliminaries