by Haytham ElFadeel - [email protected]
2025
Reinforcement learning from verifiable rewards (RLVR) has emerged as a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, outcome reward models (ORMs) provide only a single binary signal at the end of a complete reasoning trajectory, forcing all intermediate steps and tokens to share identical credit or blame regardless of their individual contribution. This sparse reward structure creates a fundamental credit assignment problem: the gradient signal is either discounted into oblivion for early tokens or diluted uniformly across all tokens, obscuring which actions actually mattered. Process reward models (PRMs) attempt to address this by providing per-step feedback, but existing approaches rely either on expensive human annotations that do not scale, or on Monte Carlo (MC) estimation methods that conflate future outcome potential with current-step correctness, leading to noisy and often misleading supervision signals.
We propose a hierarchical credit assignment framework that operates at three levels of granularity. At the step level, we introduce a progress reward that measures the change in probability of reaching a correct answer before and after each reasoning step, as evaluated by an independent prover policy. Unlike traditional PRMs that assess step correctness in isolation, this progress signal is grounded in actual outcome probabilities and captures whether a step made the problem more solvable. At the token level, we introduce an entropy-weighted advantage that uses the entropy of the policy's output distribution at each token position as a multiplicative modulator on the step-level advantage. The combined effective reward integrates outcome, progress, and token-level signals into a single coherent framework. We present the complete formulation, data collection procedures, model training methodology, and reinforcement learning integration. Experimental results are forthcoming.
Large language models trained with reinforcement learning have achieved remarkable results across mathematical reasoning, code generation, and other structured problem-solving domains (Ouyang et al., 2022; Lightman et al., 2023; Shao et al., 2024). A central paradigm in this line of work is reinforcement learning from verifiable rewards (RLVR), where the model generates a complete chain-of-thought response and receives a binary outcome reward indicating whether the final answer is correct. This approach has proven effective when combined with algorithms such as GRPO (Shao et al., 2024), RLOO, or standard PPO. However, outcome reward models impose a fundamental limitation: they provide no information about which intermediate actions contributed to or detracted from the final result. Consider an LLM generating a 1,000-token chain-of-thought response to a mathematics problem. The model receives a single binary signal at the end. Standard temporal-difference methods face a dilemma when propagating this terminal reward backward through the trajectory.
The Discount Dilemma. With a typical discount factor (e.g., γ = 0.99), the effective credit arriving at the first token is γ⁹⁹⁹ ≈ 4.4 × 10⁻⁵ of the terminal reward. Early decisions that establish the entire reasoning strategy receive negligible gradient signal. The reward effectively vanishes before reaching the tokens that matter most.
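As a sanity check, this vanishing credit can be computed directly; the short Python illustration below simply instantiates the 1,000-token running example with γ = 0.99.

    # Discount dilemma: credit reaching each token when a terminal reward of 1.0
    # is propagated backward with discount factor gamma over a T-token trajectory.
    gamma, T = 0.99, 1000
    credit_first_token = gamma ** (T - 1)   # ~4.4e-5: the signal has all but vanished
    credit_last_token = gamma ** 0          # 1.0: only tokens near the end see meaningful credit
    print(credit_first_token, credit_last_token)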
The Dilution Dilemma. The natural response is to set γ = 1, which is exactly what modern LLM RL methods do (GRPO, RLOO, and similar algorithms typically use undiscounted returns). This preserves signal magnitude but introduces reward dilution: by indiscriminately assigning identical credit to all 1,000 tokens for a single binary outcome, we obscure the causal link between early decisions and the final result. Every token—filler words, formatting characters, genuinely critical reasoning steps—receives the same advantage estimate. The result is high-variance gradient estimates that wash out the signal from the few tokens that actually mattered.
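To make the dilution concrete, the following Python sketch (the function and variable names are ours, not taken from any particular library) shows how a GRPO-style group-normalized, undiscounted outcome reward is broadcast unchanged to every token of a rollout:

    import numpy as np

    def uniform_token_advantages(outcome_rewards, token_counts):
        # GRPO-style sketch: normalize the binary outcome rewards within the group,
        # then copy each rollout's single scalar advantage to every one of its tokens.
        r = np.asarray(outcome_rewards, dtype=float)
        adv = (r - r.mean()) / (r.std() + 1e-8)
        return [np.full(n, a) for a, n in zip(adv, token_counts)]

    # Four ~1,000-token rollouts, one correct: every token of the correct rollout
    # shares one identical positive advantage, whether or not it actually mattered.
    per_token = uniform_token_advantages([1, 0, 0, 0], [1000, 998, 1002, 995])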
This is the core tension in LLM credit assignment: discount too aggressively and the signal vanishes; discount too little and the signal gets diluted across irrelevant actions. Neither extreme solves the credit assignment problem. What is needed are methods that can identify which actions mattered, not merely propagate a blanket signal backward.
Process Reward Models (PRMs) attempt to provide finer-grained supervision by scoring each reasoning step independently. However, existing PRMs face two fundamental challenges. First, the labeling problem: human-annotated PRMs (Lightman et al., 2023) are expensive and do not scale, while automated methods based on Monte Carlo estimation (Wang et al., 2024) introduce systematic biases we discuss in Section 2. Second, and more fundamentally, step correctness is not the right signal for credit assignment. A step can be perfectly correct yet make zero progress toward the answer (e.g., restating the problem), and a step can appear unconventional yet represent the key insight that unlocks the solution. What matters for RL training is not whether a step is correct in isolation, but whether it brought the model closer to solving the problem.
We propose a hierarchical credit assignment framework that addresses the limitations of both ORMs and traditional PRMs through three complementary mechanisms. First, we introduce a progress reward that measures the change in success probability before and after each step under an independent prover policy, providing a grounded, automated, and scalable per-step signal. Second, we propose an entropy-weighted token advantage that modulates the step-level signal at the individual token level based on decision entropy, directing more credit to tokens where the model made genuine choices. Third, we provide a unified effective reward formulation that integrates outcome, progress, and token-level signals, along with complete algorithms for data collection, model training, and RL integration.
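For concreteness, the Python sketch below illustrates the two finer-grained signals. The prover interface (prover.solves), the number of prover rollouts, and the plain multiplicative entropy weighting are illustrative assumptions on our part, not the final formulation developed in the rest of the paper.

    import math

    def success_probability(prover, problem, prefix_steps, n_rollouts=8):
        # Monte Carlo estimate of P(correct | prefix) under an independent prover
        # policy that completes the solution from the given step prefix.
        wins = sum(prover.solves(problem, prefix_steps) for _ in range(n_rollouts))
        return wins / n_rollouts

    def progress_reward(prover, problem, steps, t):
        # Progress of step t: change in the prover's success probability
        # before vs. after the step is appended to the prefix.
        return (success_probability(prover, problem, steps[:t + 1])
                - success_probability(prover, problem, steps[:t]))

    def entropy_weighted_advantages(step_advantage, token_distributions):
        # Modulate the step-level advantage per token by the policy's output entropy
        # at that position: tokens where the model faced a genuine choice get more credit.
        out = []
        for dist in token_distributions:   # dist maps candidate tokens to probabilities
            entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
            out.append(step_advantage * entropy)
        return out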
Outcome reward models assign a single scalar score to an entire reasoning trajectory based on whether the final answer is correct (Cobbe et al., 2021). Given a mathematical problem p and a solution s, the ORM is trained with a binary cross-entropy loss to predict answer correctness. ORMs are straightforward to train since labels can be obtained automatically by comparing the model's final answer against a ground truth. This has made them the dominant approach in RLVR pipelines for mathematical reasoning.
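A minimal PyTorch-style sketch of this objective; the scorer model mapping a tokenized (p, s) pair to a single logit is assumed here, not specified above.

    import torch.nn.functional as F

    def orm_loss(orm_model, problem_solution_tokens, answer_is_correct):
        # Trajectory-level binary cross-entropy: one logit per (problem, solution) pair,
        # with the label obtained automatically by comparing the final answer to ground truth.
        logit = orm_model(problem_solution_tokens)      # shape: (batch,)
        return F.binary_cross_entropy_with_logits(logit, answer_is_correct.float())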
However, ORMs suffer from a fundamental limitation: they provide only trajectory-level feedback. All steps in a correct trajectory receive identical positive signal, and all steps in an incorrect trajectory receive identical negative signal. When used as rewards in RL, this sparse signal leads to the credit assignment challenges described above. With enough rollouts per question, statistical averaging across trajectories can partially resolve step-level differences—if step 3 is consistently the point of failure across many incorrect rollouts, the averaged advantage for step 3 will eventually be lower than for other steps. But this requires a large number of rollouts per question to achieve reasonable variance reduction, making ORM-based RL sample-inefficient.
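A toy simulation makes this sample-inefficiency point concrete (the step success probabilities and rollout counts below are illustrative assumptions, not measurements):

    import numpy as np

    # Toy model: a "bad" step drops the chance of eventually answering correctly
    # from 0.6 to 0.4. How often does trajectory-level averaging rank it below
    # the good step, as a function of the number of rollouts per question?
    rng = np.random.default_rng(0)
    p_good, p_bad = 0.6, 0.4
    for n_rollouts in (8, 64, 512):
        good = rng.binomial(1, p_good, size=(1000, n_rollouts)).mean(axis=1)
        bad = rng.binomial(1, p_bad, size=(1000, n_rollouts)).mean(axis=1)
        # Fraction of simulated questions where averaging correctly ranks the two steps.
        print(n_rollouts, (good > bad).mean())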