by Haytham ElFadeel - [email protected]

2024

Reinforcement Learning from Human Feedback (RLHF) is the dominant paradigm for aligning large language models to human preferences and values. At its core lies a simple idea: learn what humans prefer by having them compare outputs, then train a model to maximize those preferences. But beneath this simple idea is a fundamental problem: whose preferences are we actually capturing?

The standard approach assumes all humans share a single underlying reward function. This assumption is mathematically convenient but empirically false. When we aggregate preferences from diverse annotators, we don't get a representative model—we get a compromised one that may satisfy no one.

The Bradley-Terry-Luce Model

Most RLHF systems use the Bradley-Terry-Luce (BTL) model to learn reward functions from pairwise comparisons. Given two responses $a_w$ (winner) and $a_l$ (loser) to a prompt $x$, BTL models the probability that an annotator prefers $a_w$ as:

$\displaystyle \large P(a_w \succ a_l | x) = \sigma(r^*(x, a_w) - r^*(x, a_l)) = \frac{e^{r^*(x, a_w)}}{e^{r^*(x, a_w)} + e^{r^*(x, a_l)}}$

where $\sigma$ is the logistic function and $r^*$ is a latent reward function. The model is trained via maximum likelihood estimation on a dataset of human-labeled preferences.
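
To make the training step concrete, here is a minimal PyTorch sketch of the BTL maximum-likelihood objective. The `reward_model` and the argument names (`prompt_ids`, `chosen_ids`, `rejected_ids`) are placeholders of my own, not any particular library's API; the model stands in for whatever network scores a (prompt, response) pair.

```python
import torch.nn.functional as F

def btl_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Negative log-likelihood of pairwise preferences under the BTL model."""
    # reward_model(prompt, response) is assumed to return one scalar reward
    # per example, i.e. a tensor of shape [batch].
    r_w = reward_model(prompt_ids, chosen_ids)    # r(x, a_w)
    r_l = reward_model(prompt_ids, rejected_ids)  # r(x, a_l)
    # P(a_w > a_l | x) = sigmoid(r_w - r_l), so the per-pair NLL is
    # -log sigmoid(r_w - r_l); average over the batch.
    return -F.logsigmoid(r_w - r_l).mean()
```

Minimizing this loss is exactly the maximum-likelihood estimation described above: the sigmoid of the reward gap is the BTL preference probability.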

BTL is simple, differentiable, and handles noisy labels gracefully through its probabilistic formulation. But it makes one critical assumption: all annotators share the same reward function $r^*$. This is where things break down.

Anyone who has worked with annotators knows this assumption is false. For example, in my autonomous vehicle (AV) work, we had annotators label simulated and real driving scenes as safe versus unsafe, or as likely versus unlikely to result in a collision. People varied widely in what they considered safe, even with strict guidelines and training.

What's Wrong with BTL? Three Core Problems

1. The Averaging Problem

When human preferences are genuinely diverse, BTL doesn't capture that diversity—it averages over it. Consider a scenario where half your annotators prefer detailed, technical responses while the other half prefer concise, accessible ones. BTL will learn a reward function that produces... medium-length, moderately technical responses that neither group actually wants.

This isn't just theoretically concerning. Recent work from MiCRo (Shen et al., EMNLP 2025) proves that when preferences follow a mixture distribution over diverse subgroups, a single BTL model incurs an irreducible error. You cannot fit a unimodal BTL model to multimodal preference data without systematic bias.

More formally, if the true preference distribution is a mixture:

$\displaystyle P(a_w \succ a_l | x) = \sum_{k=1}^{K} P(z=k|x) \cdot P(a_w \succ a_l | x, z=k)$

where $z$ indexes latent annotator subpopulations, then standard BTL training converges to the single reward function that best fits the aggregate, roughly an average over the subpopulations, rather than to any subpopulation's actual preferences.
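
To see the averaging effect numerically, here is a small self-contained toy experiment (my own construction for illustration, not taken from the papers cited above). Two equally sized annotator groups judge responses by a single "verbosity" feature with opposite-sign weights (+2 and -2), and we fit one shared BTL weight by maximum likelihood:

```python
import torch

torch.manual_seed(0)

# One-dimensional toy: each response is summarized by a single "verbosity" score.
# Group A rewards verbosity (weight +2); group B penalizes it (weight -2).
true_weights = torch.tensor([2.0, -2.0])

n = 5000
f1 = torch.randn(n)                      # verbosity of response 1 in each pair
f2 = torch.randn(n)                      # verbosity of response 2 in each pair
group = torch.randint(0, 2, (n,))        # which subgroup labeled this pair
w_true = true_weights[group]

# Each annotator labels according to BTL under their own group's weight.
p_prefers_1 = torch.sigmoid(w_true * (f1 - f2))
label = (torch.rand(n) < p_prefers_1).float()   # 1 if response 1 is preferred

# Fit a single shared BTL reward r(x) = w * verbosity by maximum likelihood.
w = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    logits = w * (f1 - f2)               # predicted reward gap
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, label)
    loss.backward()
    opt.step()

print(f"fitted shared weight: {w.item():.3f}")   # ~0.0, vs. +2 and -2 per group
```

The fitted weight lands near zero: a reward model that is indifferent to the one feature both groups care strongly about, just in opposite directions. That is the irreducible error the mixture decomposition above makes explicit.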

2. Annotation Bias and Majority Dominance

The averaging problem becomes particularly pernicious when combined with imbalanced annotation populations. If 80% of your annotators come from one demographic or hold one set of values, the resulting reward model will largely ignore the remaining 20%.

The VPL paper (Poddar et al., NeurIPS 2024) illustrates this with an important example: imagine a college admissions chatbot where wealthy annotators have weak preferences about financial aid information, while low-income annotators strongly need it. Standard RLHF will learn to deprioritize financial aid discussions, actively harming the minority group.

3. The Need for Pluralistic Alignment

The deeper issue is philosophical: should AI systems have a single set of values, or should they adapt to diverse user needs? Many preference dimensions—like verbosity, formality, directness, and humor—have no objectively correct answer. Different users want different things.