Authors: **Yihe Deng (twitter)** and **Yu Yang (twitter)**

Date: 12/23/2024

We sincerely thank Fan Yin for valuable feedback on an early draft of this blog.

<aside> 💡

This blog draws on public information from arXiv, and all opinions are our own. We hope to explore and summarize relevant and interesting literature on this topic. Feel free to leave a comment or contact us with suggestions for improvement!

</aside>

> Spurious correlation or shortcut learning (Geirhos et al. 2020) in classification task is a concept closely related to reward hacking. — *Reward Hacking in Reinforcement Learning*

Lilian’s blog offers an excellent overview of reward hacking and briefly touches on its connection to spurious correlations. In this post, we expand on that link by examining how data and offline training processes can inadvertently reinforce spurious features—a phenomenon we refer to as Data-Induced Reward Hacking.

Through this discussion, we hope to shed light on the subtle yet significant role data plays in shaping unintended outcomes, encouraging further exploration of this critical challenge.

1. Introduction

Reward Hacking (Classic RL Context)

Data-Induced Reward Hacking

Bridging the Gap

Tie to Spurious Correlations: Spurious correlations arise when an irrelevant or secondary variable (for example, response length) aligns strongly with the target label (human preference). If the reward model latches onto this correlation, optimizing against it will reward longer responses rather than genuinely better ones. A toy illustration of this effect is sketched below.
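To make this concrete, here is a minimal, hypothetical sketch (not from any cited work) of how a reward model that only looks at length can still score highly on preference data where the chosen responses merely tend to be longer. The distributions and numbers below are assumptions chosen for illustration.

```python
# Hypothetical illustration: a spurious feature (response length) aligning with
# preference labels can make a length-only "reward model" look accurate.
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 1000

# Latent "true quality" of chosen vs. rejected responses (never used below).
quality_chosen = rng.normal(1.0, 1.0, n_pairs)
quality_rejected = rng.normal(0.0, 1.0, n_pairs)

# Assumed spurious correlation: chosen responses also tend to be longer.
len_chosen = rng.normal(300, 50, n_pairs)
len_rejected = rng.normal(200, 50, n_pairs)

# A "reward model" that scores responses by length alone.
def length_reward(length):
    return length  # longer => higher reward

# Preference accuracy: how often the length-only reward ranks the
# human-chosen response above the rejected one.
acc = np.mean(length_reward(len_chosen) > length_reward(len_rejected))
print(f"length-only reward model preference accuracy: {acc:.2%}")
```

Under these assumed distributions the length-only reward agrees with the human label most of the time, even though it never sees quality; a policy optimized against such a reward would simply inflate response length.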


2. Spurious Correlation