Authors: **Yihe Deng (twitter)** and **Yu Yang (twitter)**

Date: 12/23/2024

We sincerely thank Fan Yin for valuable feedback on an early draft of this blog.

<aside> 💡

This blog draws on public information from arXiv, and all opinions are our own. We hope to explore and summarize relevant and interesting literature on this topic. Feel free to leave a comment or contact us with suggestions for improvement!

</aside>

> Spurious correlation or shortcut learning (Geirhos et al. 2020) in classification task is a concept closely related to reward hacking. — *Reward Hacking in Reinforcement Learning*

Lilian’s blog offers an excellent overview of reward hacking and briefly touches on its connection to spurious correlations. In this post, we expand on that link by examining how data and offline training processes can inadvertently reinforce spurious features—a phenomenon we refer to as Data-Induced Reward Hacking.

Through this discussion, we hope to shed light on the subtle yet significant role data plays in shaping unintended outcomes, encouraging further exploration of this critical challenge.

1. Introduction

Reward Hacking (Classic RL Context)

Data-Induced Reward Hacking

Bridging the Gap

Tie to Spurious Correlations: Spurious correlations arise when an irrelevant or secondary variable (for example, response length) aligns strongly with the target label (human preference). If the reward model latches onto this correlation, optimizing against it will reward longer responses rather than genuinely better ones. A toy illustration of this effect is sketched below.
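To make this concrete, here is a minimal, hypothetical sketch (not from any cited work) of how a reward model that only looks at length can still score highly on preference data where the chosen responses merely tend to be longer. The distributions and numbers below are assumptions chosen for illustration.

```python
# Hypothetical illustration: a spurious feature (response length) aligning with
# preference labels can make a length-only "reward model" look accurate.
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 1000

# Latent "true quality" of chosen vs. rejected responses (never used below).
quality_chosen = rng.normal(1.0, 1.0, n_pairs)
quality_rejected = rng.normal(0.0, 1.0, n_pairs)

# Assumed spurious correlation: chosen responses also tend to be longer.
len_chosen = rng.normal(300, 50, n_pairs)
len_rejected = rng.normal(200, 50, n_pairs)

# A "reward model" that scores responses by length alone.
def length_reward(length):
    return length  # longer => higher reward

# Preference accuracy: how often the length-only reward ranks the
# human-chosen response above the rejected one.
acc = np.mean(length_reward(len_chosen) > length_reward(len_rejected))
print(f"length-only reward model preference accuracy: {acc:.2%}")
```

Under these assumed distributions the length-only reward agrees with the human label most of the time, even though it never sees quality; a policy optimized against such a reward would simply inflate response length.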


2. Spurious Correlation