Hao Sun, [email protected], Twitter: @HolarisSun
Department of Applied Mathematics and Theoretical Physics
University of Cambridge
💡A Crash Introduction to RL in the Era of LLMs: What is Essential, RLHF, Prompting, and Beyond.
Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). In this post, we aim to link research in conventional RL to the RL techniques used in LLM research, and to demystify RLHF by discussing why, when, and how it excels. Furthermore, we explore potential future avenues that could either benefit from or contribute to RLHF research.
In this section, we will briefly introduce some basic concepts needed in our discussion later. We offer two versions for our readers. For those new to the topic or preferring a more accessible overview, we first provide a layman-friendly explanation (Important Intuitions). For those well-versed in the field or seeking a deeper dive, a technical version is also available. Our goal is to ensure everyone, regardless of their background, can grasp the intricacies of RLHF and its impact on Large Language Models.
Readers familiar with Reinforcement Learning (RL) and Inverse Reinforcement Learning (IRL) can skip this section.
In Reinforcement Learning (RL), an agent learns through interacting with an environment and receiving feedback in the form of rewards. The fundamental objective of RL is to find a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward over time.
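The agent-environment loop above can be sketched in a few lines of code. The toy corridor environment and the "always move right" policy below are hypothetical examples chosen for illustration, not part of any RLHF system; the point is only to make the state → action → reward cycle and the cumulative-reward objective concrete.

```python
# A minimal sketch of the RL interaction loop, assuming a toy 1-D corridor
# environment: the agent starts at position 0 and earns reward +1 upon
# reaching the goal position. (Hypothetical example for illustration.)
class CorridorEnv:
    def __init__(self, goal=3):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is +1 (move right) or -1 (move left); positions stay >= 0
        self.state = max(0, self.state + action)
        done = self.state == self.goal
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def policy(state):
    # A policy maps states to actions; this trivial one always moves right.
    return +1


def run_episode(env, policy, max_steps=10):
    # The agent-environment loop: observe a state, act, receive a reward,
    # and accumulate rewards over the episode.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(CorridorEnv(), policy))
```

Finding a good policy means searching over mappings like `policy` above to maximize the expected cumulative reward; here the always-right policy happens to be optimal for this corridor, but in general the policy must be learned from interaction.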