☀️ AI Morning Minute: RLHF
The training trick that turned raw AI into something you'd actually talk to
The first version of GPT-3 was incredibly powerful, yet almost unusable. It could generate fluent text on any topic, but it would also ramble, contradict itself, dodge simple questions, and produce content nobody wanted. The fix wasn't a bigger model. It was teaching the model what humans actually preferred.
What it means:
RLHF stands for Reinforcement Learning from Human Feedback. It's a training technique where humans rank different AI responses to the same prompt, and those rankings are used to teach the model which kinds of answers people actually want. The model gets a reward signal based on human preferences and gradually shifts its behavior to produce more of the preferred responses. RLHF is the bridge between a raw language model that predicts text and a chatbot you'd actually find helpful.
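The ranking step can be sketched in a few lines. Below is a toy illustration (not any lab's actual pipeline) of the "reward model" idea: human comparisons between pairs of responses train a scorer, and that score becomes the reward signal the model is later pushed to maximize. The two features and all the numbers here are made up for the demo.

```python
import math

# Each response is reduced to two hypothetical features for simplicity:
# (stays_on_topic, sounds_confident), each in [0, 1].
# Human-labeled comparisons: for the same prompt, the first response
# was preferred over the second.
comparisons = [
    ((0.9, 0.6), (0.2, 0.8)),   # on-topic answer beat a confident ramble
    ((0.8, 0.5), (0.3, 0.9)),
    ((0.7, 0.7), (0.1, 0.6)),
]

w = [0.0, 0.0]  # reward model weights, one per feature

def reward(x):
    """Scalar score the reward model assigns to a response."""
    return w[0] * x[0] + w[1] * x[1]

# Bradley-Terry style training: push P(preferred beats rejected) toward 1,
# where P = sigmoid(reward(preferred) - reward(rejected)).
lr = 0.5
for _ in range(200):
    for preferred, rejected in comparisons:
        p = 1 / (1 + math.exp(-(reward(preferred) - reward(rejected))))
        grad = 1 - p  # gradient of the log-likelihood w.r.t. the score gap
        for i in range(2):
            w[i] += lr * grad * (preferred[i] - rejected[i])

# The learned scorer now ranks a new on-topic answer above a confident
# ramble; in full RLHF, reinforcement learning would then tune the
# language model to produce responses that maximize this scalar.
print(reward((0.85, 0.5)) > reward((0.2, 0.9)))  # True
```

The key design point: humans never write ideal answers here, they only compare, which is a much cheaper signal to collect at scale.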
Why it matters:
It’s the reason ChatGPT happened. OpenAI applied RLHF to GPT-3 to create InstructGPT in early 2022, and the results were striking. Human reviewers preferred outputs from the 1.3-billion-parameter version of InstructGPT over the 175-billion-parameter version of GPT-3. A smaller model trained with RLHF beat a model roughly 130 times larger without it.
Every major chatbot uses some version of it. Claude, Gemini, ChatGPT, and other production models all rely on RLHF or close cousins to stay helpful and avoid harmful outputs. Without this step, you’d be talking to a very smart machine that couldn’t take direction, refuse a bad request, or stay on topic.
It has known failure modes. Models trained with RLHF can learn to sound confident even when they’re wrong, because confident answers tend to get higher ratings from human reviewers. They can also learn to flatter the person asking, since flattery scores well too. The technique works, but it teaches models what humans prefer, not what’s actually true.
Simple example:
You hire a new pastry chef. They can technically bake anything (croissants, cakes, bread, eclairs), but everything they make is just okay. So you start tasting their work and telling them which version you like better. "This croissant is flakier than that one, do more of that." "This frosting is too sweet, dial it back." Over a few months, the chef learns your preferences and starts producing the kind of pastries you actually want to sell.
They didn't get more talented. They got better at reading the room. RLHF is the part of AI training where the model learns to read the room.

