☀️ AI Morning Minute: GDPval

The benchmark that asks whether AI can do your actual job

Apr 17, 2026

Most AI benchmarks test whether a model can solve math problems, write code, or answer exam questions. None of those tell you whether the model can do the thing you get paid for on a Tuesday morning. OpenAI built a benchmark that tries to answer that question directly, and the results are making people uncomfortable.

What it means

GDPval is an AI evaluation framework from OpenAI that measures model performance on real knowledge work tasks across 44 occupations and 9 industries. The occupations were selected from the sectors that contribute the most to US GDP, from software developers and lawyers to registered nurses and mechanical engineers.

Each task was designed by professionals averaging 14 years of experience, and the deliverables aren’t short text answers. They’re presentations, spreadsheets, diagrams, reports, and schedules. Industry experts then grade the AI’s output blindly against work done by a human professional.
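One way to picture how blind grading turns into a headline number: count the tasks where the grader rated the AI deliverable equal to or better than the human one, and divide by the total. The sketch below is only an illustration of that arithmetic. The verdict labels, the sample data, and the function name `win_or_tie_rate` are all made up for this example, not OpenAI's actual grading pipeline or results.

```python
from collections import Counter

# Hypothetical blind-grading verdicts, one per task. Each says how the AI
# deliverable compared with the human professional's deliverable.
# These values are invented for illustration, not GDPval data.
VERDICTS = ["better", "worse", "equal", "worse", "better", "worse", "equal", "worse"]


def win_or_tie_rate(verdicts):
    """Share of tasks where the AI output was rated equal to or better than the human reference."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["better"] + counts["equal"]) / len(verdicts)


rate = win_or_tie_rate(VERDICTS)
print(f"win-or-tie rate: {rate:.0%}")
```

With the made-up list above, 4 of 8 verdicts are "better" or "equal", so the rate is 50%.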

Why it matters

The results are closer than most people expected. In blind evaluations, experts rated the best AI models’ work as equal to or better than the human reference in nearly half of all tasks. Claude and GPT-5 both approached expert-level quality on structured, well-defined assignments.
The speed and cost gap is enormous. AI completed the tasks roughly 100 times faster and 100 times cheaper than human experts. That doesn’t mean it’s ready to replace anyone, but it does mean the economics of knowledge work are shifting. At that ratio, a first draft that takes a professional four hours takes a model under three minutes.
The benchmark has real limitations. Every task is self-contained, with clear instructions and reference files. Real jobs aren’t like that. They involve ambiguous requirements, back-and-forth with clients, reading between the lines, and building trust over time. GDPval tests the discrete, computer-based steps of a job, not the messy human parts that hold the work together.
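The "roughly 100 times faster" figure above is easy to sanity-check with a little arithmetic. This is a rough sketch, not a measurement: the four-hour task is just an example, and the function names `model_time_minutes` and `implied_speedup` are hypothetical helpers written for this post.

```python
def model_time_minutes(human_hours, speedup):
    """Minutes a model would need if it works `speedup` times faster than a human."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return human_hours * 60 / speedup


def implied_speedup(human_hours, model_minutes):
    """How many times faster the model is, given both completion times."""
    if model_minutes <= 0:
        raise ValueError("model_minutes must be positive")
    return human_hours * 60 / model_minutes


# A four-hour human task at a 100x speedup.
print(model_time_minutes(4, 100))  # 2.4

# Going the other way: a four-hour task finished in 2.4 minutes.
print(implied_speedup(4, 2.4))
```

The same division applies to cost: swap hours for dollars and the ratio tells you what a 100x cheaper draft would run.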

Simple example

A driving test checks whether you can parallel park, signal a lane change, and stop at a red light. It doesn’t check whether you can navigate rush hour traffic while your kids fight in the backseat and your GPS reroutes you through construction.

GDPval is the driving test for AI and knowledge work. It proves the model can handle the isolated tasks. It doesn’t prove the model can handle the job. But if you can’t pass the driving test, you definitely can’t handle the commute, and right now AI is passing.

