☀️ AI Morning Minute: Humanity's Last Exam
The test designed to be impossible for AI, and it is...for now
AI models have been acing every test we throw at them. Graduate-level science questions, PhD-level reasoning, medical licensing exams. The scores keep climbing, and the benchmarks keep losing their usefulness.
So nearly a thousand researchers from around the world got together and built one final exam, specifically designed to stay hard.
What it means
Humanity’s Last Exam (HLE) is a benchmark of 2,500 expert-level questions spanning over a hundred subjects, from mathematics and physics to ancient languages and bird microanatomy. It was created by the Center for AI Safety and Scale AI, published in Nature in 2026, and built with a specific rule: if an AI could answer a question correctly during the testing phase, that question was removed. The exam was engineered to sit at the edge of human expert knowledge, in territory where even PhD specialists answering questions in their own field average only around 90%.
Every question has a single correct answer that’s verifiable but can’t be solved by searching the internet.
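For the technically curious, that removal rule is just an adversarial filter. Here’s a minimal sketch of the idea in Python; the `answers_correctly` helper and the model objects are hypothetical stand-ins, not the actual HLE pipeline:

```python
# Minimal sketch of HLE-style adversarial filtering (illustrative only,
# not the authors' actual pipeline). `answers_correctly` is a hypothetical
# stand-in for running a frontier model and grading its answer.

def answers_correctly(model, question) -> bool:
    # Placeholder grading: compare the model's answer to the verified solution.
    return model.answer(question["prompt"]) == question["solution"]

def adversarially_filter(candidates, frontier_models):
    """Keep only the questions that every tested frontier model gets wrong."""
    return [
        q for q in candidates
        if not any(answers_correctly(m, q) for m in frontier_models)
    ]
```

By construction, every surviving question sits beyond the frontier on day one, which is what makes the score climb afterward worth watching.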
Why it matters
The best AI models are still failing most of it. As of April 2026, the top model (Gemini 3.1 Pro Preview) scores 44.7%. GPT-5.4 hits 41.6%. Claude Opus 4.6 with extended thinking reaches 34.4%. Human domain experts average about 90%. That’s a gap no other benchmark shows as clearly, because most other tests have already been saturated at the top.
The gap is closing fast. Frontier models gained 30 percentage points on HLE in a single year. That pace means a benchmark built to last years could be beaten in months. The researchers designed it as “the final closed-ended academic benchmark,” but whether it stays final depends on whether AI progress keeps accelerating.
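How fast is that, exactly? Here’s the back-of-the-envelope version, using the scores quoted above; treat the linear extrapolation as an illustration, not a forecast:

```python
# Rough linear extrapolation from the figures quoted above (illustrative
# only; real progress is uneven and may accelerate or stall).

top_score = 44.7       # best model score, April 2026 (%)
expert_score = 90.0    # human domain-expert average (%)
gain_per_year = 30.0   # frontier gain over the past year (percentage points)

months_to_parity = (expert_score - top_score) / gain_per_year * 12
print(f"~{months_to_parity:.0f} months to expert parity at the current pace")
# -> ~18 months, sooner if the pace keeps accelerating
```

Roughly a year and a half at the current pace, and less if progress keeps accelerating.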
It exposes what benchmarks can and can’t tell you. Scoring well on HLE means a model has deep, cross-domain knowledge. But it doesn’t mean the model can do a job, handle ambiguity, or work with a team. HLE tests knowledge. GDPval tests work output. ARC-AGI-3 tests adaptive reasoning. No single exam captures the full picture, and anyone who tells you otherwise is selling something.
Simple example
A university builds a final exam so hard that the best students in every department can barely pass. Then they give it to an AI. The AI bombs it. A year later, they give it again. The AI gets a third of the answers right. Six months after that, it’s closing in on half.
The exam hasn’t changed. The student keeps getting smarter. The professors are running out of harder questions to ask, and that’s the point. HLE isn’t just measuring AI. It’s measuring how much time we have before the tests stop working.

