☀️ AI Morning Minute: Interpretability
We’ve built AI systems that work. We mostly don’t know why.
When an AI model gives you a wrong answer, or a biased one, or a dangerous one, the natural question is: what happened inside the model that produced that output? Right now, we largely can’t answer that. The model takes in text, runs it through billions of mathematical operations, and produces a response.
What it’s actually “thinking” along the way is opaque, even to the people who built it. Interpretability is the field trying to change that.
What it means
Interpretability is the study of how AI models work on the inside: which parts of the model activate for which concepts, how information flows through the system, and what internal representations the model builds to produce its outputs. Mechanistic interpretability, a prominent branch of the field, tries to reverse-engineer those internals into components humans can understand. Anthropic has made interpretability a core research priority, publishing work on identifying specific “features” inside Claude that correspond to concepts like emotions, intentions, and abstract ideas.
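In 2023, Anthropic researchers mapped features inside a small language model, and in 2024 they scaled the approach to Claude 3 Sonnet. The work confirmed that individual neurons often respond to several unrelated concepts at once, a property called polysemanticity. The leading explanation is superposition: the model represents more concepts than it has neurons, so each concept is spread across many neurons and each neuron takes part in many concepts. That is why reading a model one neuron at a time reveals so little.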
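To make the idea of concepts sharing neurons concrete, here is a minimal toy sketch in Python. It is not Anthropic's method or code. The three feature names, the two-neuron layer, and the evenly spaced directions are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy setup: 3 concepts ("features") stored in only 2 neurons.
# Each feature gets a unit-length direction in the 2-neuron activation space,
# spaced 120 degrees apart. Names and geometry are illustrative assumptions.
FEATURES = ["cat", "legal text", "python code"]
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(len(FEATURES))
]


def encode(strengths):
    """Return the two neuron activations for the given feature strengths."""
    n0 = sum(s * d[0] for s, d in zip(strengths, DIRECTIONS))
    n1 = sum(s * d[1] for s, d in zip(strengths, DIRECTIONS))
    return (n0, n1)


def decode(activations):
    """Read each feature back by projecting onto its direction."""
    return [activations[0] * d[0] + activations[1] * d[1] for d in DIRECTIONS]


# How strongly neuron 0 responds to each feature: nonzero for all three,
# so this single neuron has no single meaning.
neuron0_weights = [d[0] for d in DIRECTIONS]

# Activate only the first feature, then try to read all three back.
readout = decode(encode([1.0, 0.0, 0.0]))

print([round(w, 2) for w in neuron0_weights])  # [1.0, -0.5, -0.5]
print([round(x, 2) for x in readout])          # [1.0, -0.5, -0.5]
```

The first feature is recovered at full strength, but the other two read back as -0.5 instead of 0. That interference is the price of fitting three concepts into two neurons, and it is one reason a single neuron's activity is hard to interpret on its own.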
Why it matters
Safety depends on it. If we can’t see inside a model, we can’t verify that it’s actually doing what we think it’s doing. A model might produce correct outputs for the right reasons, or correct outputs for the wrong ones. Without interpretability tools, there’s no reliable way to tell the difference before something goes wrong at scale.
Interpretability research has already produced useful findings. Anthropic’s work identified a feature inside Claude associated with the “Assistant” token that, when examined, showed signs of what researchers described as a kind of suppressed unease. Whether that reflects something real about the model’s internal state is an open question, but the fact that we can now ask it is new.
Regulators are starting to require it. The EU AI Act includes provisions around explainability for high-risk AI systems: the ability to account for why a model made a particular decision. Interpretability is the technical foundation that makes those requirements possible to meet. Right now, most deployed systems can’t fully meet that bar.
Simple example
A radiologist using an AI diagnostic tool gets a flagged scan. The tool says cancer with 94% confidence. The radiologist wants to know what the model saw. Without interpretability, the answer is essentially: we don’t know, trust the number. With interpretability tools, you could in principle see exactly which pixels or patterns drove the prediction. Medicine, law, and finance all have this problem. The number isn’t enough. You need the reasoning.
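One simple way to ask “which pixels drove the prediction” is occlusion sensitivity: blank out one pixel at a time and measure how much the model’s score drops. The sketch below shows the idea in Python. The toy_model function and the 4x4 scan are hypothetical stand-ins, not a real diagnostic system, and real tools work on far larger images and models.

```python
def toy_model(image):
    """Hypothetical stand-in for a diagnostic model.

    Returns the mean brightness of the centre 2x2 region of a 4x4 image,
    used here as a pretend "suspicion score" between 0 and 1.
    """
    centre = [image[r][c] for r in (1, 2) for c in (1, 2)]
    return sum(centre) / len(centre)


def occlusion_map(model, image, baseline=0.0):
    """Return, for each pixel, the score drop when that pixel is blanked."""
    base_score = model(image)
    heatmap = []
    for r, row in enumerate(image):
        heat_row = []
        for c, _ in enumerate(row):
            occluded = [list(x) for x in image]  # copy so the input is untouched
            occluded[r][c] = baseline
            heat_row.append(base_score - model(occluded))
        heatmap.append(heat_row)
    return heatmap


# Made-up 4x4 "scan" with a bright patch in the middle.
scan = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]

heatmap = occlusion_map(toy_model, scan)
for row in heatmap:
    print([round(v, 3) for v in row])
```

Pixels with large values are the ones the score depends on, and pixels at zero had no effect. Here only the four centre pixels matter, which is what the radiologist would want to see highlighted. A map like this shows where the model looked, not why, so it is a starting point for the reasoning and not a substitute for it.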

