Last week I wrote about the model you build on but don't control: the one a vendor can pull, reprice, or cut off. There was a detail in it worth going into. A colleague was sure her new model was eighty percent better than anything she'd used all year. Most of us know that feeling: the output can be very good, sometimes, but not always. That is the thread I want to pull. Even when it feels better, how would you know for sure, across the full range of what you do? And would that hold on the next kind of problem, or survive the next version?
That gap is the subject here. Last week was the loss you can see: a model pulled or repriced, where you know the instant it happens. This is the one you can't. The model is still there, but its capabilities are ones you can't fully map, and can't trust from one release to the next.
I ran into it myself recently. A newer version of the model I lean on as a writing coach replaced the one before it. Same chat box, same screen, a version number the only thing marking the change. On the same base materials, with the same request to sharpen an argument, it was worse: way wordier, quicker to praise, slower to push back. Newer was not better.
Let me be precise, because that is the whole point here. It was worse for me, on this specific task. On a benchmark the new version may well score higher. But "better" is not something a model is. It's something a model is, on a specific task. And the task that matters to you is rarely the one the headline measures.
So how would you know what your model can do, if all you ever do is type into the same input box? And how would you know whether the next version is better, the same, or worse? The interface stays still while the thing behind it moves, and the sameness hides everything that matters.
Three different things make generative AI uncertain, and it helps to keep them apart. First, it is non-deterministic: the same input can give you a different answer from one run to the next. Second, it is a black box: you cannot see where it is strong or weak, and a higher benchmark score can say almost nothing about the narrow capability your process leans on. It won't tell you whether to run a task on high effort or low, either. Third, it moves: the model behind the same screen changes over time, usually without telling you. Sonnet 5 landed this week, claimed and benchmarked to be close to Opus 4.8 on average and ahead of it on some knowledge work. Better on average. But better at your task? You cannot tell from the box. What I ran into with my writing coach was that third kind, and I only caught it because it had started praising my drafts where I had come to rely on sharp critique.
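The first of those three, run-to-run variation, is the one you can put a rough number on. A minimal sketch, assuming you can wrap your model in a plain function that takes a prompt and returns text: `ask`, `agreement_rate`, and `make_fake_model` are names invented for this illustration, not part of any vendor's API. It only makes sense for short answers you can compare exactly (a yes/no, a label, a number); free-form prose would need a similarity measure instead.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def agreement_rate(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Share of runs that return the most common answer (1.0 = fully repeatable)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Ask the same question several times and normalise trivial differences.
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def make_fake_model(replies: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model: cycles through canned replies."""
    source = cycle(replies)
    return lambda prompt: next(source)
```

With a stand-in that answers "yes", "yes", "no", "yes", `agreement_rate(make_fake_model(["yes", "yes", "no", "yes"]), "Is this argument sound?", runs=4)` comes out at 0.75. A low rate on a real model tells you a single run is not a fair sample of what it does.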
That last part, catching it, is the part most people get wrong. The risk is not spread evenly. It depends on two things: how bad a wrong answer would be, and how easily you would notice it was wrong.
Map your AI uses against those two questions and most of them sort themselves out. Where the stakes are low, or where a mistake shows itself the second it appears, uncertainty is fine. Where the stakes are high but you can build a cheap check (a second source, a rule, a human glance), it is manageable. Implement the check, and move on.
The danger is the last group: high stakes, and errors you would not catch. And here the two questions stop being separate. You only notice a wrong answer if someone on your team can grade it. A frontier model will attempt any task you hand it, including the ones nobody you employ is equipped to judge. So the work where you most need the model is often the exact work where you are least able to tell when it has failed. That is not a gap you close by reading a release note.
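The sorting described above is simple enough to write down. A minimal sketch, assuming you answer three yes/no questions per use: the names `Risk` and `triage` and the example uses are invented for this illustration, and real cases will be less clear-cut than a boolean.

```python
from enum import Enum


class Risk(Enum):
    FINE = "fine: live with the uncertainty"
    MANAGEABLE = "manageable: build the cheap check and move on"
    DANGER = "danger: validate on your own tasks"


def triage(high_stakes: bool, error_is_obvious: bool, cheap_check_exists: bool) -> Risk:
    """Sort one AI use by how bad a wrong answer is and how easily you'd notice it."""
    # Low stakes, or a mistake that shows itself immediately: uncertainty is fine.
    if not high_stakes or error_is_obvious:
        return Risk.FINE
    # High stakes, but a second source, a rule, or a human glance can catch errors.
    if cheap_check_exists:
        return Risk.MANAGEABLE
    # High stakes and nobody on the team can grade the answer.
    return Risk.DANGER


# Hypothetical examples: (high_stakes, error_is_obvious, cheap_check_exists)
uses = {
    "brainstorming headlines": (False, False, False),
    "invoice totals": (True, False, True),
    "analysis nobody in-house can review": (True, False, False),
}
for name, answers in uses.items():
    print(f"{name}: {triage(*answers).value}")
```

Only the last of those three examples lands in the danger group, which is the point: the list of uses that need real validation is usually short.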
You might assume public benchmarks cover you here. They do not. METR's time-horizon work, a serious AI evaluation effort, is narrow even at its best: mostly software tasks, scored on how long a job a model can finish on its own, and the number it produces describes something narrower than the headline graph suggests. That is the state of the art in seeing what a model can do, and it still doesn't map the corner your business actually runs on, because no public test knows what that corner is. You cannot outsource the validation of your own niche to someone else's leaderboard.
For those cases, and only those, there is a concrete reason to run part of the stack yourself. Not as a principle. As a tool. When you control the model, you can pin a version, keep a known-good baseline, and test each new release against your own tasks before you trust it with them. You cannot do this for everything, and you shouldn't. The aim should never be to remove every risk. It is to identify which risks you can live with and which you cannot, then spend your effort on the second kind.
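In practice, "test each new release against your own tasks" can start very small. A minimal sketch, assuming each model version can be wrapped in a function from prompt to text and that you can score an output between 0 and 1: `compare_release`, `Comparison`, `brevity_grade`, and the 0.05 tolerance are all invented for this illustration. The toy grader only counts words; a real one would encode whatever "good" means for your task.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence

Model = Callable[[str], str]          # prompt -> output
Grader = Callable[[str, str], float]  # (task, output) -> score in [0, 1]


@dataclass
class Comparison:
    baseline_mean: float
    candidate_mean: float
    regressions: List[str]  # tasks where the candidate scored clearly lower


def compare_release(baseline: Model, candidate: Model, tasks: Sequence[str],
                    grade: Grader, tolerance: float = 0.05) -> Comparison:
    """Score a pinned baseline and a new release on the same tasks."""
    if not tasks:
        raise ValueError("need at least one task")
    base_scores = [grade(task, baseline(task)) for task in tasks]
    cand_scores = [grade(task, candidate(task)) for task in tasks]
    # A regression is any task where the new release falls below the
    # baseline score by more than the tolerance.
    regressions = [task for task, b, c in zip(tasks, base_scores, cand_scores)
                   if c < b - tolerance]
    return Comparison(mean(base_scores), mean(cand_scores), regressions)


def brevity_grade(task: str, output: str) -> float:
    """Toy grader (hypothetical): full marks for replies of at most 12 words."""
    return 1.0 if len(output.split()) <= 12 else 0.0
```

Run with a terse baseline and a wordier candidate, the candidate's tasks show up in `regressions` even if its headline numbers elsewhere are higher. Because outputs vary from run to run, a single pass per task is noisy; repeating each task a few times and averaging would make the comparison steadier.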
The labs will keep shipping versions and calling each one better. The word carries less than it seems to. Better at what, for whom, on which task? For most of what you do, you will never need to answer that. For the few things your business actually runs on, there is only one way out of the danger zone: stop taking the vendor's word for "better" and check it against the one benchmark that counts, your own work.