A few months ago, I watched a team of external consultants spend over a month building a multi-agent AI setup for Cloud FinOps. The goal was detecting cost anomalies in cloud infrastructure, the kind of problem data scientists have been solving effectively for years using classical statistical methods and traditional machine learning.

The experiment didn't work. The agents couldn't reliably do what a well-designed detection model would have handled in a fraction of the time and cost. I'll come back to what they should have done instead. That's where the real lesson is.

But before we get there, there's some background every leader needs to understand.

Across industries, AI is being presented as a uniformly transformative force. METR research shows exponential progress curves [1]. Vendor decks and consultancy slides argue that anything short of an AI-first strategy is a competitive risk. The message is always the same: AI capability is accelerating across the board, and you need to keep up.

The potential is real. AI is already unlocking capabilities that simply weren't viable before. But open LinkedIn on any given morning and every other post promises it will reshape your entire business by next quarter. That's excitement, not strategy.

The reality is more nuanced than the pitch suggests.

The most dramatic AI improvements over the past two years have been in one domain: software development. That's not a coincidence. Writing code plays directly to what large language models do well. Code is a form of structured text. It has clear success criteria: it runs or it doesn't. And there are vast amounts of high-quality examples to learn from, including decades of open-source software.

But even in this ideal domain, simply pouring more money into training models didn't produce the leaps we're seeing now. AI models showed little meaningful progress on coding tasks until late 2024. The breakthroughs came after that, when teams of expert programmers built what are called coding harnesses around those models: carefully engineered programs that create plans, call the model when code is needed, run checks, connect to developer tools, and guide the entire process. These harnesses contain large amounts of hand-coded logic, something a recent leak of Anthropic's Claude Code source code confirmed in detail. The progress is less "the AI figured it out" and more "a highly specialized tool, built specifically for coding, got better at coding."

That's genuinely impressive and useful. But it comes with two caveats that every leader should keep in mind.

First, there are no equivalent harnesses for most of the problems that actually sit on a leadership agenda. Employee productivity. Customer experience. Core business processes and decision-making. These are messy, ambiguous, and deeply context-dependent. The engineering that made AI effective at coding simply doesn't exist for these domains, and there is no evidence it will any time soon.

Second, even if that engineering did exist, there's no guarantee it would produce the same kind of results. Coding had near-perfect conditions: structured input, clear success criteria, abundant training data. Most business problems don't come with those advantages.

Which is exactly what happened with that Cloud FinOps experiment. The team reached for the most impressive-sounding AI approach without asking a more basic question: what does this problem actually need?

Detecting cost anomalies didn't need agents. It needed statistical methods that have worked reliably for decades, possibly built faster with AI coding tools, which is where those improvements genuinely help. And once anomalies are detected? A large language model could have automatically written human-readable incident tickets for each one. That would have been deploying AI where it actually adds value: not as a replacement for proven methods, but as the layer that makes the output actionable.
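For readers who want to see how small that combination can be, here is a minimal sketch. It is an illustration under stated assumptions, not what the team built or should have shipped: the function names, the 7-day window, and the 3.5 threshold are all hypothetical choices, and the detector uses one classical technique among many (a modified z-score based on the median and the median absolute deviation). The second function only assembles the text you would hand to a language model; no particular model or API is assumed.

```python
from statistics import median


def detect_cost_anomalies(daily_costs, window=7, threshold=3.5):
    """Flag days whose cost deviates strongly from the trailing window.

    Uses the modified z-score (median and MAD), a classical robust statistic.
    The window and threshold defaults are illustrative assumptions.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = median(history)
        mad = median([abs(x - baseline) for x in history])
        if mad == 0:
            # Perfectly flat history gives no spread to scale by; this sketch
            # simply skips such days rather than guessing a fallback rule.
            continue
        score = 0.6745 * (daily_costs[i] - baseline) / mad
        if abs(score) > threshold:
            anomalies.append(
                {"day": i, "cost": daily_costs[i], "baseline": baseline, "score": score}
            )
    return anomalies


def build_ticket_prompt(anomaly):
    """Assemble the text a language model would turn into a readable ticket."""
    return (
        "Write a short incident ticket for a cloud cost anomaly.\n"
        f"Day index: {anomaly['day']}\n"
        f"Observed cost: {anomaly['cost']}\n"
        f"Typical cost (trailing median): {anomaly['baseline']}\n"
        "Explain the size of the deviation and suggest first checks."
    )


if __name__ == "__main__":
    # Made-up daily costs with one obvious spike, purely for illustration.
    costs = [100, 102, 98, 101, 99, 103, 100, 250, 101]
    for anomaly in detect_cost_anomalies(costs):
        print(build_ticket_prompt(anomaly))
```

The detection step is deterministic, cheap, and easy to audit. The language model only touches the last step, where wording matters more than arithmetic.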

Over a month of consulting produced nothing deployable. The right solution was classical detection plus AI-generated communication.

When an AI strategy or proposal reaches the boardroom, here's what every leader should be asking.

What would the non-AI solution look like? If nobody has considered one, the proposal is grounded in the technology, not the problem. And if there is a simpler approach, it should be crystal clear why AI, and the cost and complexity that come with it, delivers enough additional value to justify that gap.

How will you measure success? The KPIs need to be established upfront, specific enough to reflect actual outcomes, with a clear timeline for evaluation. A list of fifty metrics enables cherry-picking and storytelling. One or two that capture the real impact force honesty.

And ask yourself whether the proposal exists because AI is the right solution, or because AI projects get attention. Nobody gets promoted for the problems they silently prevent. The unglamorous solution might be cheaper, lower risk, and faster to deploy. But it doesn't get your name mentioned in the meetings where careers are decided.

When someone cites impressive benchmark results, ask: in which domain? And does that domain look anything like ours?

The chart will keep going up. Make sure the people reading it for you know the difference between what AI can do and what they're selling you.

[1] If you've come across the METR time horizon chart that's been getting attention recently: it measures how well AI models, combined with heavily engineered coding harnesses, can complete specific software tasks. The latest models show steep improvement. But METR themselves note this captures performance on specific coding tasks, not general AI capability, and the human baseline reflects low-context testers, not experienced professionals. The full methodology is at metr.org/time-horizons.

What Every Leader Needs to Know About AI Progress