The Evaluation Gap — Kevin Clark

When an enterprise deploys an AI system for professional work, it faces a question that sounds simple and is not: is this good enough?

Good enough for what? Good enough compared to what baseline? Good enough according to whose standards? Good enough to justify the cost? Good enough to trust with decisions that matter?

Most organizations cannot answer these questions about the AI systems they have already purchased. That is not a technical gap. The technical methods for evaluating AI systems are well developed and getting better every quarter. It is a leadership gap, and it is causing organizations to make investment mistakes in both directions, funding tools that quietly fail and killing tools that quietly work, on the basis of intuition and anecdote rather than evidence.

The misallocation runs larger than most executives realize because it is invisible by construction. A tool that was never measured cannot show up as a failure in any report. It shows up as a renewal line item, a shrug in a steering committee, and a workforce that has drifted back to the old way of working while the dashboard claims four hundred licensed seats.

The three ways organizations fail to evaluate

In my work running AI strategy and deployment for a professional services firm, and in the executive communities I facilitate where enterprise leaders compare notes on exactly these decisions, I see the same three failure patterns over and over. Nearly every stalled AI investment falls into one of them.

Optimistic deployment. The organization ships without any evaluation framework at all. The demo was impressive, the vendor’s case studies were impressive, and the implicit theory is that value will be obvious once people start using it. This pattern fails quietly, which is the worst way to fail. There is no moment when the tool is declared dead. Usage just decays, the champion changes roles, and eighteen months later someone in finance asks what this line item is. Because no baseline was ever captured, nobody can say what the tool was supposed to improve, which means nobody can say whether it did. The investment is unfalsifiable, and unfalsifiable investments are how budgets die.

Evaluating against the wrong criteria. The organization does evaluate, but it evaluates the model instead of the job. Benchmark scores, leaderboard positions, and vendor bake-offs on general tasks tell you which system is most capable in the abstract. They tell you almost nothing about whether a system can produce a client deliverable in your house style, handle your industry’s edge cases, or fit inside a workflow your team will tolerate. The gap between benchmark performance and job performance is enormous and systematically underestimated, because benchmarks are legible and jobs are not. I have watched organizations select the objectively stronger model and get worse outcomes, because the evaluation never included the only test that mattered: this work, done by these people, under these constraints.

Pilot limbo. The organization runs a pilot but never agreed in advance on what would count as success, so the pilot cannot end. Advocates point to the wins. Skeptics point to the failures. Both are right, because nobody pre-committed to a threshold that would settle the question. The pilot gets extended, then extended again, consuming attention and goodwill until everyone involved quietly stops showing up. Pilot limbo is the most expensive failure mode of the three, not because of the direct cost but because it teaches the organization that AI initiatives do not conclude. They just fade. That lesson poisons every initiative that follows.

What a working evaluation discipline looks like

At my firm, every AI initiative passes through the same evaluation architecture, and the architecture was built before most of the tools were. Four practices carry the weight.

Baseline before anything. You cannot measure improvement against a workflow you never measured. Before our largest redesign, we pulled activity data from Salesforce to quantify exactly how many hours were going into the meetings we suspected were low-value. That number became the baseline, and the baseline is what made the eventual result, more than $250,000 in annualized capacity returned, a fact rather than a claim. The discipline of capturing the baseline does something else as well: it forces the team to articulate, in numbers, what problem the tool is supposed to solve. A surprising number of proposed deployments do not survive that conversation, which is the cheapest possible place for them to die.

Define success before the pilot, not during it. Every pilot we run launches with defined success metrics and a defined evaluation window. The team agrees up front what result means scale, what result means iterate, and what result means kill. Pre-commitment is the entire point. It converts the end-of-pilot conversation from a negotiation between advocates and skeptics into a reading of results everyone already agreed to honor. Pilots end on schedule, in one of three states, and the organization learns that AI initiatives conclude.
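To make the idea of pre-commitment concrete, here is a minimal sketch in Python of what a written-down decision rule could look like. It is an illustration, not a description of any particular firm's tooling: the names PilotCriteria and decide, the example metric, and the threshold values are all hypothetical. The point it shows is that the thresholds are fixed before the pilot starts, so the end-of-pilot step is a lookup rather than a debate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: thresholds cannot be edited once the pilot begins
class PilotCriteria:
    """Hypothetical record of what the team agreed to before launch."""
    metric_name: str
    scale_at_or_above: float  # result at or above this means scale
    kill_below: float         # result below this means kill

    def __post_init__(self) -> None:
        # A kill threshold above the scale threshold would make the rule ambiguous.
        if self.kill_below > self.scale_at_or_above:
            raise ValueError("kill_below must not exceed scale_at_or_above")


def decide(criteria: PilotCriteria, observed: float) -> str:
    """Read the pilot result against the pre-agreed thresholds."""
    if observed >= criteria.scale_at_or_above:
        return "scale"
    if observed < criteria.kill_below:
        return "kill"
    return "iterate"  # everything between the two thresholds


# Illustrative values only; a real pilot would choose its own metric and numbers.
criteria = PilotCriteria(
    metric_name="hours_saved_per_user_per_week",
    scale_at_or_above=3.0,
    kill_below=1.0,
)
outcome = decide(criteria, observed=2.2)  # falls between the thresholds: "iterate"
```

Freezing the record is a deliberate choice in this sketch: it mirrors the rule that nobody renegotiates the threshold once results start coming in.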

A structured rubric for build versus buy. We maintain a formal evaluation rubric that has now been applied to more than ten platforms, scoring each against the same dimensions: fit to the specific workflow, integration cost, data handling, vendor trajectory, and the realistic internal alternative. The rubric matters less for any individual score than for what it prevents, which is decisions made on demo charisma. Every vendor demo is impressive. The rubric asks the questions the demo was designed to avoid.
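For readers who want to see the mechanics, a rubric like this can be reduced to a weighted score over fixed dimensions. The sketch below is one possible shape, under stated assumptions: the five dimension names are taken from the list above, but the 1 to 5 scale, the weights, the convention that a higher score always favors the platform, and the function name rubric_score are all hypothetical choices made for illustration.

```python
# Dimension names follow the essay; everything else here is an assumption.
DIMENSIONS = (
    "workflow_fit",
    "integration_cost",
    "data_handling",
    "vendor_trajectory",
    "internal_alternative",
)


def rubric_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 scores across every rubric dimension.

    Assumed convention: a higher score always favors the platform, so a
    high integration_cost score means integration is cheap, and a high
    internal_alternative score means the platform beats building in-house.
    """
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        # Refuse to score a platform that skipped a dimension.
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} score must be between 1 and 5")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Illustrative weights: fit to the specific workflow counts double.
weights = {d: 1.0 for d in DIMENSIONS}
weights["workflow_fit"] = 2.0

# Illustrative scores for a made-up platform.
example_scores = {d: 2 for d in DIMENSIONS}
example_scores["workflow_fit"] = 5
example_total = rubric_score(example_scores, weights)
```

The useful property is less the final number than the refusal to score a platform with a dimension left blank, which is the code equivalent of asking the questions the demo was designed to avoid.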

Feedback loops that are qualitative and quantitative at once. During every pilot we collect both the numbers and the narrative on a weekly cycle, because each catches what the other misses. The metrics catch slow degradation that enthusiastic users will not report. The conversations catch friction and workarounds that never show up in usage data until adoption has already collapsed. Evaluation is not a phase that precedes deployment. It is a loop that runs for the life of the system.

Evaluation cuts both ways

The case for evaluation discipline is usually framed as defense: stop wasting money on tools that do not work. That is half the value. The other half is offense, and it is the half leaders miss.

An organization that cannot measure AI performance does not just overfund its failures. It underfunds its successes, because a tool that is genuinely returning capacity has no way to prove it and no advocate with evidence in hand. Skepticism wins by default in unmeasured environments, and the cost of skepticism winning by default, in an environment where the technology is improving this fast, compounds every quarter. Some of the most expensive AI decisions of the next few years will be the deployments that were quietly working and got cut anyway, because nobody could show a number.

The question every executive should be able to answer about every AI system in their organization is the one this essay opened with: good enough for what, compared to what, according to whom, at what cost. Those are not data science questions. They are management questions, the same ones leaders have asked about every consequential investment for a century. The discipline does not slow down the organizations that answer them. They are moving faster than their competitors, because they know which bets are working, they kill the rest without ceremony, and every deployment that survives carries its own proof.