The Imagination Gap: Why Not All AI Models Are Doing the Same Job

Most people treat AI model selection as a brand preference. Gemini, ChatGPT, Claude — pick the one that feels right, stay consistent, move on.

That’s the wrong frame. And it leads to one of the most common and expensive mistakes I see SME founders make when they start building AI into their operations: using the wrong class of model for the wrong class of task.

There are two distinct capabilities at play. Most people only notice one of them.

What AI models actually do

The obvious capability is generation — producing text, summarising documents, formatting data, drafting emails. Every mainstream AI model does this, and across models the variance in output quality on these tasks is relatively small. A capable open-source model running on free infrastructure handles them fine.

The less obvious capability is what I call imaginative reasoning — the ability to work with what isn’t written. To read a system description and infer what’s implied. To construct scenarios that don’t exist in the source material. To identify failure modes that aren’t listed anywhere, because they emerge from the intersection of multiple conditions rather than from any single obvious trigger.

This is not the same thing as generation. It requires the model to mentally simulate a system, hold multiple states in tension simultaneously, and ask: what would break this in a way the author didn’t anticipate?

[Figure: decision matrix for routing tasks between open-source and frontier models, split by task type. Generation tasks such as formatting and extraction route to open-source; inference tasks such as edge case generation and failure mode analysis route to frontier models.]
Two capabilities, two cost structures. Most operational workload sits on the left.

Why the gap matters in practice

Take any reasonably complex business process — a client onboarding flow, a lead qualification sequence, a dynamic pricing logic. Ask an open-source model to generate test scenarios for it. You’ll get coverage of the happy path. You’ll get obvious input variations. You’ll get some standard error states.

What you won’t reliably get are the non-obvious cases. The scenario where a valid input sequence produces an unexpected state because of an undocumented dependency. The edge condition that only surfaces when two otherwise normal behaviours interact in a specific order. The failure that happens not because something went wrong, but because everything went right in the wrong sequence.

Frontier models surface these cases more consistently. Not because they’re “smarter” in some general sense, but because their training has developed a form of systematic imagination: the capacity to extend reasoning beyond the literal scope of the prompt.

Open-source models stay closer to the document. Frontier models reason past it.

The practical decision framework

This is not an argument for always using the most expensive model. It’s an argument for matching model capability to task requirement.

Use open-source models for:

  • High-volume, structured tasks where output pattern is predictable
  • Formatting, extraction, classification, and transformation
  • Any task where the input fully defines the expected output
  • Privacy-sensitive deployments where data cannot leave your infrastructure

Use frontier models for:

  • Tasks requiring inference beyond the source material
  • System design review, edge case generation, and failure mode analysis
  • Strategic synthesis across multiple inputs with implicit interdependencies
  • Any task where the quality of what the model doesn’t generate matters as much as what it does

The cost difference between these two categories is significant. Run open-source models at volume on self-hosted infrastructure such as NVIDIA NIM and the marginal cost per task approaches zero; frontier model API costs scale with usage. The businesses that get this right reserve frontier reasoning for the tasks that genuinely require it, and route everything else to cheaper infrastructure.
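The economics can be made concrete with a back-of-envelope calculation. All figures below are hypothetical assumptions for illustration — actual per-task costs depend on your models, volumes, and infrastructure:

```python
# Back-of-envelope monthly cost comparison.
# All per-task prices are ASSUMED figures for illustration only.
TASKS_PER_MONTH = 1_000_000
FRONTIER_COST_PER_TASK = 0.01       # assumed frontier API cost per task
OPEN_SOURCE_COST_PER_TASK = 0.0005  # assumed self-hosted compute cost per task

def monthly_cost(frontier_share: float) -> float:
    """Total monthly cost when `frontier_share` of tasks go to a frontier model
    and the remainder run on self-hosted open-source infrastructure."""
    frontier = TASKS_PER_MONTH * frontier_share * FRONTIER_COST_PER_TASK
    open_src = TASKS_PER_MONTH * (1 - frontier_share) * OPEN_SOURCE_COST_PER_TASK
    return frontier + open_src

print(monthly_cost(1.0))  # everything routed to the frontier model
print(monthly_cost(0.1))  # 10% frontier, 90% open-source
```

Under these assumed prices, routing 90% of volume to open-source cuts the bill by an order of magnitude relative to running everything through a frontier API — the gap the routing decision exists to capture.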

The mistake that’s easy to make

The temptation is to standardise. Pick one model, run everything through it, simplify operations.

The problem is that standardising on open-source saves money but introduces invisible quality gaps in tasks that require imagination. Standardising on frontier models is expensive and often unnecessary for the bulk of operational workload.

The right architecture is a task routing layer — a decision about which jobs go where, based on what the task actually requires. Not glamorous. Not a single vendor relationship. But it’s where the real efficiency gain is.
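In its simplest form, a routing layer is just an explicit mapping from task type to model tier. A minimal sketch — the category names and tier labels here are illustrative assumptions, not any vendor's API:

```python
# Minimal task-routing sketch. Task categories and tier names are
# illustrative assumptions; a real system would carry model identifiers,
# fallbacks, and cost tracking.
GENERATION_TASKS = {"formatting", "extraction", "classification", "transformation"}
INFERENCE_TASKS = {"edge_case_generation", "failure_mode_analysis",
                   "design_review", "strategic_synthesis"}

def route(task_type: str) -> str:
    """Return the model tier a task should be sent to."""
    if task_type in GENERATION_TASKS:
        return "open_source"  # high-volume work where input fully defines output
    if task_type in INFERENCE_TASKS:
        return "frontier"     # reasoning beyond the literal source material
    # Unknown task types default to the stronger tier: an invisible
    # quality gap is harder to detect than an inflated bill.
    return "frontier"
```

The design choice worth noting is the default: when a task doesn't match a known category, it falls through to the frontier tier, because a silent quality failure compounds while a cost overrun shows up on the next invoice.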

The bottom line

The imagination gap is real, and it’s the thing most model comparisons don’t test for. Benchmark results measure known tasks with defined correct answers. Business value comes from what the model does with ambiguity — and that’s a much harder thing to score on a leaderboard.

If you’re building AI into any process that involves complexity, edge cases, or systems with non-obvious interdependencies, understand which class of model you actually need before you commit to the infrastructure. The wrong choice isn’t just a cost problem. It’s a reliability problem that compounds quietly until something breaks in production.


Alan Law is founder of V8 Global and architect of Axia. Leadership Insight posts examine the structural decisions behind AI-native commercial systems. For the architecture conversation, start here.
