The Inference Imperative: Why Running AI Is Now Harder Than Building It
by Nirmal Ranganathan, CTO - Global Public Cloud, Rackspace Technology

AI success now depends on inference. Here’s why scaling, cost and architecture challenges are shifting from training to real-world AI operations.
Building AI is now table stakes. Running it reliably, at scale and at cost is the real challenge.
I've spent the past year in conversations with enterprise technology leaders across healthcare, financial services, manufacturing and the public sector, and one pattern kept repeating itself. Not the details, but the shape of the problem.
They'd evaluated the frontier models. They'd run the pilots. They'd shown impressive demos to the board. And then they'd hit a wall that no amount of model experimentation could fix: the wall between “it works in the demo” and “it works in production, every day, at a cost the business can sustain.”
The data confirms what these conversations suggested. McKinsey, in its report “The state of AI in 2025: Agents, innovation, and transformation,” found that 88% of organizations now use AI regularly in at least one business function.
HFS Research, in its report “New Study Reveals That Enterprises Should Embed AI at the Core to Reimagine Their Future,” was even more direct. It found that 83% of Global 2000 enterprises remain stuck in early-stage AI experimentation. Only 17% have integrated AI across their business operations.
That gap between adoption and operationalization is the defining infrastructure challenge of 2026. And it's an inference problem, not a training problem.
Why inference is overtaking training as the primary AI workload
For the past three years, the AI conversation has been dominated by training: bigger models, more parameters, larger compute clusters, higher benchmark scores. The infrastructure investments followed, with massive GPU clusters optimized for training throughput, measured in petaflops and weeks of continuous compute.
But training is episodic. You train a model over days or weeks, and then you're done until the next version. Inference is different. Inference is continuous. It happens every second your AI is live: every chatbot response, every fraud check, every clinical recommendation, every quality inspection on a production line. It never stops.
Gartner confirmed this shift with hard numbers in its press release, “Gartner Says Worldwide AI Spending Will Total $2.5 Trillion in 2026,” forecasting that 2026 is the crossover year, with inference spending overtaking training. It puts worldwide AI spending at $2.5 trillion this year, with over half captured by the infrastructure layer. Deloitte, in its report “TMT Predictions 2026: Compute Power and AI,” projects that inference will consume roughly two-thirds of all AI compute by year-end.
This marks a structural realignment in where AI spend goes and what it supports.
And the arrival of reasoning models has accelerated it dramatically. Jensen Huang put it plainly on NVIDIA's February 2025 earnings call: reasoning AI can require 100 times more compute per task than standard inference. Models like OpenAI's o-series, Anthropic's extended thinking and DeepSeek-R1 don't just return answers; they think, generating thousands of intermediate tokens before responding. That's a step change in capability. It's also a step change in infrastructure demand.
Layer agentic AI on top and the demand multiplies again. These aren't simple chatbots; they're autonomous systems chaining together multiple model calls, tool invocations and reasoning steps in a single interaction. A single agentic workflow can generate 20-30x as many tokens as a standard exchange. Every agent you deploy becomes an inference-demand multiplier, running around the clock.
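To make the multiplier concrete, here is a back-of-the-envelope sketch in Python, with entirely hypothetical token counts, of how chained agent steps inflate token volume: each step re-sends the accumulated context, so totals grow much faster than the step count.

```python
def agentic_token_volume(base_tokens: int, steps: int, growth_per_step: int) -> int:
    """Estimate total tokens for an agentic workflow that chains model calls.
    Each step processes the accumulated context, and its output grows the
    context fed into the next step. Purely illustrative numbers."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context + growth_per_step   # tokens consumed by this step
        context += growth_per_step           # output carried into the next prompt
    return total

# A single-turn exchange vs. a hypothetical 10-step agent chain
single = agentic_token_volume(500, 1, 0)     # 500 tokens
agent = agentic_token_volume(500, 10, 200)   # 16,000 tokens
print(round(agent / single))                 # 32x under these assumptions
```

The exact multiplier depends entirely on the assumed step count and context growth; the point is that context accumulation makes token volume grow quadratically with chain length, not linearly.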
The inference paradox: Why AI success drives non-linear cost growth
Here's what makes inference economics fundamentally different from anything enterprise technology leaders have managed before.
With traditional IT, scaling is largely linear: more users, proportionally more compute, reasonably predictable costs. With inference, scaling is non-linear and compounding. The more your product succeeds, the more users adopt it, the more use cases you deploy, the more agents you run, the more your inference bill grows, often unpredictably.
The FinOps Foundation, in its “State of FinOps 2026 Report,” quantifies the exposure: inference can account for 80 to 90% of total AI spend. Not training. Not data preparation. Inference. The same report shows how quickly enterprises have responded: 98% of organizations now manage AI spend, up from just 31% two years ago. AI cost governance went from edge case to universal practice in 24 months.
Forrester, in its report “Predictions 2026: Tech and Security,” captured the boardroom consequences. It predicts that enterprises will defer 25% of planned AI spend to 2027 because financial discipline was not in place. Fewer than one-third of decision-makers can tie AI's value back to their organization’s financial growth. Forrester described 2026 as the year AI “trades its tiara for a hard hat.” The experimentation phase is over. The accountability phase has begun.
What makes this harder than it looks is the asymmetry in how inference costs accumulate. Output tokens typically cost three to five times more than input tokens. That means prompt design, response verbosity and reasoning depth can dominate your cost, even if you never change the underlying model. A slightly more verbose system prompt, multiplied across millions of daily inferences, can materially shift your margin. This isn't a hyperscaler pricing quirk; it's the fundamental economics of memory-bandwidth-bound serving.
And here's the uncomfortable infrastructure reality underneath those economics: You can own the most expensive GPU accelerators on the market and still fail to monetize them if your serving architecture isn't optimized for how inference works. Recent systems research found cases where 99% of inference latency was consumed by memory transfers, with GPUs drawing only 28% of their rated power while serving requests. The bottleneck isn't compute. It's memory movement, KV cache management and serving stack design. Enterprise GPU utilization typically runs at just 15 to 30% of capacity. Architecture, not utilization, defines the ceiling.
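A first-order model shows why memory bandwidth, not FLOPs, sets the ceiling: during single-stream decoding, every generated token must stream the full weight set from memory, so tokens per second are bounded by bandwidth divided by model size. The figures below are illustrative; real serving stacks batch requests and cache KV state, which changes the picture.

```python
def decode_tokens_per_sec(params_billion: float,
                          bytes_per_param: float,
                          mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed: each generated token
    reads all weights once, so speed <= bandwidth / model size in memory."""
    model_size_gb = params_billion * bytes_per_param
    return mem_bandwidth_gb_s / model_size_gb

# Hypothetical accelerator with 900 GB/s of memory bandwidth, fp16 weights
print(decode_tokens_per_sec(7, 2, 900))    # ~64 tokens/sec for a 7B model
print(decode_tokens_per_sec(400, 2, 900))  # ~1.1 tokens/sec for a 400B model
```

In this regime the compute units sit idle waiting on memory, which is exactly why batching, quantization and KV cache management, not raw FLOPs, decide what a GPU actually delivers.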
Why smaller, task-specific models outperform large models in production
This is where the most important strategic insight lives for any enterprise planning AI infrastructure today.
Real enterprise AI depends on deploying the right model, precisely where your business operates.
Gartner, in its report on small, task-specific AI models, forecasts that by 2027, organizations will use them at three times the volume of general-purpose models.
NVIDIA's research team published a paper last year titled “Small Language Models are the Future of Agentic AI.” The company that builds the GPUs for the world's largest training runs is signaling that the future is small, specialized and distributed.
The economics drive this shift. A purpose-built 7-billion-parameter model, fine-tuned on your enterprise data and running close to where that data lives, will deliver faster responses, lower latency, better data privacy and significantly lower cost than routing every request to a 400-billion-parameter model in a distant data center.
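Under the same memory-bound serving assumption, cost per generated token scales roughly with model size, which is why the gap compounds. A crude sketch, with hypothetical GPU prices and decode speeds, makes the arithmetic visible:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_sec: float) -> float:
    """Serving cost per million generated tokens on dedicated hardware.
    Both inputs are hypothetical; real deployments batch many streams."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical: a 7B model decoding at 60 tok/s on a $2/hr GPU, vs. a
# 400B model at 1 tok/s across an $8/hr multi-GPU setup (single stream)
small = cost_per_million_tokens(2.0, 60)   # ~$9.26 per million tokens
large = cost_per_million_tokens(8.0, 1.0)  # ~$2,222 per million tokens
```

The absolute numbers are invented, but the shape is not: when the small model is good enough for the task, the per-token cost difference is multiplicative, not marginal.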
And we've reached a hardware inflection point that makes this practical. Forty TOPS (tera-operations-per-second) is becoming the mainstream NPU standard for enterprise devices in 2026. Computing capable of running meaningful inference is arriving in laptops, edge appliances and factory-floor controllers. For the first time, inference compute can sit directly next to the data.
This connects to a principle that will reshape enterprise infrastructure over the next three years: data gravity.
Your data has mass. It has location. It operates under regulatory constraints on where it can travel. Increasingly, the right architecture is not to move data to a model, but to move the model to the data.
Think about what this means. Clinical records sitting in a UK-based trust. Manufacturing telemetry streaming from a factory floor. Financial transaction data living under strict regulatory jurisdiction. You don't ship that data across the Atlantic to wherever the cheapest GPU happens to be. You bring a purpose-built, optimized model to the data.
This is the architecture behind the Rackspace and Uniphore partnership. Uniphore’s Business AI Cloud combines five layers — data ingestion, knowledge graph creation, model fine-tuning, agentic orchestration and an exclusive inferencing layer. It also supports deployment entirely within a customer’s network perimeter, including a Rackspace private cloud, a hyperscaler VPC or an on-premises data center.
Critically, Uniphore's automated fine-tuning studio makes purpose-built small language models accessible to organizations that don't have large data science teams. What once required months of expert-intensive work is now largely automated, enabling enterprises across financial services, healthcare and insurance to build domain-specific models fine-tuned on their own data without the overhead that previously made SLMs impractical. A Context-Aware Inference Optimization (CAIO) layer from Rackspace then routes each inference request to the appropriate compute tier based on latency requirements and cost. A real-time contact center agent can draw on high-performance GPU capacity, while a batch contract review runs on lower-cost infrastructure. The result is inference economics that work at enterprise scale.
Research published in January 2025, “Hybrid Edge-Cloud Architectures for AI Workloads,” quantified the impact, showing energy savings of up to 75% and cost reductions exceeding 80% versus pure cloud processing.
IDC, as cited in “Edge AI: The Future of AI Inference Is Smarter Local Compute” by InfoWorld, predicts that by 2027, 80% of CIOs will turn to edge services specifically to meet inference demands. Forrester, in its report “Predictions 2026: Prepare for AI Security and Integrated Network Infrastructure and Operations,” forecasts private AI factories reaching 20% adoption this year, with on-premises servers capturing 50% share.
The playbook shift is from cloud-first to inference-first. And inference-first means right-sized models, deployed close to the data, optimized for the specific task, at the latency and cost profile the business requires.
Why sovereign inference is becoming a core architectural requirement
For enterprises operating in regulated environments, particularly in the UK and Europe, inference introduces a dimension that training did not surface at the same intensity: sovereignty.
When your AI delivers a clinical recommendation, scores a credit application or flags a security threat, that inference decision needs to happen within your jurisdiction, under your governance and with full auditability. This reflects both regulatory expectations and the underlying architecture these use cases require.
McKinsey, in its report “The Sovereign AI Agenda: Moving from Ambition to Reality,” found that 71% of executives, investors and government officials now characterize sovereign AI as an existential concern or strategic imperative. They estimate sovereign AI could represent a $600 billion market by 2030, with up to 40% of AI workloads moving to sovereign environments.
Gartner predicts that by 2027, 35% of countries will be locked into region-specific AI platforms, up from just 5% today.
This reflects a capacity and supply chain reality. Only about 30 countries currently host in-country compute infrastructure capable of supporting advanced AI workloads. The UK is one of them, and it's investing aggressively to maintain that position.
According to “UK Government Announces Billions of Pounds of AI Investment Including Sovereign AI Unit,” published by Tech.eu, the UK’s AI Opportunities Action Plan has moved from aspiration to execution. Isambard-AI is operational at Bristol. Cambridge's DAWN supercomputer is being expanded sixfold. AI Growth Zones are being designated with accelerated planning and priority grid connections. Over £25 billion in private data center investment has been announced. A £500 million Sovereign AI Unit launches in April 2026. AI Pathfinder is deploying £150 million in GPU capacity as the first phase of an £18 billion sovereign infrastructure program.
But sovereign inference extends beyond data residency. It requires control across the full chain: where data lives, where compute runs, where models originate, where inference executes, where telemetry is captured and where operational governance applies. Organizations that own their inference stack build a durable competitive moat.
The convergence of efficient small language models, sovereign infrastructure and edge-ready deployment creates something that didn't exist two years ago: the ability to run production-grade AI inference locally, at viable economics, with governance built in from the start.
What enterprises need to prioritize to operationalize AI at scale
I've laid out the landscape. Let me close with what matters most for action.
First, build inference economics into your architecture decisions from day one. Treat model selection as both a technical and financial decision, measured in cost per inference and cost per business outcome. Implement optimization techniques — quantization, disaggregated serving and continuous batching — as production standards.
Second, design for distributed inference, not just centralized cloud. The assumption that everything runs in a remote cloud region is breaking under the weight of edge latency requirements, data gravity and cost. Build an orchestration layer that routes each inference request based on latency, cost, data jurisdiction and model capability. The orchestration layer becomes the center of your AI infrastructure.
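As a sketch of what such an orchestration layer might do — the tier names, prices and latencies below are hypothetical, not any vendor's catalog — routing reduces to picking the cheapest tier that satisfies both the latency budget and the data-residency constraint:

```python
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    max_latency_ms: int
    data_jurisdiction: str  # e.g. "UK", "EU", "US"
    task: str

# Hypothetical compute tiers; all figures are illustrative.
TIERS = [
    {"name": "edge-npu",    "latency_ms": 20,   "cost_per_1k_tok": 0.0002, "jurisdictions": {"UK"}},
    {"name": "private-gpu", "latency_ms": 80,   "cost_per_1k_tok": 0.0010, "jurisdictions": {"UK", "EU"}},
    {"name": "cloud-batch", "latency_ms": 5000, "cost_per_1k_tok": 0.0003, "jurisdictions": {"UK", "EU", "US"}},
]

def route(req: InferenceRequest) -> str:
    """Pick the cheapest tier meeting both the latency budget and
    the data-residency constraint; fail loudly if none qualifies."""
    eligible = [t for t in TIERS
                if t["latency_ms"] <= req.max_latency_ms
                and req.data_jurisdiction in t["jurisdictions"]]
    if not eligible:
        raise ValueError("no tier satisfies the request constraints")
    return min(eligible, key=lambda t: t["cost_per_1k_tok"])["name"]

print(route(InferenceRequest(100, "UK", "contact-center")))     # "edge-npu"
print(route(InferenceRequest(60000, "EU", "contract-review")))  # "cloud-batch"
```

A real-time contact-center request lands on the fast in-jurisdiction tier, while a batch contract review drops to the cheapest eligible tier — the same pattern described above, expressed as a policy rather than a hard-coded deployment choice.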
Third, treat sovereign inference as a strategic asset, not a regulatory obligation. Organizations that build this capability now — controlling where decisions are produced, logged and audited — will establish durable advantages. Those that treat it as a compliance checkbox will remain dependent on external infrastructure, jurisdictions and cost structures.
Why the inference economy is now defining enterprise AI
The AI conversation has evolved. The era of obsessing over model parameters and training benchmarks — what I’d call model mania — served its purpose. It proved what AI could do. But proving capability and operationalizing it are different challenges, requiring different infrastructure, economics and disciplines.
The enterprises that will capture AI's value over the next decade won't be the ones with access to the biggest models. They'll be the ones that can serve the right model, in the right place, at the right cost, under their governance, every day.
That's not hype. It's where the math and operational demands are pointing.
This post is adapted from my keynote, “The Inference Imperative: A New Playbook for AI-First Infrastructure,” at Tech Show London 2026.