By a Silicon Architect
For the last three years, the entire technology industry has been held hostage by a single company.
We all know the drill. If you wanted to train an AI model, you bought NVIDIA H100s. If you wanted to run an AI model, you rented NVIDIA H100s. You waited six months for delivery. You paid whatever price Jensen Huang asked for. You said “Thank You” for the privilege of handing over your venture capital.
NVIDIA wasn’t just a chip company; it was the oxygen supply for the AI boom. Its stock price became the heartbeat of the Nasdaq. Its software moat, CUDA, was considered impenetrable: a fortress of libraries and optimizations that no competitor could hope to breach.
But as we close out 2025, if you listen closely to the hum of the data centers, the sound is changing.
The era of “NVIDIA or Nothing” is ending. The Green Team’s monopoly is finally showing cracks, not because their chips are getting worse, but because the market is shifting under their feet. We are moving from the era of Training (where NVIDIA is God) to the era of Inference (where economics matter more than raw power).
And in this new era, the barbarians aren’t just at the gate; they are already inside the server rack. Here is why the “Jensen Tax” is about to be repealed by Google, Groq, and the hyperscalers themselves.
1. The “Training vs. Inference” Pivot
To understand why NVIDIA is vulnerable, you have to understand the lifecycle of an AI model.
Phase 1 is Training. This is when you teach GPT-5 to speak English. It requires massive clusters of GPUs talking to each other at light speed. It is a brute-force scientific problem. NVIDIA’s NVLink interconnect technology makes them the undisputed king here. If you are training a frontier model, you still buy NVIDIA.
Phase 2 is Inference. This is when you actually use the model. Every time you ask ChatGPT a question, that is an inference task.
In 2023, 90% of AI compute spend was Training.
In 2025, that ratio has flipped. 80% of spend is now Inference.
Training happens once. Inference happens billions of times a day, forever.
And here is the problem for NVIDIA: Using an H100 for inference is like using a Ferrari to deliver pizza. It works, but it’s wildly inefficient. It burns too much power. It costs too much money.
CFOs are waking up. They are looking at their cloud bills and realizing they are paying a “Training Premium” for an “Inference Task.” They don’t need the raw horsepower of a Blackwell B200 to serve a customer support chatbot. They need something cheaper, cooler, and more specialized.
2. The Sleeping Giant Wakes: Google TPU
The biggest crack in the NVIDIA narrative appeared earlier this year, not from a chip rival like AMD, but from a customer: Apple.
When Apple released its technical paper on “Apple Intelligence,” buried in the footnotes was a bombshell: its foundation models were trained not on NVIDIA GPUs, but on Google TPUs (Tensor Processing Units).
For years, Google largely reserved the TPU for its own internal use (Search, YouTube, Waymo). But with the Trillium (TPU v6) generation, Google has aggressively opened the doors to outsiders.
Why the TPU wins on Inference:
Architecture: GPUs (Graphics Processing Units) were originally designed for video games. They have a lot of baggage. TPUs were designed from day one specifically for matrix multiplication (the math of AI). They are leaner.
The System: You don’t rent a TPU; you rent a “Pod.” Google’s inter-chip interconnect is arguably better than NVIDIA’s networking at massive scale.
Cost: Because Google designs the chip and owns the data center, they can undercut NVIDIA’s margins. Renting a TPU v6 pod on Google Cloud is often 30-50% cheaper per token than the equivalent H100 cluster on Azure (rough math in the sketch after this list).
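Here is how that gap shows up on a bill, as a back-of-the-envelope sketch. Every hourly rate and throughput number below is a hypothetical placeholder, not a published Google Cloud or Azure price; the point is the shape of the calculation.

```python
# Back-of-the-envelope cost per token. Every hourly rate and throughput figure here
# is a hypothetical placeholder -- substitute your own cloud prices and measured numbers.
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens for a single accelerator instance."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Illustrative only: an H100-class instance vs. a TPU-class instance serving the same model.
gpu = cost_per_million_tokens(hourly_rate_usd=10.0, tokens_per_second=1500)  # batched serving
tpu = cost_per_million_tokens(hourly_rate_usd=5.0, tokens_per_second=1200)

print(f"GPU: ${gpu:.2f} per 1M tokens")
print(f"TPU: ${tpu:.2f} per 1M tokens")
print(f"TPU saving: {1 - tpu / gpu:.0%}")  # ~38% with these made-up inputs
```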
For a startup burning cash, that 50% discount isn’t a luxury; it’s survival. Midjourney, Character.ai, and Apple have all moved significant workloads to TPUs. The monopoly isn’t just leaking; it’s being siphoned off by Mountain View.
3. The Speed Demon: Groq and the LPU
If Google is winning on cost, Groq is winning on time.
Groq (not to be confused with Elon Musk’s Grok) is a hardware startup that took a radical approach. They threw out the GPU architecture entirely and built an LPU (Language Processing Unit).
NVIDIA GPUs rely on off-chip HBM (High Bandwidth Memory). That memory is fast, but it is still a bottleneck: the compute cores spend much of their time waiting for data to arrive from memory.
Groq put the memory directly on the chip (SRAM).
The Result: Deterministic, insane speed.
NVIDIA H100: Generates roughly 50–80 tokens per second (reading speed).
Groq LPU: Generates 500+ tokens per second (blitz speed).
Why does this matter? Voice and Agents.
If you are chatting with an AI Voice Assistant, you cannot tolerate a 2-second lag. It kills the conversation. You need an instant response. Groq is one of the very few chips fast enough to make AI feel like a real-time conversation.
Furthermore, for Agentic Workflows—where an AI has to “think” through 50 steps to solve a problem—speed is everything.
If an agent takes 10 seconds per step, a 50-step task takes 8 minutes. No user will wait for that.
On Groq, that same task takes roughly a minute.
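A quick sanity check on that arithmetic, as a sketch; the step count, tokens per step, and generation speeds are assumptions drawn from the figures above, not benchmarks.

```python
# Rough wall-clock math for a sequential agent. Step count, tokens per step, and
# generation speeds are assumptions taken from the figures above, not benchmarks.
def agent_wall_clock_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Seconds a user waits for a chain of generation steps run back to back."""
    return steps * tokens_per_step / tokens_per_second

for name, tps in [("H100-class GPU", 65), ("Groq LPU", 500)]:
    seconds = agent_wall_clock_seconds(steps=50, tokens_per_step=650, tokens_per_second=tps)
    print(f"{name}: {seconds / 60:.1f} minutes")  # ~8.3 vs ~1.1 minutes
```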
Groq isn’t trying to replace NVIDIA for training (their chips have far too little on-chip memory for that). But for serving the models? They have created a new category where NVIDIA’s architecture physically cannot compete.
4. The “CUDA Moat” is Drying Up
The strongest argument for NVIDIA has always been software. CUDA.
“You can’t leave NVIDIA,” the engineers said, “because all the code is written in CUDA.”
This was true in 2020. It is false in 2025.
The industry collectively decided that it hated being locked into one vendor. So Meta (Facebook) poured engineering muscle into PyTorch 2.0, and OpenAI built Triton.
These constitute a new “Middleware Layer.”
Developers today write code in PyTorch. PyTorch then “compiles” that code down to the hardware.
If you have an NVIDIA chip, it compiles to CUDA.
If you have a Google chip, it compiles to XLA.
If you have an AMD chip, it compiles to ROCm.
The abstraction layer has gotten so good that for 95% of developers, the underlying hardware is invisible.
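To make that concrete, here is a minimal sketch of backend-agnostic PyTorch. The torch_xla import and the “openxla” compile backend are assumptions about running on a TPU VM; the point is that the model code itself never names a vendor.

```python
import torch

# Pick whichever accelerator is present; the model code below stays the same.
# The torch_xla import and the "openxla" backend assume a TPU VM environment.
if torch.cuda.is_available():  # NVIDIA, or AMD via ROCm builds of PyTorch
    device, backend = torch.device("cuda"), "inductor"
else:
    try:
        import torch_xla.core.xla_model as xm  # Google TPU via PyTorch/XLA
        device, backend = xm.xla_device(), "openxla"
    except ImportError:
        device, backend = torch.device("cpu"), "inductor"

# Same model code regardless of vendor; torch.compile targets the matching backend.
model = torch.compile(torch.nn.Linear(4096, 4096).to(device), backend=backend)

x = torch.randn(8, 4096, device=device)
print(device, model(x).shape)
```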
I recently migrated a Llama 3 inference pipeline from an NVIDIA A100 cluster to a Google TPU v5e pod. It took two days. Five years ago, that would have been a six-month rewrite.
The “Vendor Lock-in” that justified NVIDIA’s 75% gross margins is evaporating. Software is eating hardware’s advantage.
5. The Hyperscaler Rebellion (AWS & Meta)
Finally, the biggest threat to NVIDIA comes from its biggest customers.
Amazon (AWS), Microsoft (Azure), and Meta are tired of paying the “Jensen Tax.” Every billion dollars they pay NVIDIA is a billion dollars of margin they lose.
So, they are building their own chips.
AWS: Has Trainium 2 and Inferentia 3. If you use Anthropic Claude on AWS, you are likely running on Amazon silicon, not NVIDIA. Amazon offers massive discounts to customers who switch.
Meta: Has the MTIA chip. Meta is deploying these by the millions to run their recommendation algorithms (Instagram Reels). That is a workload that used to go to GPUs. Now it stays in-house.
Microsoft: Has Maia.
These chips don’t need to be better than NVIDIA. They just need to be “Good Enough.”
If Amazon’s chip is 80% as fast as NVIDIA but costs Amazon 50% less to deploy, Amazon wins. And they will ruthlessly push their cloud customers toward their own silicon.
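The arithmetic behind “good enough” is worth spelling out; a tiny sketch using the same illustrative 80% and 50% figures from the paragraph above:

```python
# The "good enough" arithmetic, normalized against NVIDIA as the baseline.
# The 80% and 50% figures are the illustrative ones from the text, not vendor data.
nvidia_perf_per_dollar = 1.00 / 1.00    # baseline speed / baseline deployed cost
inhouse_perf_per_dollar = 0.80 / 0.50   # 80% of the speed at half the deployed cost

print(f"In-house silicon: {inhouse_perf_per_dollar / nvidia_perf_per_dollar:.1f}x "
      f"the performance per dollar")    # 1.6x
```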
6. The Verdict: NVIDIA Becomes the “Intel of AI”
Does this mean NVIDIA is doomed? Is the stock going to zero?
No. NVIDIA will remain the Ferrari.
For the absolute cutting edge—for training GPT-6, for scientific simulations, for the tasks where money is no object—NVIDIA will reign supreme. Their hardware engineering is still the best in the world.
But they are losing the “commodity” market. They are losing the “everyday inference” market.
We are watching NVIDIA transition from being the only player to being the premium player. They are becoming the Intel of the 90s: dominant, yes, but facing an AMD (Google) chipping away at the empire on cost and an ARM (Groq) chipping away on speed.
For the industry, this is great news.
Competition means lower prices. The cost of intelligence (tokens) will crash in 2026 as chip supply floods the market.
Diversity means resilience. We won’t be betting the entire industry on a single vendor’s supply chain.
The Moat is leaking. The water level is dropping. And for the first time in the AI era, it looks like we might actually be able to swim across.
Prediction for 2026: Watch for a major foundational model provider (maybe Mistral or Cohere) to announce a strategic partnership exclusively with Groq or Google TPU, marketing “Low Latency AI” as a differentiator. The “Powered by NVIDIA” badge is about to lose its shine.
