My routing config has changed three times this year. Not because I keep changing my mind. Because every time I ran a real benchmark, the results were different from what I assumed.
Here’s what I know after running GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and Grok 4 through the tasks that actually matter to me.
Nobody wins everything. That was true in 2024 and it’s still true now. What changed is the differentiation got cleaner. The gaps are in specific places, and they’re real.
The Four Contenders
GPT-5.4 is the coding model. Not by a massive margin — SWE-bench separates the top three by fractions of a percentage point — but the gap holds in practice. It also has the deepest Microsoft ecosystem integration, which matters more than it should for enterprise work.
Claude Opus 4.6 is the long-horizon task model. Best tool use, best behavior over extended reasoning chains, richest MCP ecosystem by a significant distance. If you’re building agents, this is where the infrastructure is.
Gemini 3.1 Pro is the reasoning model. 94.3% on GPQA-Diamond. Best ARC-AGI-2 score. And the only model with a context window that actually works at scale — 1M tokens with 76% retrieval accuracy. That last number is what matters. Token count without retrieval accuracy is a marketing number.
Grok 4 is the live data model. X firehose integration gives it real-time access that no other model has. Narrow use case. Absolute answer when you need it.
Benchmarks: What to Ignore and What to Watch
Most benchmark scores are noise. MMLU has been so saturated and gamed since 2024 that differences between top models are meaningless. If a comparison article leads with MMLU, stop reading.
The ones worth watching: SWE-bench for coding (real tasks, not trivia), GPQA-Diamond for scientific reasoning (hard enough that gaming it is expensive), and ARC-AGI-2 for reasoning generalization. Retrieval accuracy at long context matters more than context window size.
Everything else — including most of the numbers on the vendor’s own benchmark pages — assumes you’re running inference on something that looks exactly like their training distribution. You probably aren’t.
Coding
GPT-5.4 leads SWE-bench. But the margin between it and Claude is small enough that I don’t route purely on coding performance. I route on context and task type.
Short, isolated tasks: GPT-5.4. Long-horizon tasks where the model needs to maintain state across hundreds of tool calls: Claude. The difference isn’t raw code quality — it’s drift over long sessions. Claude drifts less.
I found this out the hard way. Built a code review agent on GPT-5.4 because of the SWE-bench number. After 15+ tool calls per session, it started producing subtly inconsistent analysis — same file reviewed differently on call 3 versus call 14. Switched to Claude. Problem gone.
SWE-bench doesn’t measure that.
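One way to catch this kind of drift yourself is to have the agent re-review the same file at several points in a session and compare the findings. The sketch below is a minimal illustration, not the check I actually ran: the `findings_by_call` structure, the finding labels, and the 0.8 threshold are all assumptions made up for the example.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two sets of findings: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drift_report(findings_by_call: dict[int, set[str]],
                 threshold: float = 0.8) -> list[tuple[int, float]]:
    """Compare every later review of one file with the earliest review.

    Returns (call_number, overlap) for each call whose overlap with the
    baseline falls below `threshold` (an assumed cutoff, tune it yourself).
    """
    calls = sorted(findings_by_call)
    baseline = findings_by_call[calls[0]]
    flagged = []
    for call in calls[1:]:
        overlap = jaccard(baseline, findings_by_call[call])
        if overlap < threshold:
            flagged.append((call, overlap))
    return flagged

# Hypothetical findings for the same file, reviewed on calls 3, 8 and 14.
session = {
    3: {"unused-import", "missing-null-check", "sql-injection"},
    8: {"unused-import", "missing-null-check", "sql-injection"},
    14: {"unused-import", "naming-style"},
}
report = drift_report(session)
print(report)
```

With this made-up data, call 14 shares only one of four distinct findings with call 3, so it is the only call flagged. Real findings would need normalizing first (same issue, different wording), which this sketch ignores.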
Reasoning and Science
Gemini 3.1 Pro, and it’s not close. 94.3% on GPQA-Diamond. Best ARC-AGI-2 score of the four models.
I don’t do a lot of PhD-level analytical work. But I use it for complex multi-step reasoning chains where one wrong step compounds into five. Gemini makes fewer of those errors.
If you’re doing anything in science, medicine, or complex policy analysis — this is your model.
Long Context
Gemini 3.1 Pro again. 1M token context with 76% retrieval accuracy at scale.
That second number is the one that matters. Every model claims a big context window. Retrieval accuracy tells you whether the model actually uses what’s in there. 76% at 1M tokens is the only number in this category that’s held up under real testing.
I ran a document analysis task on a 600-page corpus. Gemini found the relevant clause on page 347. Claude found a related clause but missed the specific one. GPT-5.4 hallucinated a clause that wasn’t in the document at all.
Context window claims are advertising. Retrieval accuracy is engineering.
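If you want to measure retrieval accuracy on your own documents rather than trust anyone's number, the usual shape is a needle-in-a-haystack test: bury a known fact at different depths and count how often the model returns it. The sketch below assumes a hypothetical `ask_model(document, question)` callable that you would wire to a real API. The filler text, vault codes, and the truncating stand-in model are invented for illustration only.

```python
import re
from typing import Callable

FILLER = "The sky was grey that day."

def build_haystack(needle: str, n_sentences: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)

def retrieval_accuracy(ask_model: Callable[[str, str], str],
                       depths: list[float],
                       n_sentences: int = 1000) -> float:
    """Fraction of trials in which the model's answer contains the buried code."""
    hits = 0
    for i, depth in enumerate(depths):
        secret = f"code-{i:04d}"
        needle = f"The access code for vault {i} is {secret}."
        document = build_haystack(needle, n_sentences, depth)
        answer = ask_model(document, f"What is the access code for vault {i}?")
        if secret in answer:
            hits += 1
    return hits / len(depths)

def truncating_model(document: str, question: str) -> str:
    """Stand-in for a real model: it only 'reads' the first half of the document."""
    visible = document[: len(document) // 2]
    match = re.search(r"code-\d{4}", visible)
    return match.group(0) if match else "unknown"

accuracy = retrieval_accuracy(truncating_model, depths=[0.1, 0.3, 0.7, 0.9])
print(f"retrieval accuracy: {accuracy:.0%}")
```

The stand-in finds the two needles in the first half and misses the two in the second, so it scores 50% despite "accepting" the whole document. That is the gap between context window size and retrieval accuracy in miniature.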
Multimodal
Gemini. Native multimodal processing — no external transcription or vision pipeline required. The other models route through external pipelines for audio and video, which adds latency, cost, and another failure point.
For image analysis and code: all four are competitive. For audio and video at scale: Gemini is the only option that makes architectural sense.
Real-Time and Live Data
Grok 4. Only model with X firehose integration — genuine real-time data access, not a web search add-on.
If your task requires data from the last 24 hours, Grok is the answer. Nothing else is close.
Pricing: The True Cost
Gemini 3.1 Pro: $2 input / $12 output per million tokens. Claude Opus 4.6: $15 input per million tokens. GPT-5.4: middle tier.
The math on routing is simple: use Gemini for volume, use Claude when long-horizon behavior or tool use quality justifies the premium, escalate to GPT-5.4 for isolated coding tasks.
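Running everything through Claude at $15 per million input tokens when 70% of your requests could be handled by Gemini at $2 is a decision that compounds. The gap is $13 per million tokens on the routable share. At 10M input tokens per month that is only about $91 a month, roughly $1,100 a year. But it scales linearly: at 10B tokens per month, the same gap is about $91,000 a month.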
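To sanity-check this kind of estimate at your own volume, a few lines of arithmetic are enough. The sketch below uses only the input prices quoted in this section and ignores output tokens, since no Claude output price is given here. The function name and the volumes in the loop are illustrative assumptions.

```python
# Input prices in USD per million tokens, as quoted in this article.
GEMINI_INPUT_PER_M = 2.00
CLAUDE_INPUT_PER_M = 15.00

def monthly_savings(tokens_per_month: float, routable_share: float) -> float:
    """USD saved per month by sending `routable_share` of input tokens
    to Gemini instead of Claude (input pricing only)."""
    routable_millions = tokens_per_month * routable_share / 1_000_000
    return routable_millions * (CLAUDE_INPUT_PER_M - GEMINI_INPUT_PER_M)

# Assumed monthly volumes: 10M, 1B and 10B input tokens, 70% routable.
for volume in (10_000_000, 1_000_000_000, 10_000_000_000):
    per_month = monthly_savings(volume, 0.70)
    print(f"{volume:>14,} tokens/month: "
          f"${per_month:,.0f}/month, ${per_month * 12:,.0f}/year")
```

The savings are linear in volume, so the routing decision barely matters at hobby scale and matters a great deal once you are into billions of tokens per month.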
I use Claude every day. But the teams spending the most on AI aren’t doing it because they only use expensive models.
Which Model for What
Production code review, long agentic tasks, MCP-native systems: Claude Opus 4.6.
Scientific reasoning, document analysis, long-context retrieval, high-volume inference: Gemini 3.1 Pro.
Short coding tasks, Microsoft enterprise integration: GPT-5.4.
Real-time data, anything requiring live feeds: Grok 4.
Google Workspace automation: Gemini — ecosystem fit matters more than benchmark differences at this tier.
The Routing Strategy
The most effective teams I know don’t pick a model. They build routing layers.
The logic: route tasks to the cheapest capable model, escalate only when quality signals degrade or complexity requires it. A triage layer classifies the task and sends it to the right endpoint.
In practice: Gemini handles classification and volume. Claude handles anything requiring sustained tool use or complex instruction following. GPT-5.4 handles isolated code generation. Grok handles anything time-sensitive.
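A triage layer like this can start out as a plain function. The sketch below is one possible shape under stated assumptions, not a description of any particular team's router: the endpoint strings are labels rather than real API model identifiers, the `Task` fields are hypothetical, and the 15-call threshold and 0.8 quality floor are placeholders you would tune.

```python
from dataclasses import dataclass

# Illustrative labels only; substitute the identifiers your providers expect.
GEMINI, CLAUDE, GPT, GROK = "gemini-3.1-pro", "claude-opus-4.6", "gpt-5.4", "grok-4"

TOOL_CALL_THRESHOLD = 15  # assumed cutoff for "sustained tool use"

@dataclass
class Task:
    kind: str                      # e.g. "classification", "code", "agent", "analysis"
    expected_tool_calls: int = 0
    needs_live_data: bool = False

def route(task: Task) -> str:
    """Send each task to the cheapest model expected to handle it."""
    if task.needs_live_data:
        return GROK                # time-sensitive work
    if task.kind == "agent" or task.expected_tool_calls >= TOOL_CALL_THRESHOLD:
        return CLAUDE              # sustained tool use, long-horizon tasks
    if task.kind == "code":
        return GPT                 # short, isolated code generation
    return GEMINI                  # classification and volume by default

# Where to go when the default tier underperforms (assumed single step up).
ESCALATION = {GEMINI: CLAUDE}

def escalate(current: str, quality_score: float, floor: float = 0.8) -> str:
    """Move up a tier only when a quality signal drops below `floor`."""
    if quality_score < floor:
        return ESCALATION.get(current, current)
    return current

print(route(Task(kind="classification")))
print(route(Task(kind="code", expected_tool_calls=40)))
print(escalate(GEMINI, quality_score=0.5))
```

The order of the checks is the design decision: live data and sustained tool use override task type, so a coding task with 40 expected tool calls goes to Claude rather than GPT-5.4. How you compute `quality_score` is left open here, and it is the hard part.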
Operational cost per output unit drops. Output quality at each tier goes up because you’re using the right model for each job, not asking a premium model to do things a cheaper model handles equally well.
The Bottom Line
Nobody wins everything in 2026. That’s not a hedged conclusion — it’s the design of the market. Each lab found something it’s genuinely better at.
The mistake is treating this as a horse race. You’re not betting on a winner. You’re building a stack.
If you’re serious about building AI systems that aren’t expensive and brittle — knowing which model to use when, how to route between them, how spec-level differences affect architectural decisions — that’s what Deep Stack is for. Each quarter: one layer of the stack, protocol-level depth, practical for teams actually building on top of it.