The word “agent” has been doing a lot of work this year. It’s been applied to chatbots with memory, to automation pipelines that call one API, to fully autonomous systems that run for hours without human input. The inflation is severe enough that the word has nearly stopped being useful. But the underlying phenomenon it points to — AI systems that plan, act, observe, and iterate — is real and becoming more capable quickly. The problem is separating what’s real from what’s marketing.
Here’s a useful distinction: AI that assists versus AI that acts. Assistance is what most current “AI features” provide — drafting, summarizing, suggesting, generating. It’s useful. It saves time. But the human still makes every decision and takes every action. Agency is when the AI takes the actions itself — calls the API, writes the file, sends the message, executes the code. The loop runs without requiring human approval at each step.
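The loop in question can be sketched in a few lines. This is a minimal illustration of the plan–act–observe cycle, not any particular framework's API: the model and tool calls are stubs, and the bounded step count is the only safety rail shown.

```python
# Minimal sketch of agency vs. assistance: a loop that plans, acts,
# observes, and iterates without per-step human approval.
from dataclasses import dataclass

@dataclass
class Step:
    action: str    # e.g. "call_api", "done"
    argument: str

def plan_next_step(goal: str, observations: list) -> Step:
    # Stub standing in for an LLM call: finish after two observations.
    if len(observations) >= 2:
        return Step("done", "")
    return Step("call_api", f"{goal}/page/{len(observations)}")

def execute(step: Step) -> str:
    # Stub tool execution: a real agent would hit an API or filesystem here.
    return f"result of {step.action}({step.argument})"

def run_agent(goal: str, max_steps: int = 10) -> list:
    observations = []
    for _ in range(max_steps):                 # bounded loop: a basic safety rail
        step = plan_next_step(goal, observations)
        if step.action == "done":
            break
        observations.append(execute(step))     # the agent acts; no human in the loop
    return observations

print(run_agent("fetch-report"))
```

An assistant stops after the first `plan_next_step` and hands the suggestion to a human; an agent runs the whole loop. That one structural difference drives everything below.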
Most organizations are using the first and calling it the second. That’s not a criticism — assistance is genuinely valuable and significantly easier to deploy reliably. But organizations that believe they’ve deployed agents when they’ve deployed assistance are making planning errors. The capabilities are different. The risk profiles are different. The infrastructure requirements are different. The failure modes are very different.
Three observations from this week:
The production deployment gap is real and widening. There’s a growing divergence between organizations that have agents running in production — actually taking actions, actually affecting systems — and organizations that have pilots, proofs of concept, and demos. The gap is mostly not about model capability. It’s about evaluation infrastructure, error handling, and the organizational willingness to define the precise scope of what the agent is allowed to do. Narrow scope with clear permissions deploys. Broad scope with vague permissions gets stuck in review.
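“Narrow scope with clear permissions” usually cashes out as an explicit tool allowlist checked before every action. The sketch below is illustrative — the tool names, budgets, and `PermissionError` policy are assumptions, not any specific product's API:

```python
# Hypothetical scoped executor: the agent may request any tool, but the
# executor only runs tools on an explicit allowlist, with call budgets.
ALLOWED_TOOLS = {
    "read_ticket": {"max_calls": 50},
    "post_comment": {"max_calls": 10},
    # deliberately absent: "close_ticket", "refund_customer"
}

class ScopedExecutor:
    def __init__(self, allowlist: dict):
        self.allowlist = allowlist
        self.calls = {}

    def run(self, tool: str, payload: str) -> str:
        if tool not in self.allowlist:
            raise PermissionError(f"tool {tool!r} is outside the agent's scope")
        used = self.calls.get(tool, 0)
        if used >= self.allowlist[tool]["max_calls"]:
            raise PermissionError(f"call budget exhausted for {tool!r}")
        self.calls[tool] = used + 1
        return f"{tool} ok"   # a real executor would dispatch to the tool here

ex = ScopedExecutor(ALLOWED_TOOLS)
print(ex.run("read_ticket", "TICKET-1"))    # permitted
try:
    ex.run("refund_customer", "TICKET-1")   # out of scope: rejected
except PermissionError as e:
    print(e)
```

A scope this explicit is what gets through review: the reviewer can read the allowlist and know exactly what the agent cannot do.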
Observability is the bottleneck nobody talks about. You cannot operate agents at scale without trace-level visibility into what they did, why, and what went wrong when they failed. The teams that have figured this out are using LangSmith, Langfuse, or custom instrumentation — and they treat observability as infrastructure, not an afterthought. The teams that haven’t figured this out are debugging production failures through guesswork and log files. The tooling exists. The practice of treating agent traces as first-class operational data is still early.
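“Trace-level visibility” means recording every step's input, output, error, and latency so a failure can be replayed rather than guessed at. A toy version of the idea, as custom instrumentation — tools like LangSmith and Langfuse provide far richer implementations of the same pattern:

```python
# Toy tracer: wraps each agent step and records a span with input,
# output, error, and latency, so traces become queryable data.
import json
import time

class Tracer:
    def __init__(self):
        self.spans = []

    def record(self, name, fn, *args):
        start = time.monotonic()
        span = {"name": name, "input": list(args), "output": None, "error": None}
        try:
            span["output"] = fn(*args)
            return span["output"]
        except Exception as e:
            span["error"] = repr(e)
            raise
        finally:
            span["ms"] = round((time.monotonic() - start) * 1000, 2)
            self.spans.append(span)       # span is kept even when the step fails

def flaky_tool(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2

tracer = Tracer()
tracer.record("double", flaky_tool, 21)
try:
    tracer.record("double", flaky_tool, -1)
except ValueError:
    pass
print(json.dumps(tracer.spans, indent=2))  # the full trace, failure included
```

The point is the `finally` block: the failing step still produces a span. Debugging from that record beats debugging from guesswork and log files.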
The cost curves are changing the calculus. Eighteen months ago, running complex agents continuously was cost-prohibitive for most tasks. The combination of more capable smaller models (Haiku, GPT-3.5 successors, Gemini Flash) and better routing — using smaller models for most steps and only escalating to frontier models for genuinely complex reasoning — has made continuous agent operation economically viable for a much wider set of use cases. The teams that built cost models assuming frontier model prices for every token are revisiting those models.
The technical guide to how agentic architectures actually work — what breaks in production, what the 2026 stack looks like — is in the Research section this week. Worth the read if you’re making infrastructure decisions. AI Agents Decoded.
Back next week.
