7 min read

Agent Systems Are Becoming Control Planes

AI Agents, RAG, Evaluation, LLM Serving, Platform Engineering, Developer Tools

Over the last three weeks I kept coming back to one pattern: the AI projects and papers that felt most useful were not promising a more magical chat box. They were building the control surfaces around agents: sandboxes, gateways, memory systems, retrieval planners, evaluation loops, cost monitors, and shared workspaces.

That pattern feels important because it changes what "AI engineering" means. The model is still central, but the quality of an agent product increasingly depends on the system around the model. Who is allowed to act? What context is trusted? Where do secrets live? How is retrieval evaluated? Can a trace be replayed? What happens when serving behavior changes under batching?

The more I read, the more the answer looked like normal systems engineering applied to a new kind of workload.

Agents need runtime boundaries

The first theme is execution. Once an agent can run code, touch files, or call tools, it becomes a workload with identity, permissions, state, logs, and failure modes.

That is why projects like kubernetes-sigs/agent-sandbox caught my attention. The idea is not just "run agent code in a container." It maps agent execution onto Kubernetes primitives: a sandbox custom resource, stable identity, persistent storage, templates, claims, and controllers. That framing makes agent execution visible to platform teams in a language they already understand.

Infisical Agent Vault points at a related boundary: credentials. If prompt injection is part of the threat model, handing raw API keys to an agent is a weak default. A credential broker can mediate outbound requests, enforce scopes, log access, rotate secrets, and keep privileged values away from generated code.

This is the practical lesson for me: an agent should not own more authority than the task needs. The safer architecture is to make every capability cross a boundary that can be inspected.
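
A minimal sketch of that boundary, assuming a hypothetical in-process broker. The names `CredentialBroker`, `call`, and `fake_api` are invented for illustration and are not the API of Agent Vault or any real product. The agent passes in a request function and gets back only the response; the secret itself never reaches agent-generated code, and every attempt is logged.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CredentialBroker:
    """Hypothetical broker: holds secrets and scopes, never hands a secret to the agent."""
    secrets: dict[str, str]        # credential name -> secret value
    scopes: dict[str, set[str]]    # agent id -> credential names it may use
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def call(self, agent_id: str, credential: str,
             request: Callable[[str], str]) -> str:
        allowed = credential in self.scopes.get(agent_id, set())
        # Log every attempt, including denied ones.
        self.audit_log.append(
            (agent_id, credential, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{agent_id} may not use {credential}")
        # The broker performs the outbound request; the agent sees only the result.
        return request(self.secrets[credential])


def fake_api(token: str) -> str:
    # Stand-in for a real HTTP call; it returns data, never the token.
    return "ok" if token == "s3cr3t" else "unauthorized"


broker = CredentialBroker(
    secrets={"billing_api": "s3cr3t"},
    scopes={"agent-1": {"billing_api"}},
)
result = broker.call("agent-1", "billing_api", fake_api)
```

A real broker would sit in a separate process or network hop so that generated code cannot read its memory; the in-process version only shows the shape of the check.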

Model calls are becoming production traffic

Another repeated pattern was the rise of AI gateways and data planes. A single prototype can call an LLM SDK directly. A product with multiple models, providers, guardrails, budgets, and fallback paths needs something more disciplined.

Bifrost, Plano, and ccx all sit in this space. They treat model calls as traffic that should be routed, observed, guarded, cached, retried, and measured. That feels similar to how backend systems evolved around databases, queues, and internal services. At first, every call is direct. Later, the product needs a layer that standardizes policy.

This also connects to cost. Tools like ccusage and CodexBar are small compared with full agent platforms, but they solve a real operational problem: developers need to know what their coding agents are spending, when limits reset, and which tools are consuming the most budget.

My takeaway is that "just call the model" is becoming the wrong abstraction for serious products. Model access should have the same engineering treatment as other production dependencies: routing, telemetry, limits, and failure behavior.
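
To make "routing, telemetry, limits, and failure behavior" concrete, here is a rough sketch of ordered fallback with one telemetry entry per attempt. It is not how Bifrost, Plano, or ccx work internally; `call_with_fallback` and the two toy providers are hypothetical stand-ins for real SDK calls.

```python
import time
from typing import Callable

Provider = Callable[[str], str]


def call_with_fallback(prompt: str,
                       routes: list[tuple[str, Provider]],
                       telemetry: list[dict]) -> str:
    """Try each route in order and record one telemetry entry per attempt."""
    for name, provider in routes:
        start = time.perf_counter()
        try:
            reply = provider(prompt)
        except Exception as exc:
            telemetry.append({"route": name, "ok": False, "error": repr(exc),
                              "seconds": time.perf_counter() - start})
            continue  # fall through to the next route
        telemetry.append({"route": name, "ok": True,
                          "seconds": time.perf_counter() - start})
        return reply
    raise RuntimeError("all routes failed")


def flaky_primary(prompt: str) -> str:
    # Simulates a provider outage.
    raise TimeoutError("primary timed out")


def stable_backup(prompt: str) -> str:
    return f"echo: {prompt}"


telemetry: list[dict] = []
reply = call_with_fallback(
    "hello",
    [("primary", flaky_primary), ("backup", stable_backup)],
    telemetry,
)
```

The useful property is that application code calls one function while the policy (order of routes, what gets logged) lives in one place and can change without touching callers.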

Retrieval is turning into query planning

RAG also looked less like a single vector search box and more like a planning problem.

RAISE frames RAG design as architecture search. Instead of guessing chunk sizes, retrieval depth, query rewriting, reranking, and compression settings, it treats those choices as a search space with budgets and evaluations. That is a useful mental model because RAG quality is often local to the task, corpus, and latency target. There may not be one globally good configuration.

OmniRetrieval pushes the idea further. Real systems have text corpora, SQL databases, RDF graphs, and property graphs. Flattening all of that into embeddings is convenient, but it throws away schemas and source-native operations. A stronger retrieval layer can choose the right backend, generate the right query, execute with permissions, and return evidence with provenance.

This maps closely to how I debug software. I do not only ask for semantically similar chunks. I grep exact strings, read neighboring files, inspect database rows, follow links, and check metadata. A good agent retrieval layer should expose multiple evidence paths, not pretend every source is the same.
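
A toy version of "multiple evidence paths" might look like the following. It assumes only two backends (a grep-like exact search and a SQL lookup) and a hand-written routing rule; the `Evidence` type, the `users` table, and the query shape are all made up for the example. Every hit carries its backend and source so the caller can show provenance.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Evidence:
    backend: str   # which evidence path produced the hit
    source: str    # file:line or table name, kept as provenance
    content: str


def exact_search(files: dict[str, str], needle: str) -> list[Evidence]:
    """grep-like path: return every line containing the literal string."""
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append(Evidence("exact", f"{name}:{lineno}", line.strip()))
    return hits


def sql_lookup(conn: sqlite3.Connection, user_id: int) -> list[Evidence]:
    """Structured path: a parameterized query against a known schema."""
    rows = conn.execute(
        "SELECT id, plan FROM users WHERE id = ?", (user_id,)).fetchall()
    return [Evidence("sql", "users", f"id={r[0]} plan={r[1]}") for r in rows]


def route(query: dict, files: dict[str, str],
          conn: sqlite3.Connection) -> list[Evidence]:
    # Toy routing rule: row lookups go to SQL, everything else to exact search.
    if query["kind"] == "user_row":
        return sql_lookup(conn, query["user_id"])
    return exact_search(files, query["text"])


# Toy data standing in for a repository and a database.
files = {"billing.py": "def charge(user):\n    raise PlanError('no plan')\n"}
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
conn.execute("INSERT INTO users VALUES (7, 'free')")

grep_hits = route({"kind": "text", "text": "PlanError"}, files, conn)
row_hits = route({"kind": "user_row", "user_id": 7}, files, conn)
```

In a real system the routing decision would come from a planner or a model rather than a `kind` field, but the return type is the part worth keeping: evidence that says where it came from.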

Memory has to become infrastructure

The word "memory" is easy to overuse in AI products. The recent tools made it feel more concrete.

supermemory, Engram, Rowboat, and agentmemory all point to memory as a data system, not just a prompt trick. Useful memory needs sources, freshness, deletion semantics, confidence, access control, and inspection.

That is also why I built boring-agent-memory as a deliberately small memory layer. It indexes trusted local files with SQLite FTS5/BM25, returns source-grounded snippets, and keeps the rule explicit: canonical files first, recall second, model memory last.
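
For readers who have not used FTS5, the core of such a layer fits in a few lines of Python's built-in `sqlite3` module. This is an illustrative sketch, not the actual boring-agent-memory code: the `notes` table, its columns, and the `recall` helper are hypothetical, and it assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# path is stored for provenance but not indexed; body is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(path UNINDEXED, body)")
conn.executemany(
    "INSERT INTO notes (path, body) VALUES (?, ?)",
    [
        ("docs/secrets.md",
         "Rotate API keys every 90 days and store them in the vault."),
        ("docs/deploy.md",
         "Deploys run from the main branch after review."),
    ],
)


def recall(conn: sqlite3.Connection, query: str, limit: int = 3) -> list[tuple]:
    """Return (path, snippet, score) rows, best match first.

    bm25() returns lower values for better matches, so ascending order
    puts the most relevant row first. The query string uses FTS5 syntax,
    so untrusted input would need quoting before reaching MATCH.
    """
    return conn.execute(
        "SELECT path, snippet(notes, 1, '[', ']', '...', 8), bm25(notes) "
        "FROM notes WHERE notes MATCH ? "
        "ORDER BY bm25(notes) LIMIT ?",
        (query, limit),
    ).fetchall()


hits = recall(conn, "rotate")
```

Each hit comes back with the file path it was read from, which is what makes the snippet source-grounded rather than a free-floating memory.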

The papers made the same point from another angle. Work like LongMemEval-V2, SkillOps, and SkillsVote treats long-running agent behavior as something that needs maintenance and governance. A skill library can accumulate technical debt. A memory system can preserve stale or contradictory context. A useful agent should have ways to prune, merge, attribute, and evaluate what it remembers.

For my own projects, this means I would rather show a small memory layer with visible sources than claim broad "long-term memory" in vague terms. If the user cannot inspect or correct memory, it is not a reliable product feature yet.

Evaluation should grade the trace, not just the answer

Evaluation was another strong thread. The basic pass/fail question is not enough for agent systems because an agent can get the right final answer while wasting many tool calls, citing weak evidence, or mutating state through a path that should not have been allowed.

RedundancyBench is useful because it asks whether successful agent steps were necessary. That turns waste into something measurable. Repeated tool calls, irrelevant exploration, and duplicated reasoning become trace quality problems, not just vibes.
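
As a crude illustration of making waste measurable, the sketch below flags tool calls that repeat an earlier call with identical arguments. This is not RedundancyBench's metric; it is a simple proxy, and the trace format (`tool` and `args` keys) is assumed for the example. A repeated call can be legitimate when state changed in between, so the output is a list of candidates to review, not a verdict.

```python
import json


def redundant_calls(trace: list[dict]) -> list[dict]:
    """Return steps that exactly repeat an earlier (tool, args) pair."""
    seen: set[tuple[str, str]] = set()
    repeats = []
    for step in trace:
        # Serialize args with sorted keys so key order does not matter.
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        if key in seen:
            repeats.append(step)
        else:
            seen.add(key)
    return repeats


trace = [
    {"tool": "read_file", "args": {"path": "app.py"}},
    {"tool": "grep", "args": {"pattern": "TODO", "path": "."}},
    {"tool": "read_file", "args": {"path": "app.py"}},           # exact repeat
    {"tool": "grep", "args": {"path": ".", "pattern": "TODO"}},  # same args, reordered
]
repeats = redundant_calls(trace)
redundancy_ratio = len(repeats) / len(trace)
```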

CiteVQA focuses on attribution. For document intelligence, an answer is only trustworthy if the cited evidence actually supports it. That lesson applies beyond PDFs. A research agent, recruiting agent, or code review agent should preserve the evidence path behind each claim.
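
The weakest useful form of that check is verbatim containment: does the quoted evidence actually appear in the cited source? The sketch below assumes a hypothetical claim format with `source_id` and `quote` fields. It is far weaker than what CiteVQA evaluates, since a quote can appear in a source and still fail to support the claim, but it catches citations that point at nothing.

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so line wrapping does not cause misses.
    return " ".join(text.lower().split())


def unsupported_citations(claims: list[dict], sources: dict[str, str]) -> list[dict]:
    """Return claims whose quote is empty or not found verbatim in the cited source."""
    bad = []
    for claim in claims:
        source_text = normalize(sources.get(claim["source_id"], ""))
        quote = normalize(claim["quote"])
        if not quote or quote not in source_text:
            bad.append(claim)
    return bad


sources = {"handbook": "Refunds are issued within\n14 days of purchase."}
claims = [
    {"text": "Refunds take two weeks.", "source_id": "handbook",
     "quote": "within 14 days of purchase"},
    {"text": "Refunds are instant.", "source_id": "handbook",
     "quote": "refunds are instant"},
    {"text": "Shipping is free.", "source_id": "missing-doc",
     "quote": "free shipping"},
]
flagged = unsupported_citations(claims, sources)
```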

GLIDE adds a statistical angle: evaluation should include uncertainty and annotation-cost tradeoffs, not only point estimates from an LLM judge. CLEAR adds a structural one, evaluating agents at the system, trace, and node levels.

The lesson I keep returning to is simple: agent observability is product quality. I want traces that show tool calls, observations, retries, citations, cost, latency, and state changes. Without that, it is hard to know whether the model succeeded or merely got lucky.

Serving details still shape product behavior

It is tempting to think serving is just infrastructure below the product layer. The recent papers pushed against that.

MarginGate studies batch-invariant inference and shows why deterministic decoding is not only a sampling setting. Precision, batching, kernels, and verification can affect whether the same request produces the same output. If a product depends on replayable outputs for code generation, grading, or compliance, serving behavior becomes part of the correctness contract.
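
One general reason this happens, independent of any particular paper, is that floating-point addition is not associative. If batching or a different kernel changes the order in which partial sums are reduced, the result can differ in the last bits. The snippet is a plain-Python illustration of that arithmetic fact, not a reproduction of MarginGate's analysis.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three numbers, two reduction orders.
left = (a + b) + c
right = a + (b + c)

orders_agree = left == right   # False: the two sums differ in the last bit
difference = abs(left - right)
```

The difference here is tiny, but when two candidate tokens have nearly equal scores, a last-bit change can be enough to flip which one greedy decoding picks, and every later token then diverges.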

Other work around continuous batching, KV cache compression, and managed adapter serving points in the same direction. LLM serving is not only about tokens per second. It is about latency, cache reuse, determinism, adapter rollout, memory transfer, and concurrency behavior. For agents, those details matter even more because a single user task can become many model turns and tool calls.

That changes my checklist. A serious LLM feature should log model version, prompt shape, decoding settings, cache behavior, provider route, and enough serving metadata to debug regressions. Otherwise the application may look non-deterministic even when the product code did not change.
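
One way to act on that checklist is to attach a small, hashable record to every model call. The field names below are assumptions chosen for the example, not a standard schema; "prompt shape" is approximated by a hash and a token count so the prompt text itself need not be stored.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CallRecord:
    model_version: str
    provider_route: str
    decoding: dict        # e.g. temperature, top_p, max_tokens
    prompt_sha256: str
    prompt_tokens: int
    cache_hit: bool


def fingerprint(record: CallRecord) -> str:
    """Short stable hash of the record, useful for grouping calls in logs."""
    payload = json.dumps(asdict(record), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


prompt = "Summarize the incident report."
base = CallRecord(
    model_version="model-2024-06",   # placeholder version string
    provider_route="primary",
    decoding={"temperature": 0.0, "max_tokens": 256},
    prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    prompt_tokens=6,                 # placeholder count, not a real tokenizer result
    cache_hit=False,
)
rerouted = CallRecord(**{**asdict(base), "provider_route": "backup"})

same_config = fingerprint(base) == fingerprint(CallRecord(**asdict(base)))
route_changed = fingerprint(base) != fingerprint(rerouted)
```

When an output regresses, comparing fingerprints across the good and bad runs quickly shows whether anything on the serving side changed, even if the product code did not.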

What I would build next

If I turned these three weeks of notes into one portfolio project, I would build a small agent control plane rather than another chat interface.

  1. A sandboxed execution layer where each run has scoped filesystem access, clear credentials, and a visible audit log.
  2. A retrieval router that can choose between exact search, vector search, SQL, and document metadata, then return provenance with the result.
  3. A memory store with source attribution, confidence, freshness, deletion, and a small UI for correction.
  4. A trace viewer that shows tool calls, observations, citations, latency, token cost, and state mutations.
  5. An evaluation harness that checks not only final answers but also evidence support, redundant steps, and mutation safety.

That project would be much more convincing to me than a demo that only shows a model completing a task once. The point is not to make the agent look autonomous. The point is to make autonomy inspectable.

My takeaway

The strongest idea from this set of notes is that agents are becoming systems with control planes.

A reliable agent needs a workspace, but the workspace needs permissions. It needs memory, but memory needs governance. It needs retrieval, but retrieval needs source-aware planning. It needs tools, but tools need credential boundaries. It needs evaluation, but evaluation needs trace-level evidence. It needs fast inference, but serving still has determinism and latency contracts.

That is the kind of engineering I want to get better at: not just prompting a model until it works once, but building the operating surface that makes the system measurable, reviewable, and reliable enough to use.