What I Learned This Week About Building Reliable AI Agents
This week I kept coming back to one idea: the interesting part of AI engineering is moving away from the single model call. The strongest tools and papers I read were not just asking whether a model can write code, answer a question, or control a browser. They were asking what has to exist around the model so the system can be trusted.
That surrounding system is starting to look like a real engineering stack. It has memory, retrieval, specs, runtime visibility, safe interfaces, review gates, evaluation harnesses, and serving constraints. The model matters, but the product quality comes from the loop around it.
Here are the points from this week that feel most worth carrying forward.
Memory needs curation, not just storage
A longer prompt is not a memory system. It is just a bigger bag of context.
The projects that stood out to me treated memory as something with structure. agentmemory frames memory as a shared service for coding agents, with concepts like confidence, attribution, lifecycle management, hybrid search, and deduplication. Rowboat takes a more user-facing angle: it builds a knowledge graph from work artifacts and keeps notes editable as Markdown.
The paper SkillOS made the same point from the research side. Memory grows, but usefulness does not automatically grow with it. Skills need names, scopes, examples, failure cases, and a way to be pruned or merged when they stop helping.
That is a practical lesson for any agent project I build. If an assistant remembers everything, it will eventually remember noise. A useful memory layer needs curation. It should answer basic questions: where did this fact come from, when should it expire, who can edit it, and why did the agent retrieve it for this task?
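To make that concrete, here is a minimal sketch of what a curated memory record could carry. This is my own illustration, not agentmemory's or SkillOS's actual schema; every name below is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class MemoryRecord:
    """One curated fact, with enough metadata to audit, attribute, or expire it."""
    content: str
    source: str                    # where the fact came from (file, URL, chat turn)
    author: str                    # who or what wrote it, so edits are attributable
    created_at: datetime = field(default_factory=datetime.utcnow)
    ttl: timedelta | None = None   # None means the fact never expires on its own
    confidence: float = 1.0        # lowered when the fact is later contradicted

    def is_expired(self, now: datetime | None = None) -> bool:
        if self.ttl is None:
            return False
        return (now or datetime.utcnow()) > self.created_at + self.ttl
```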
For a portfolio project, I would rather show a small, inspectable memory system than claim the agent has "long-term memory" in a vague way. A Markdown-backed skill repository, a retrieval index, and a simple UI for inspecting saved context would be more convincing than a hidden database full of chat logs.
Retrieval is an interface, not a magic search box
RAG is often explained as "embed documents, retrieve top-k chunks, put them in the prompt." That is useful, but this week made me think that interface is too narrow for many agent tasks.
The paper Beyond Semantic Similarity argues for direct corpus interaction. Instead of only giving the agent a similarity API, expose tools like exact search, file reads, metadata filters, local context expansion, and command logs. That maps well to codebases and technical docs. When I debug a system, I do not only ask for the five semantically closest paragraphs. I grep error strings, follow filenames, inspect nearby code, and check exact symbols.
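As a rough sketch of what "direct corpus interaction" could mean in practice, here are two hypothetical tools an agent might get alongside its embedding index. The function names and signatures are mine, not the paper's API; the example assumes a Python codebase as the corpus.

```python
import re
from pathlib import Path

def exact_search(root: str, pattern: str, max_hits: int = 20) -> list[tuple[str, int, str]]:
    """Grep-style exact search: return (path, line_number, line) for regex hits."""
    hits = []
    rx = re.compile(pattern)
    for path in Path(root).rglob("*.py"):  # assumes a Python corpus for the sketch
        for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if rx.search(line):
                hits.append((str(path), i, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits

def expand_context(path: str, line: int, radius: int = 10) -> str:
    """Local context expansion: the lines surrounding a hit, like grep -C."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    lo, hi = max(0, line - 1 - radius), min(len(lines), line + radius)
    return "\n".join(lines[lo:hi])
```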
SIRA, along with its GitHub project, approaches retrieval from another angle. It tries to compress multi-round exploratory search into one stronger corpus-aware retrieval action. The lesson is not that dense retrieval is bad. The lesson is that retrieval should have a budget. More agent steps can add cost and latency without making the answer more reliable.
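Enforcing that budget can be almost trivially simple. A minimal sketch with hypothetical names, assuming the agent routes every retrieval call through one shared counter:

```python
class RetrievalBudget:
    """Caps how many retrieval calls one task may spend before it must answer."""
    def __init__(self, max_calls: int = 8):
        self.max_calls = max_calls
        self.used = 0

    def charge(self) -> None:
        self.used += 1
        if self.used > self.max_calls:
            raise RuntimeError(
                f"retrieval budget exhausted ({self.max_calls} calls); "
                "answer with current evidence or escalate to a human"
            )
```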
The same theme appeared in developer tooling. CodeGraph pre-indexes repository structure so coding agents do not waste many tool calls rediscovering call graphs, routes, and symbols. A good index can reduce both cost and confusion.
My takeaway: retrieval quality is partly a data-structure problem. A strong agent should have multiple ways to find evidence, and the trace should be reproducible enough that a human can debug a bad answer. If the agent gives the wrong response, I want to see what it searched, what it ignored, and why a specific document was returned.
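One cheap way to get that reproducibility is an append-only trace that a human (or a dashboard) can replay later. A minimal sketch, with names of my own invention:

```python
import json
import time

class RetrievalTrace:
    """Append-only log of what the agent searched, what it kept, and why."""
    def __init__(self, path: str = "trace.jsonl"):
        self.path = path

    def log(self, query: str, hits: list[str], used: list[str], reason: str) -> None:
        event = {
            "ts": time.time(),
            "query": query,                                  # what was searched
            "hits": hits,                                    # everything the index returned
            "used": used,                                    # what actually entered the prompt
            "ignored": [h for h in hits if h not in used],   # what was seen and dropped
            "reason": reason,                                # why these documents were kept
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
```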
Evaluation should be part of the product, not a screenshot in the README
The most credible agent systems this week had evaluators.
Google DeepMind's AlphaEvolve is a good example. The impressive part is not only that a Gemini-powered agent proposes algorithms or code changes. The important part is the closed loop: generate candidates, test them against a measurable objective, and keep the improvements that survive validation. That shape is much stronger than a generic coding-agent demo because the agent is constrained by a score.
The paper When No Benchmark Exists pushed this further. It argues that safety scores are only meaningful under a fixed evaluation contract: scenario pack, rubric, judge, sampling settings, rerun budget, and the decision the score is allowed to support. I like that framing because it turns evaluation into something versioned and reviewable, not just a number.
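Treating the contract as versioned data makes this tangible. Here is a hypothetical sketch of what pinning those pieces could look like; the field names and values are mine, not the paper's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalContract:
    """A fixed, reviewable evaluation contract; scores without one are not comparable."""
    scenario_pack: str   # e.g. a versioned file of test scenarios
    rubric: str          # the scoring rubric, also versioned
    judge_model: str     # which model (and version) scores the outputs
    temperature: float   # sampling settings pinned, not left to defaults
    reruns: int          # how many times each scenario is rerun
    decision: str        # the only decision this score is allowed to support

# Hypothetical usage: the contract is committed next to the results it produced.
contract = EvalContract(
    scenario_pack="safety-scenarios-v3.jsonl",
    rubric="rubric-v2.md",
    judge_model="judge-2026-01",
    temperature=0.0,
    reruns=3,
    decision="gate release of the browsing tool",
)
```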
Citation evaluation is another place where the gap between polish and reliability is obvious. Cited but Not Verified parses Markdown reports, retrieves cited sources, and checks whether each claim is actually supported. That is the kind of test a deep research agent should have by default. A report with citations can still be wrong; the links only matter if the cited content supports the claim.
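A deliberately crude sketch of that check, assuming Markdown links stand in for citations. This is not the paper's implementation; a real verifier would judge claim support with an entailment model rather than keyword overlap.

```python
import re

MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")

def extract_citations(report_md: str) -> list[tuple[str, str]]:
    """Pull (link_text, url) pairs from a Markdown report; in a real system
    the claim would be the surrounding sentence, not just the link text."""
    return [(m.group(1), m.group(2)) for m in MD_LINK.finditer(report_md)]

def verify_citation(claim: str, source_text: str) -> bool:
    """Crude support check: does the fetched source contain the claim's key terms?"""
    terms = {w.lower() for w in re.findall(r"\w{5,}", claim)}
    found = {w.lower() for w in re.findall(r"\w{5,}", source_text)}
    return bool(terms) and len(terms & found) / len(terms) >= 0.5
```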
This changes how I think about AI projects for my own homepage. A stronger project is not just "I built an agent that writes reports." A stronger project is "I built an agent that writes reports, stores its evidence, verifies citations, records failure cases, and exposes evaluation runs." The second version shows engineering judgment.
Agent interfaces need security boundaries
As agents move from text responses into real UI and desktop control, interface design becomes a safety problem.
Google A2UI stood out because it does not let the agent ship arbitrary frontend code to the client. The agent sends declarative JSON that maps to trusted components. That idea feels simple and important: agent-generated interfaces should be expressive, but they should still be data crossing a boundary, not code with unlimited power.
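A toy version of that boundary, with a made-up component allowlist rather than A2UI's actual schema:

```python
# The agent emits JSON; the client renders only components it already trusts.
ALLOWED_COMPONENTS = {"text", "button", "table", "confirm_dialog"}

def render(spec: dict) -> None:
    kind = spec.get("component")
    if kind not in ALLOWED_COMPONENTS:
        raise ValueError(f"untrusted component: {kind!r}")  # data, not code
    # ...dispatch to a trusted renderer; the agent never ships executable UI
    print(f"rendering trusted <{kind}> with props {spec.get('props', {})}")

render({"component": "confirm_dialog",
        "props": {"title": "Delete 14 files?", "actions": ["cancel", "confirm"]}})
```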
UI-TARS Desktop points in the opposite direction: agents operating through a real computer interface. That is powerful because the desktop is where real work happens. It is also risky. A system that can see the screen, click buttons, use a browser, and call tools needs permissions, observability, and recovery paths.
The same concern showed up in browser tooling. Chrome DevTools MCP gives agents access to runtime facts such as console messages, network requests, screenshots, traces, and browser actions. That is extremely useful for frontend debugging. It also means the agent may see sensitive browser state. Better tool access needs clearer trust boundaries.
The product lesson is that UI is not decoration. For agent systems, UI is part of the control plane. The interface should make context visible, permissions explicit, and dangerous actions reviewable. If an agent needs approval, the approval should happen through a structured interface, not a vague paragraph asking the user to trust it.
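Here is a hypothetical sketch of what a structured approval could look like; the fields and names are my own, not any framework's API:

```python
from dataclasses import dataclass

@dataclass
class ApprovalRequest:
    """A reviewable approval: what the action is, why, and whether it can be undone."""
    action: str          # e.g. "delete_branch"
    target: str          # e.g. "origin/experiment-42"
    justification: str   # the agent's stated reason, shown to the user verbatim
    reversible: bool     # the UI can render irreversible actions differently

def require_approval(req: ApprovalRequest) -> bool:
    print(f"[{'reversible' if req.reversible else 'IRREVERSIBLE'}] "
          f"{req.action} -> {req.target}\nreason: {req.justification}")
    return input("approve? [y/N] ").strip().lower() == "y"
```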
Serving and platform constraints still matter
It is easy to talk about agents as if the only question is model intelligence. The systems-level work I read this week reminded me that deployment constraints still shape the product.
ds4 is a narrow local inference engine for DeepSeek V4 Flash on Apple Metal. The interesting lesson is not that every project needs a custom engine; it is that a specific serving workload can justify specific design choices, including model-specific loading, KV cache persistence, and compatibility endpoints.
The EMO work from Ai2 points in a related direction. If applications often need only part of a model's capability, maybe serving should become more modular. A model that exposes useful expert subsets could let systems trade memory, latency, and accuracy more explicitly.
Even desktop packaging showed up in this week's tool scan. zero-native is not an agent framework, but it matters because many agent products eventually need to become actual applications. Startup time, binary size, permissions, native APIs, update flow, and security policy all affect whether an AI tool feels trustworthy.
That is a useful reminder for me. AI projects still live inside normal software constraints. Latency, memory, packaging, observability, and permissions are not implementation details to clean up later. They are product features.
What I would build next
If I turn this week's reading into one portfolio project, I would build a technical research agent with five visible layers:
- a hybrid retrieval interface that supports semantic search, exact search, file reads, metadata filters, and source-span expansion;
- a small skill memory system where repeated successful workflows become editable Markdown runbooks;
- a citation verifier that parses generated reports and checks link validity, relevance, and claim support;
- an evidence dashboard that shows searches, retrieved sources, tool calls, and evaluation results;
- a deployment surface with explicit permissions and clear failure recovery.
That project would not need to be huge. In fact, it would be better if it were small enough to inspect. The point would be to show the engineering system around the agent: how it finds context, how it remembers, how it evaluates itself, how the user audits it, and how it fails safely.
My takeaway
The best idea from this week is simple: reliable AI agents are less like smart chat boxes and more like software systems with control planes.
A useful agent needs memory, but memory needs curation. It needs retrieval, but retrieval needs debuggable interfaces. It needs tools, but tools need permissions. It needs evaluation, but evaluation needs a claim contract. It needs good models, but serving constraints still decide what can actually ship.
That is the kind of AI engineering I want to get better at. Not just prompting a model until the answer looks good, but building the surrounding system that makes the answer measurable, inspectable, and safe enough to use.