
Building AI Agents
Learnings from our recent event where engineering leaders discussed how the AI infrastructure layer is evolving to support long-running, reliable agents at enterprise scale.
Last month, Actively hosted two deep-dive panels at its HQ with builders shaping the next wave of enterprise AI agents.
The second session, moderated by Mihir Garimella (Co-Founder, Actively), brought together Justin P. (Engineer, OpenAI) and Nick Huang (Engineering Leader, LangChain) for a discussion on how the infrastructure layer is evolving to support long-running, reliable agents at enterprise scale.
If you missed our first recap on the evolution of AI interfaces from chatbots to coworkers, read it here.
Fewer tools, greater power
Early agent architectures defined a separate tool for every task. The new playbook takes the opposite view: equip the model with a small set of general-purpose primitives and let it improvise.
At OpenAI, Justin described how their most capable systems rely on just two: a terminal and a code-interpreter container. Within that environment, Codex can read, write, and execute — all without external orchestration.
“People are giving agents far too many specialized tools. A generalized terminal and code-interpreter container do more than you think.” Justin
LangChain takes a similar stance, treating tools primarily as data connectors rather than function calls.
“Code operations belong in the sandbox; tools shine when they bring in external data. Search, CRM, anything the model can’t see.” Nick
By shrinking the toolbox, teams uncover flexibility they didn’t anticipate and drastically reduce the integration surface that breaks over time. The underlying principle: stop scripting behavior, and start equipping intelligence.
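To make that concrete, here is a minimal sketch of what a two-primitive toolbox can look like. It is illustrative only: the tool names and the sandboxing details are assumptions, not OpenAI's or LangChain's actual implementations.

```python
import subprocess
import sys

# Hypothetical sketch: an agent whose entire toolbox is two general-purpose
# primitives, rather than one specialized function per task.

def run_shell(command: str, timeout: int = 30) -> str:
    """Execute a shell command inside the sandbox and return its output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def run_python(code: str, timeout: int = 30) -> str:
    """Execute a Python snippet in a fresh interpreter process."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

# The model sees only these two tool schemas; everything else (reading files,
# calling APIs via curl, transforming data) is improvised on top of them.
TOOLS = {
    "terminal": run_shell,
    "python": run_python,
}
```

With a toolbox this small, the integration surface that can break is just two functions, and new capabilities come from the model's improvisation rather than from new connectors.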
Evaluation is a product loop, not a benchmark
Traditional ML begins by building a benchmark and optimizing toward it. Agent builders invert that process. They start with a few focused evals, ship quickly, observe how customers actually use the system, and expand the eval suite as the product matures. Evaluation becomes a living loop, growing in lockstep with the product itself.
Nick captured it simply: “Start with three examples.” Small, high-signal tests expose how an agent behaves in real contexts and surface edge cases no static dataset could predict.
“When you’re building agents, you don’t need hundreds of eval samples. Start with three and iterate.” Nick
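In practice, that can be as lightweight as a handful of hand-written cases checked on every change. The sketch below is a hypothetical illustration (the `run_agent` entry point and the grading criteria are placeholders, not any vendor's API) of a three-example eval that grows with the product.

```python
# A deliberately tiny eval suite: three hand-picked, high-signal cases.
# `run_agent` is a placeholder for whatever entry point your agent exposes.

EVAL_CASES = [
    {"input": "Summarize yesterday's failed deploys", "must_contain": "deploy"},
    {"input": "Draft a reply to the refund request in ticket #4521", "must_contain": "refund"},
    {"input": "List open pull requests older than a week", "must_contain": "pull request"},
]

def run_eval(run_agent) -> float:
    """Run every case, print failures, and return the pass rate."""
    passed = 0
    for case in EVAL_CASES:
        output = run_agent(case["input"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case['input']!r} -> {output[:80]!r}")
    return passed / len(EVAL_CASES)

# As real usage surfaces edge cases, they get appended to EVAL_CASES --
# the suite grows with the product instead of being designed up front.
```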
Justin extended the point toward reinforcement learning (RL) — the stage where behaviors we now hand-tune through prompts or evals become internalized in the model’s weights. In domains like code, RL has already compressed intricate prompting into learned reasoning.
“Domain-specific RL reduces prompting. Those behaviors live inside the model now.” Justin
Iterate like a product team, not a research lab. The goal isn’t static accuracy — it’s a system that learns and improves with every deployment.
Think fast, think slow
Latency is now as important a constraint as accuracy. The best systems separate quick reflexes from deep reasoning — pairing a lightweight model for responsiveness with a more capable one for long-horizon thought.
OpenAI refers to this architecture as the Responder–Thinker pattern: a fast, real-time model handles dialogue flow and simple actions, while a slower, more capable model steps in for deep reasoning or multi-step planning. The result feels both instant and intelligent.
“Our real-time model isn’t the smartest, but paired in a responder–thinker pattern, the overall system is.” Justin
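A bare-bones version of that routing might look like the sketch below. The model names and the escalation heuristic are assumptions for illustration, not OpenAI's implementation.

```python
# Hypothetical responder-thinker split: a fast model answers immediately,
# escalating to a slower reasoning model only when the task calls for it.

FAST_MODEL = "fast-chat-model"          # placeholder name
THINKER_MODEL = "deep-reasoning-model"  # placeholder name

def needs_deep_reasoning(message: str) -> bool:
    """Crude escalation heuristic; in practice the fast model itself can decide."""
    triggers = ("plan", "multi-step", "analyze", "debug")
    return any(word in message.lower() for word in triggers)

def respond(message: str, call_model) -> str:
    """Route a user message to the fast responder or the slow thinker.

    `call_model(model_name, prompt)` is a stand-in for whatever client you use.
    """
    if needs_deep_reasoning(message):
        # Acknowledge instantly, then hand off to the thinker.
        ack = call_model(FAST_MODEL, f"Briefly acknowledge: {message}")
        answer = call_model(THINKER_MODEL, message)
        return f"{ack}\n\n{answer}"
    return call_model(FAST_MODEL, message)
```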
LangChain applies the same principle across broader workflows. Short feedback loops — a status update, a quick summary, a minor action — build user trust while heavier reasoning runs asynchronously in the background. Nick emphasized that the key is knowing when to add structure and when to stay flexible.
“You don’t want every decision to invoke full planning, just the ones that actually benefit from it.” Nick
Like the human mind, great agents need two modes of thought: one that reacts instantly, and one that reasons deliberately. Balancing the two is how they feel both responsive and intelligent.
Scaffolding is technical debt
In perhaps the evening’s hottest take, the panel argued that as reinforcement learning continues to advance, most of today’s orchestration logic — the prompts, rules, and workflow scaffolding wrapped around models — will age into technical debt. What we call “structure” today largely exists to compensate for what models haven’t yet learned.
Justin pointed to Codex as an early example: after RL fine-tuning inside a coding harness, the model required only a fraction of its original prompt. Behaviors that once lived in orchestration logic migrated into the model’s weights.
“The more intelligence you have, the less scaffolding you need. Scaffolding exists to patch model deficiencies.” Justin
Nick agreed, but cautioned that not all rails are disposable. Structure still matters when correctness, predictability, or latency are non-negotiable. The skill lies in knowing when scaffolding provides stability, and when it’s just friction.
“Add structure only when the agent can’t complete the task or when planning churn makes it too slow or expensive.” Nick
As RL pushes more reasoning and control inside the model, orchestration itself begins to look brittle. Every explicit rule becomes tomorrow’s constraint. The future of agent design won’t be written in scripts — it will be trained into behavior, shaped by feedback rather than enforced by code.
Finding the right amount of structure
Following the discussion on scaffolding, the panel emphasized that structure still plays a critical role — especially when predictability, safety, or latency are at stake. Reinforcement learning may reduce the need for orchestration, but some guardrails remain essential.
Nick explained it pragmatically: add rails when the agent can’t complete the task, or when open-ended planning loops endlessly.
“Determinism isn’t a philosophy. It’s a performance optimization.” Nick
Justin agreed, noting that excessive structure can slow progress if it replaces what models could learn on their own.
“If it’s 2+2, call a calculator. Don’t spend tokens reasoning about what 2+2 is.” Justin
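That trade-off is easy to picture in code. The sketch below is purely illustrative: it routes trivial arithmetic to a deterministic path and reserves the model for everything else.

```python
import ast
import operator

# Deterministic "calculator" path: pure arithmetic never touches the model.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without exec/eval."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("not simple arithmetic")
    return walk(ast.parse(expr, mode="eval").body)

def answer(query: str, call_model) -> str:
    """Use the deterministic path when possible, the model otherwise."""
    try:
        return str(safe_eval(query))   # e.g. "2+2" -> "4", no tokens spent
    except (ValueError, SyntaxError):
        return call_model(query)       # open-ended queries still go to the model
```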
Ultimately, the panel described a balancing act: add structure when reliability requires it, remove it when the model can handle the task independently. The hardest design choice isn’t which tools to give an agent, but when to stop protecting it. Structure brings reliability early on, yet if it lingers too long, it constrains learning.
The filesystem as memory
As agents evolve from short-lived sessions to persistent systems, they need a stable substrate for memory — something they can read from, write to, and recover after failure. Increasingly, that substrate looks like a filesystem.
Both speakers highlighted the filesystem as a simple but powerful abstraction for storing agent memory. It naturally supports the behaviors agents need: writing ephemeral state, persisting long-term knowledge, and sharing information across users or processes.
LangChain uses this pattern to structure multiple layers of context — short-term session data, persistent project memory, and shared organizational state. Each layer lives as files the agent can open, modify, and reference as part of its reasoning loop.
“File systems are a really interesting medium for memory — ephemeral, multi-session, even shared across users.” Nick
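In its simplest form, that layering is just a directory convention. The sketch below is our own simplified illustration; the layout and helper names are assumptions, not LangChain’s API.

```python
from pathlib import Path

# Hypothetical memory layout: each context layer is a directory of files
# the agent can read, append to, and inspect after the fact.
MEMORY_ROOT = Path("agent_memory")

LAYERS = {
    "session": MEMORY_ROOT / "sessions",   # ephemeral, per-conversation scratch
    "project": MEMORY_ROOT / "projects",   # persistent, long-term knowledge
    "shared":  MEMORY_ROOT / "shared",     # organizational state, visible to all agents
}

def remember(layer: str, key: str, note: str) -> None:
    """Append a note to a named file in the given layer."""
    path = LAYERS[layer] / f"{key}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(note.rstrip() + "\n")

def recall(layer: str, key: str) -> str:
    """Read everything the agent has written under this key, if anything."""
    path = LAYERS[layer] / f"{key}.md"
    return path.read_text() if path.exists() else ""

# Because memory is plain files, a human (or another agent) can open the same
# notes and trace what the system knows and how it got there.
```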
This approach treats memory not as a database or black box, but as a readable and inspectable workspace — closer to how humans use notebooks or folders to organize thought. It also makes debugging and collaboration easier: other agents, or even humans, can trace what the system knows and how it learned it.
In practice, the filesystem becomes the agent’s long-term context — a space where reasoning leaves traces. It’s not the flashiest part of the stack, but it’s what turns a model into a persistent collaborator rather than a stateless function call.
Closing thoughts
Across both panels, a clear shift emerged. The last generation of AI systems relied on scaffolding: prompts, workflows, and orchestration logic that constrained models into predictable behavior. The next generation is built for durability — agents that can remember, reason, and recover on their own.
