The missing layer in AI agent output design

There are hundreds of blog posts about what AI agents should show users. Almost none of them talk about what happens when the delivery fails.
And it will fail. It always fails.
Everyone's designing the surface
Search for "agentic UX design patterns" and, after a little digging, you'll find a lot of thoughtful writing. What should agents stream to users? Reasoning traces? Tool calls? Confidence scores? Progress indicators? The design community has been all over this – Smashing Magazine, Microsoft Design, UX Magazine, Xcapit – people are thinking hard about what the output surface should look like.
They're good posts. They're also missing something.
They're all designing within the existing stack. Nobody's asking what the stack is missing.
The agentic loop
The term "agentic" refers to AI systems that don't just respond to a single prompt but take autonomous action across multiple steps to achieve a goal. The core pattern is a loop:
Perceive – the agent takes in information from the user's request and any other context it has access to
Reason – it sends that context to an LLM and gets back a plan or decision
Act – based on the LLM's output, the agent takes action: calling a tool, calling another agent, or generating a response
Observe – the agent evaluates the result and decides whether to loop again or send the response to the client
A simple chatbot might go around this loop once. A research agent might go around dozens of times across multiple LLM calls and tool invocations over several minutes.
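To make the four steps concrete, here's a minimal sketch of that loop. Everything in it is hypothetical: `Agent`, `Decision` and `scripted_reason` are illustrative names, and the "LLM" is a stub function so the example runs on its own. A real agent would call a model where `reason` is invoked.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Decision:
    """What the (stubbed) model decided to do next."""
    kind: str                   # "tool" to call a tool, "respond" to finish
    payload: str                # tool input, or the final answer text
    tool: Optional[str] = None  # which tool to call when kind == "tool"


@dataclass
class Agent:
    reason: Callable[[list[str]], Decision]  # stand-in for an LLM call
    tools: dict[str, Callable[[str], str]]
    max_steps: int = 10                      # guard against looping forever
    context: list[str] = field(default_factory=list)

    def run(self, request: str) -> str:
        self.context.append(f"user: {request}")            # Perceive
        for _ in range(self.max_steps):
            decision = self.reason(self.context)            # Reason
            if decision.kind == "respond":
                return decision.payload                     # done: reply to the client
            result = self.tools[decision.tool](decision.payload)  # Act
            self.context.append(f"tool: {result}")          # Observe, then loop again
        return "stopped: step limit reached"


def scripted_reason(context: list[str]) -> Decision:
    """Hypothetical policy: search once, then answer with what came back."""
    tool_results = [c for c in context if c.startswith("tool: ")]
    if not tool_results:
        return Decision(kind="tool", tool="search", payload="agent transports")
    return Decision(kind="respond", payload=f"Found: {tool_results[-1][6:]}")


agent = Agent(reason=scripted_reason, tools={"search": lambda q: f"3 results for {q}"})
answer = agent.run("research this")
```

This stub goes around the loop twice: once to call the search tool, once to respond. Swap in a real model and real tools and the same loop can run for minutes, which is where delivery starts to matter.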
The stack beneath the surface
That "several minutes" detail matters more than it might seem.
Every article about output surface design is focused on what to render at the end of that loop – the visual layer, the UX, what to show and when. But the loop itself, and the delivery of its output to the user in real time, is where things fall apart in production.
What happens when the user's connection drops mid-stream? What happens when they open the same conversation on their phone? What happens when two users are watching the same agent run? What happens when the agent takes three minutes and the frontend reconnects four times in that window?
Right now, for most teams: chaos.
What people are actually shipping
Most production AI apps handle this with a combination of existing infrastructure and a prayer.
The pattern usually starts with Redis – a buffer sitting between the AI backend and the client, decoupling generation from delivery. It works, until you need reconnection catch-up or multi-device support, at which point you're bolting custom logic onto something that was never designed for it. So some teams build a custom WebSocket layer instead, which gives you a bidirectional channel but hands full ownership of connection management, message ordering, and scaling to your team. Others route token streams through Supabase, Firebase, or Convex – fine for prototypes, but token streaming generates around 180,000 writes per hour per session. That works out to about 50 writes per second, sustained, and write latency becomes the bottleneck fast. And some teams stitch all three together and call it an architecture.
It's a lot of work just to answer the question: "what did the agent already say?"
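The real problem is that SSE (Server-Sent Events), the mechanism most LLM APIs use to stream tokens, was never the right tool for this. It's a one-way stream carried over a single long-lived HTTP response. Stateless by design: nothing in the protocol stores what was sent. A reconnecting client can tell the server the last event ID it saw, but replaying what it missed is entirely up to the server. Once the connection drops, the server has no obligation to the client. Everything else (the Redis buffers, the custom WebSocket layers, the database subscriptions) is a patch on top of that fundamental mismatch.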
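Here's roughly what the buffer part of that patch looks like, as a minimal in-memory sketch. `ReplayBuffer` and `to_sse_frame` are hypothetical names, and a production version would keep the events in Redis or similar rather than a Python list. The only real protocol details are the SSE wire format (`id:` and `data:` fields, blank line to end an event) and the `Last-Event-ID` header a browser sends when it reconnects.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayBuffer:
    """Keeps every event of one agent run so reconnecting clients can catch up."""
    events: list[str] = field(default_factory=list)

    def append(self, data: str) -> int:
        """Store an event and return its id (1-based position in the run)."""
        self.events.append(data)
        return len(self.events)

    def replay_after(self, last_event_id: int) -> list[tuple[int, str]]:
        """Return (id, data) pairs for everything after last_event_id."""
        start = max(0, last_event_id)
        return list(enumerate(self.events[start:], start=start + 1))


def to_sse_frame(event_id: int, data: str) -> str:
    """Serialize one event in SSE wire format, one data: line per line of text."""
    data_lines = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"id: {event_id}\n{data_lines}\n"


buffer = ReplayBuffer()
for token in ["The", " answer", " is", " 42"]:
    buffer.append(token)

# A client that reconnects with Last-Event-ID: 2 only needs events 3 and 4.
missed = buffer.replay_after(2)
frames = [to_sse_frame(event_id, data) for event_id, data in missed]
```

Note what this doesn't cover: expiry, multiple server instances, fan-out to a second device, or knowing whether the agent is still alive. Each of those is another layer of custom code.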
The gap nobody is filling
The design community is writing detailed posts about whether to show reasoning traces as expandable accordions or inline callouts. Meanwhile, the engineering community is reinventing the same stateful-SSE-with-Redis-backup solution from scratch, at every company, every time.
Nobody is asking: what does the transport need to look like for this to work properly?
Because if you think about what AI agent output actually needs from its transport layer, it's not complicated to describe:
Resume a stream from any point after reconnection
Deliver the same stream to multiple clients simultaneously
Know whether the agent is still running or has crashed
Load conversation history without replaying every token delta
Let the client steer the conversation in real time, not just receive output
That's not a list of nice-to-haves. That's the minimum spec for production AI output. And it's a list that SSE fundamentally cannot satisfy – not without building a stateful session management system on top of it, which is exactly what everyone is doing, badly, one startup at a time.
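Take the third item on that list, knowing whether the agent is still running or has crashed. A common way to approximate it by hand is a heartbeat with a timeout. The sketch below is one hypothetical version (`AgentPresence`, the 15-second default and the status strings are all assumptions for illustration), with the clock injected so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Optional


class AgentPresence:
    """Tracks agent liveness from heartbeats; silence past the timeout means trouble."""

    def __init__(self, timeout_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_beat: Optional[float] = None
        self._finished = False

    def heartbeat(self) -> None:
        """Called periodically by the agent process while it is working."""
        self._last_beat = self._clock()

    def finish(self) -> None:
        """Called once by the agent when the run completes normally."""
        self._finished = True

    def status(self) -> str:
        if self._finished:
            return "finished"
        if self._last_beat is None:
            return "not_started"
        if self._clock() - self._last_beat > self._timeout_s:
            return "presumed_crashed"  # a guess: the agent may only be partitioned
        return "running"


# Fake clock so the example is deterministic.
now = [0.0]
presence = AgentPresence(timeout_s=5.0, clock=lambda: now[0])
presence.heartbeat()
now[0] = 3.0
status_while_beating = presence.status()
now[0] = 10.0
status_after_silence = presence.status()
```

Even this small piece needs shared state that every client can read, which is one more thing a plain SSE endpoint doesn't give you.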
What the transport layer should look like
An AI transport layer has to be designed for the problem from the start, not adapted from something else after the fact.
That means real-time message delivery to multiple subscribers. Reconnection and disconnection as things the transport actually handles, not things you bolt on later. Presence – knowing whether a process is connected or not. And a clean separation between a message and a connection, because a long-running agent response needs to outlive any individual client connection.
The model that makes sense for token streaming is: one message per LLM response, with each token appended as it's generated. Clients connected during generation see tokens in real time. Clients that reconnect get the current state of the message (all tokens up to that point) in a single update, not a replay of every delta.
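As a rough illustration of that model, here's a minimal in-process sketch. It is not Ably's API: `StreamedMessage`, `subscribe`, `append` and the `("snapshot", ...)` / `("append", ...)` update shapes are hypothetical names chosen for this example, and a real transport would do this across a network rather than with callbacks.

```python
from dataclasses import dataclass, field
from typing import Callable

# ("append", token) for a live delta, ("snapshot", text_so_far) for catch-up.
Update = tuple[str, str]


@dataclass
class StreamedMessage:
    """One message per LLM response; tokens are appended to it as they arrive."""
    text: str = ""
    subscribers: list[Callable[[Update], None]] = field(default_factory=list)

    def subscribe(self, on_update: Callable[[Update], None]) -> None:
        # A late or reconnecting client gets the current state in one update,
        # not a replay of every delta it missed.
        if self.text:
            on_update(("snapshot", self.text))
        self.subscribers.append(on_update)

    def append(self, token: str) -> None:
        # The message outlives any one connection: state is updated first,
        # then whoever is connected right now is notified.
        self.text += token
        for notify in self.subscribers:
            notify(("append", token))


message = StreamedMessage()
laptop: list[Update] = []
message.subscribe(laptop.append)   # connected from the start
message.append("Hel")
message.append("lo")

phone: list[Update] = []
message.subscribe(phone.append)    # joins mid-stream on a second device
message.append("!")
```

The laptop sees three deltas. The phone sees one snapshot containing "Hello" and then the final delta. Both end up rendering the same text, and neither needed a custom catch-up path.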
Conversation history, reconnection, multi-device, presence – solved by the transport. Not your problem anymore.
Why this matters for output surface design
The output surface decisions (what to show, when to show it, how to communicate agent state to users) are only meaningful if the delivery is reliable.
The industry keeps solving this from scratch, one Redis cluster at a time, because nobody has treated the AI output transport as a product worth building properly.
This is what we're building over at Ably with Ably AI Transport. Not another framework for the surface layer – there are plenty of those – but the transport underneath it, designed from the start for the way AI agents actually produce output.
It turns out that once you solve the transport correctly, the surface design problems get a lot easier. You're not designing around failure modes anymore. You're just designing.
