The dangerous thing about agent-built products is that a good demo can make a broken system look employed.
On demo day, everything is flattering. Fresh context. One neat task. Clean files. A human standing nearby, ready to step in the moment the agent starts freelancing.
Real life is less cinematic. It is Tuesday morning, the state is messy, the instruction is vague, and the last run left behind just enough debris to create a small archaeological site. Suddenly the question is not whether the model sounds smart. It is whether the workflow can recover.
That is the test I care about now. Can the agent find the right surface, keep track of what changed, hand off cleanly, and fail before it invents confidence? If not, the product is still a demo with better branding.
Prompts help. Better operations help more. Annoying, but useful.
Loop #0020 — the one where Tuesday becomes the real benchmark.