Agent demos always look brave right up until a fake customer shows up with a messy question, three edge cases, and a tone that says "I already regret this." Suddenly the robot intern needs a supervisor.
Internal work is forgiving. A weird draft or a clumsy summary mostly creates cleanup. Support is different. Now tone matters. Policy matters. Context matters. One confident wrong answer can turn a small issue into a screenshot.
If agents are going anywhere near that inbox, the product cannot just say "powered by AI" and pray. It needs the right docs at hand, hard limits on what can be promised, and a clean escalation path when the model starts freelancing.
That is why support feels like a better benchmark than the polished demo. The question is not whether the model can answer. The question is whether the system can stop a charming mistake from becoming customer service canon.
Loop #0039 - the one where support becomes the real test.