Electronic Arts · engineering

LLM Evals Framework

Mitigating Agentic Drift

Evals · LLM · Tokenomics · Quality
When you give AI agents genuine autonomy — the ability to make decisions, form goals, and act on the world — they drift. Not catastrophically, but subtly. An NPC that's supposed to be a friendly blacksmith gradually becomes confrontational. A quest-giver starts ignoring players. The personality erodes over hundreds of interactions.
The Evals Framework is built to catch this. It uses LLM-as-a-Judge architectures where a separate model evaluates agent behavior against defined personality constraints, narrative consistency, and player experience metrics. The judges run asynchronously — they don't slow down the agent's real-time responses, but they flag drift for review.
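The asynchronous judge pattern can be sketched as follows. Everything here is illustrative: the `Verdict` shape, the hostile-phrase list, and the keyword check standing in for the real judge-model call are assumptions made so the sketch runs, not the framework's actual API.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Verdict:
    persona_score: float                 # 0..1 adherence to the persona brief
    drifted: bool
    violations: list = field(default_factory=list)

# Hypothetical persona constraint for a friendly blacksmith NPC:
# phrases the judge treats as out of character.
HOSTILE_MARKERS = {"go away", "stop bothering me", "i don't care"}

def judge(transcript: list[str], threshold: float = 0.8) -> Verdict:
    """Stand-in for the judge model. In production this would be a call
    to a separate LLM scoring the transcript against the persona brief;
    a keyword check keeps the sketch self-contained."""
    violations = [line for line in transcript
                  if any(m in line.lower() for m in HOSTILE_MARKERS)]
    score = 1.0 - len(violations) / max(len(transcript), 1)
    return Verdict(score, score < threshold, violations)

async def judge_worker(queue: asyncio.Queue, flagged: list) -> None:
    """Drains transcripts off the hot path; the agent never awaits this,
    so real-time responses are unaffected."""
    while True:
        transcript = await queue.get()
        verdict = judge(transcript)
        if verdict.drifted:
            flagged.append(verdict)   # surfaced for human review
        queue.task_done()

async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    flagged: list = []
    worker = asyncio.create_task(judge_worker(queue, flagged))
    # The agent replies to the player immediately, then enqueues
    # the transcript for out-of-band evaluation.
    await queue.put(["Welcome, traveler!", "Go away, I don't care."])
    await queue.join()
    worker.cancel()
    return flagged
```

Running `asyncio.run(demo())` flags the second transcript line as drift; the key design point is that evaluation happens on a worker task, never in the agent's response path.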
Beyond quality, the framework handles unit economics. We implemented a multi-tier inference strategy that routes tasks based on the trade-off between inference cost and user-perceived quality. Not every agent interaction needs the most capable (and expensive) model. A background NPC's idle chatter can run on a smaller model; a pivotal narrative moment routes to the best available. The evals framework measures where users actually perceive a quality difference and where they don't — then we optimize the routing accordingly.
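One way to picture the routing decision: pick the cheapest tier whose eval-measured, player-perceived quality gap versus the top model stays under a tolerance. The tier names, gap numbers, and cost units below are invented for illustration; in practice the gap table would be populated from the evals data described above.

```python
# Illustrative eval results: probability that players perceive a quality
# difference between each tier and the top ("large") model, per
# interaction kind. These numbers are made up for the sketch.
PERCEIVED_GAP = {
    ("idle_chatter", "small"): 0.02,
    ("idle_chatter", "medium"): 0.01,
    ("quest_dialogue", "small"): 0.18,
    ("quest_dialogue", "medium"): 0.04,
    ("pivotal_moment", "small"): 0.41,
    ("pivotal_moment", "medium"): 0.22,
}

# Relative inference cost per tier (hypothetical units).
COST = {"small": 1, "medium": 6, "large": 60}

def route(kind: str, tolerance: float = 0.05) -> str:
    """Return the cheapest tier whose perceived-quality gap vs. the best
    model is within tolerance; fall back to the best model otherwise."""
    for tier in sorted(COST, key=COST.get):       # cheapest first
        if tier == "large":
            return tier                           # nothing cheaper qualified
        if PERCEIVED_GAP.get((kind, tier), 1.0) <= tolerance:
            return tier
    return "large"
```

Under these sample numbers, idle chatter routes to the small model, quest dialogue to the medium tier, and a pivotal narrative moment to the large model, which matches the trade-off described in the text.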
The result: agents that stay in character over thousands of interactions, at a cost structure that's sustainable for a live product.