Some Thoughts on Studying Agentic Systems: The Hard Part Is Invisible
An agentic feature looks like the easiest thing in the world to study. It's a feature. We have a playbook for features: watch some people use it, ask how it went, score the experience, write the deck. The playbook runs fine on agents, the data comes back clean, and that is exactly the problem, because clean data is also what a failed agent study looks like.
The methods in the standard UXR toolkit share one assumption so basic nobody bothers to state it: the person we study is the person who decides. Interviews, usability sessions, surveys, conjoint, diaries. Every one of them instruments a human at or around the moment of choice, and fourteen years in, I'd say most of what I know how to do amounts to instrumenting one side or the other of that moment. When a user delegates the decision to an agent, the moment relocates. The person writes a few sentences, a model reads them, and the model chooses while the user is somewhere else entirely. Point the playbook at the user and you are carefully measuring someone who has left the room. The study still runs. The participants still talk. Nothing errors out. You just get findings about the wrong object.
Nothing errors out
The default moves fail in specific ways, and none of them fail in a way you'd notice. Ask users whether they're satisfied with what the agent picked and they'll tell you, sincerely, because they never see the choice they would have made themselves and have nothing to compare against. Interview people about their preferences and you'll capture the ones they can articulate, which excludes precisely the heuristics that drive a surprising share of real behavior, since nobody can report a bias they don't know they have. Recruit from people who use the agent feature, the obvious screener, and you've sampled on the outcome. Run the study on the one model your team ships, at whatever temperature the SDK defaults to, and you've measured an artifact with a shelf life nobody wrote on the deck. Track aggregate behavior and you can watch an agent look perfectly reasonable on average while doing something very different to individuals.
Every one of those studies produces a deck. Every deck looks fine. I ran versions of several of them before I had language for what was missing.
What gave me the language is a working paper I have not been able to put down, which is a strange thing to say about a behavioral pricing study involving gas stations.
An existence proof involving gas stations
Kraft and Larsen's "Consumer Preference Transmission in Agentic Markets" (SSRN, June 2026) asks the question underneath all of this: when a person hands a purchase to an AI agent, how much of that person survives the handoff? They recruited about 1,800 people on Prolific and ran them through a conjoint, 25 head-to-head choices between gas stations differing on brand, price per gallon, and extra minutes of driving. Gasoline was deliberate, a standardized, boring product where the tradeoffs are simple, which makes it a conservative test: if preferences fail to transmit even here, they will not fare better where tastes are messier. The clever part sits inside the price. They varied the dollars and the cents independently, which lets them measure left-digit bias for each individual, the degree to which someone underweights the cents and reads $4.99 as meaningfully cheaper than $5.00. That heuristic sits behind every .99 price you have ever seen, and it makes a perfect test case because nobody knows they have it. A person can tell an agent they prefer Chevron. Nobody tells an agent to please underweight the cents component of prices.
Then the handoff. Each person wrote a natural-language prompt for an AI agent that would buy on their behalf, and seven models across the OpenAI, Anthropic, Google, and DeepSeek families, all pinned at temperature zero, faced the same 25 choice sets under each person's prompt. That produces two full sets of preference estimates per person, one from their own choices and one from their agent, and transmission is simply the correlation between the two across people. Brand and distance preferences came through at around 0.2 to 0.3, credibly above zero for every model, and before anyone calls that success, sit with the number: 0.25 is mostly loss. Even the preferences people can name and type only partially survive. Left-digit bias did not survive at all, a pooled correlation of 0.03 with intervals covering zero for every single model. And the agents are not unbiased, which is what makes the result interesting rather than just disappointing. Every model exhibited left-digit bias of its own, in roughly the same range as humans. They just don't carry yours. The agent overwrites your relationship with a .99 ending the way a translator with strong opinions renders every author in the same voice. Where human and agent biases did align, it happened through a side door: the people most willing to delegate had higher bias to begin with, sitting close to where the models already were. Nothing transmitted. They matched before writing a word.
Now run the standard playbook against that same scenario in your head. Satisfaction interviews with agent users come back positive, the agent picks cheap, sensible gas stations. A study of adopters finds people whose behavior aligns nicely with their agents, because that's who adopts. A quality review of the agent's choices finds them reasonable. Three clean studies, three fine decks, and the overwriting sits underneath all of them, undetected. The only design that catches it is the one they actually ran.
Five moves worth stealing
The first move makes everything else possible: the human is their own control condition. The same person decides directly, then writes the prompt, then the agents decide on identical tasks, which means delegation finally has a counterfactual. Most existing work on whether models behave like people compares a sample of humans against a separate run of models, and that design can only speak about populations. It cannot tell you whether your preferences survived your delegation, and that, not the population question, is what anyone shipping an agent needs to know.
The second is treating the prompt as data, and specifically its form, not just its content. They coded length, numeric detail, whether cents got mentioned, and it turns out people with stronger left-digit bias write shorter prompts with less price information in them. The bias leaks through the shape of the writing, which means a latent trait becomes measurable without anyone articulating it. That is a qualitative coding job, and it is sitting in the middle of an econometrics paper.
The third is the decomposition I expect to steal in meetings for years. For a preference to transmit, two things have to hold: it has to leave a trace in the prompt at all, which they call identifiability, and the model has to act on the trace, which they call responsiveness. Transmission is effectively the product of the two, so either failing takes the whole thing to zero. From the outside both failures present identically, the agent did something the user didn't want. Inside, they belong to different disciplines. Identifiability is a question about people and their language, studiable with no model whatsoever, which makes it squarely ours. Responsiveness is an eval. If the signal never entered the prompt, no amount of model tuning fixes it, and if the model ignored a signal sitting right there, no amount of coaching users to write better prompts fixes that either.
The fourth is keeping aggregate behavior and individual transmission apart. "The agent is biased" and "the agent carries this user's bias" sound like the same finding on a slide, and they support opposite product decisions. The agents here are biased as a group and carry zero of any particular person's bias, which is a sentence the aggregate metrics would never have produced.
The fifth is honesty about the instrument. Seven models, four families, temperature zero, all dated, because the differences between models are themselves a result. They also admit something agent eval decks routinely dodge: there is no prompt-free baseline. Even an empty instruction is a prompt that conditions the model, so you never observe the model on its own, only the model under some input, and your baseline is a choice you made rather than a fact you found. Same spirit in how they report. Their follow-up study put real stakes on adoption, an incentivized task where people chose whether to use an agent to buy an actual book, and the link between agent use and the bias landed at p = 0.118. They print it, call it directionally consistent, and decline to dress it up as more. A paper this comfortable with its own loose ends is rarer than it should be, in their literature and in ours.
Meanwhile, in our research plans
Hold those moves next to current practice and the gaps are not subtle. The text artifact users now produce constantly, the prompt, sits in product logs largely uncoded, while we maintain repositories full of coded interview quotes. We have an entire qualitative tradition built for exactly this and we are not pointing it at the one object that now mediates the decision. The triage version of the same blind spot: an agent miss gets filed as a model problem, and the cheap question is whether anyone checked the prompt for the signal first. In my experience, roughly half the time, nobody has.
Recruiting deserves its own uncomfortable paragraph. When we study an agent feature, we screen for people who use the agent feature, because of course we do, that's who has the experience to talk about. The paper's selection result is what that habit costs: the trait that fails to transmit is the same trait that predicts who delegates, so the people who already resemble the model are the ones handing it the wheel, and a screener built on usage filters the sample by the very thing the study should be measuring. The non-adopters and the reluctant are not an edge segment in this kind of work. They are the control group, and they are the people our screeners are built to exclude. I might be over-indexing on this one, but I don't think by much.
The hard part was never the instruments
None of this is expensive, which took me a while to absorb. The whole design is a panel, a choice task, a prompt-writing step, and some API calls at temperature zero, and any team that has ever fielded a conjoint owns every piece. The paper's statistical machinery is elaborate, but even a crude person-by-person comparison of what the human picked against what their agent picked tells you whether preferences are surviving the handoff.
So the difficulty was never tooling or budget. The difficulty is that agentic systems hide where the decision went, the old designs keep returning answers anyway, and a study that measures the wrong object is indistinguishable from a good one right up until someone runs the comparison. Agents take over more of the choices inside our products every quarter, and most research plans still describe a person standing at the shelf, deciding. I notice nobody is in a hurry to update them, including the people who write blog posts about it.
🎯 Subscribe for more of this. It's the rare delegation where your preferences actually survive the handoff.