Stanford and Columbia Tested User Digital Twins Against Real Survey and Interview Data. They Found the Same Thing. They Kinda Work (with a lot of caveats).
Two papers on user digital twins updated within days of each other. Two research teams with opposite incentives just landed on the same empirical picture, and almost nobody is connecting the dots.
Stanford's Park et al. posted v2 of the "1,000 people" paper. The original title was Generative Agent Simulations of 1,000 People. The new title is LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals. Notice what got removed. The number. The scale pitch. The implication that this is a thing that was done. The new title replaces all of that with a softer verb: "enable." LLMs grounded in self-reports enable simulation. Not replicate. Not replace. Enable. Reviewers almost certainly pushed back on the original framing.
Columbia's Toubia et al. posted v5 of their mega-study. The old title was A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement. The new title is Digital Twins as Funhouse Mirrors: Five Key Distortions.
Read those title changes together. Stanford started with a bold scale claim and softened. Columbia started with balanced methodological framing and sharpened into a thesis. Both teams spent months revising. Both ended up at titles that are more honest about the limits than the originals were. The convergence is not just in the numbers. It's in the language.
The academic papers both say "digital twins." I'm going to say "user digital twins" throughout, because that's how vendors pitch this to your research team, and the vendor framing is the one most readers here are actually encountering. The underlying technology is the same.
What I keep thinking about is how little of this is actually in dispute between the two teams. Both papers now use similar task families. Both papers compare twin-based predictions against various baseline agents. Both papers report correlations, raw accuracies, and normalized accuracies. If you strip away the framing and stack the numbers next to each other, the academic story has quietly converged. The vendor pitch built on top of these papers has not.
User Digital Twins Are Not Synthetic Users
I keep seeing these two conflated in vendor decks, in Slack threads, and in leadership conversations where someone read a TechCrunch piece and now wants to know if the team can stop running interviews. They are not the same thing. The distinction matters for how you evaluate both categories and for what arguments land when leadership pushes you to adopt one or the other.
A synthetic user is an LLM prompted with a rough demographic or persona description and asked to behave like a participant. You type in "35-year-old working mom in Ohio who uses your product twice a week" and the model produces whatever text it thinks such a person would say. There is no real person behind the output. There is no grounding in any specific individual's actual behavior, history, or prior responses. The entire epistemic weight of the output rests on the LLM's averaged sense of what someone matching that description might plausibly say. The 182-study review I wrote about documented why this falls apart. The outputs are shallow, stereotypical, hyper-rational, positivity-biased, and dangerously believable.
A user digital twin is different in one specific way. The twin is built from a real person's actual prior data. Their interview transcript. Their survey answers. Their personality inventory responses. Their choices in economic games. The premise is that all this individual-level grounding will push the model past the averaged demographic stereotype and into something closer to actual individual simulation. You have a real participant. You have their data. You use that data to ground an LLM that then responds to new questions as if it were that person.
The theoretical case for why twins should work better than synthetic users is reasonable. If the problem with synthetic users is that they collapse into stereotypes, then giving the model 128,000 characters of that specific person's actual responses should, in principle, prevent the collapse. Columbia tested exactly this premise. Stanford tested a similar version with two-hour interview transcripts. Both of them loaded the model with the richest possible individual grounding anyone has attempted at scale.
Here is what makes the distinction matter for practitioners. When a vendor says "our synthetic participants match your users with 85% accuracy," they are almost always describing a twin, not a synthetic user, because twins are the version that produces demo-ready numbers. Twins beat synthetic users on every benchmark. The synthetic users pitch is mostly collapsing, as I wrote about a couple of weeks ago. The twins pitch is ascending because it has better numbers. Your job as a researcher is to understand that "better numbers than a failing baseline" is not the same as "good enough to replace your research."
Both categories share the same underlying mechanism. Both are LLMs producing text. Both are subject to the retrieval-not-prediction problem. Both are distorted by RLHF in the same ways. The twins just have more context stuffed into the prompt, which gives them more to retrieve from. Whether that counts as a fundamental improvement or a scaled-up version of the same problem is exactly what the two papers I'm writing about set out to test.
That's the promise. If it works, it changes the discipline. You could do heterogeneity analysis on a population you already profiled. You could pilot survey instruments before fielding them. You could stress-test a research plan against a known panel. The academic excitement is not crazy. These would be real wins.
The question is whether the promise survives contact with the evidence. Two papers just told us it mostly doesn't.
What the H-E-Double-Hockey-Sticks These Papers Actually Tested
Both papers ask the same question from different angles. Can a large language model, given enough information about a real individual, predict how that individual will respond to new questions the model has never seen before? That's the test of whether user digital twins work as research tools.
The differences are in how each team set up the test.
Stanford recruited 1,052 American adults, stratified to approximately match the US population on age, race, gender, region, education, and political identity. Each participant sat through a two-hour voice-to-voice interview conducted by an AI interviewer following the American Voices Project script, which covers life history, family, work, health, politics, religion, and personal values. The average transcript came out to roughly 6,500 words per person. Participants then completed the General Social Survey, the Big Five personality inventory, five behavioral economic games with real money on the line, and five replications of published social science experiments. Two weeks later, participants took the same batteries again. That second wave gave Stanford a test-retest baseline: how consistent are real humans with themselves across two weeks?
Stanford then built five types of agents for each participant. Interview-only agents grounded in the transcript alone. Survey-only agents grounded in the GSS and Big Five responses. Combined agents grounded in both. Demographic agents given only age, race, gender, and political ideology. Persona agents given a paragraph the participant wrote describing themselves. The test was whether the agents could predict that specific person's held-out responses, compared against the person's own two-week consistency.
Columbia took a different approach. They recruited 2,058 participants via Prolific and had each person answer over 500 questions across four waves of surveys, averaging 2.42 hours per person. The questions covered 14 demographic items, 279 personality items across 19 personality tests, 85 cognitive ability items, 34 economic preferences items, 48 questions across 17 heuristics-and-biases experiments, and a 40-item pricing study. That's approximately 128,000 characters of self-report data per participant, roughly four times the grounding material Stanford used per person.
Columbia then designed 19 pre-registered sub-studies covering 164 outcomes. These were not replications of existing social science findings. They were new studies covering things like attitudes toward hiring algorithms, intention to share misinformation, fairness judgments, creativity evaluations, and privacy preferences. Columbia also included what Stanford did not: a comparison against the base LLM with no individual grounding at all. If you wanted to know how much of the twin's performance came from the individual data versus from the model's general knowledge, Columbia was set up to answer that. Stanford was not.
Both teams report multiple metrics for each result. Accuracy, which is how often the twin matches the human exactly. Correlation, which is how closely the twin's answers track the human's answers across a range of questions. Normalized accuracy, which adjusts for how inconsistent humans are with themselves. Mean absolute error for continuous outcomes. Columbia adds dispersion analysis, which measures whether the variability in the twins' answers matches the variability in the humans' answers.
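To keep those definitions concrete, here's a minimal sketch of how the first four metrics are computed for a single participant. The arrays are fabricated toy numbers, not data from either paper; the dispersion check needs the whole panel, so there's a separate sketch for it further down.

```python
import numpy as np
from scipy.stats import pearsonr

# Fabricated toy data: one participant's wave-1 answers, the same person's
# answers two weeks later, and their twin's predictions (e.g., 1-5 scales).
human_wave1 = np.array([3, 1, 4, 2, 5, 3, 2, 4])
human_wave2 = np.array([3, 2, 4, 2, 4, 3, 2, 4])   # test-retest wave
twin_pred   = np.array([3, 1, 3, 2, 4, 3, 3, 4])

# Accuracy: exact-match rate between twin and human.
raw_accuracy = np.mean(twin_pred == human_wave1)

# Correlation: how closely the twin tracks the human across questions.
r, _ = pearsonr(twin_pred, human_wave1)

# Normalized accuracy: twin accuracy divided by the human's own two-week
# self-consistency, i.e., the test-retest ceiling.
test_retest = np.mean(human_wave1 == human_wave2)
normalized_accuracy = raw_accuracy / test_retest

# Mean absolute error, for continuous or scale outcomes.
mae = np.mean(np.abs(twin_pred - human_wave1))
```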
The reason these two papers are worth reading together is that they are both testing essentially the same underlying claim with substantially different amounts of individual grounding, substantially different task domains, and substantially different baselines for comparison. If user digital twins actually work the way vendors claim they do, the numbers should converge on "yes, they work." If twins don't really work, the numbers should converge on "no, they don't." What actually happened is messier than either extreme, and it's what the rest of this piece is about.
What Stanford Actually Reported
Park et al. v2 reports that their user digital twins predict participants' General Social Survey answers with 83% normalized accuracy for interview-only agents, 82% for survey-only agents, and 86% for the combined version. The weaker baselines they tested (agents given only demographics, or a short persona paragraph) came in at 74% and 71%. These are the numbers you'll see in vendor decks for the next two years.
Here's what normalized accuracy actually means, because almost nobody gets this right. The raw accuracy of the twins was 65.67%. Stanford divided that by how consistently the real participants replicated their own answers two weeks later, which was 79.53%. 65.67 divided by 79.53 is roughly 0.83. The framing is: twins are 83% as accurate as a person answering the same question twice. That's defensible math. It's also cognitively loaded, because the reader sees 83% and mentally files it as "very accurate" when the underlying hit rate is in the mid-60s. By the time the number reaches an executive deck, the nuance is gone.
A practitioner reading the paper casually will stop there. A practitioner reading the supplementary materials will find something very different.
Stanford tested the twins on five behavioral economics games: the Dictator Game, the Trust Game in two versions, the Public Goods Game, and the Prisoner's Dilemma. These are the tasks most relevant to how people actually behave when money is on the line. They are the closest thing in the paper to a test of whether twins can predict behavior rather than attitudes.
The main text reports a correlation of r=0.66 for economic games. The supplementary appendix reports the individual construct correlations: r=0.11 for the Dictator Game, r=0.08 for the Trust Game (first mover), r=0.03 for the Trust Game (second mover), r=-0.05 for Public Goods, r=0.10 for Prisoner's Dilemma. Average: roughly r=0.05. That's near zero. A correlation of 0.05 means effectively no relationship between the twin's prediction and the individual human's choice.
Both numbers are real. They measure different things, and the difference matters enormously.
The r=0.66 is an aggregate, across-persons number. It mostly reflects whether the twins reproduce the typical behavior in each game, and they do, because the model has read hundreds of academic papers describing how people behave in the Dictator Game. It knows that dictators typically split between 20% and 50%. It knows that trust-game first movers often send a little. It knows the general shape of the distribution.
The r=0.05 is the average of the per-game correlations. Each one asks: within a single game, can the twin predict how this specific person chose, relative to everyone else? No. Not even close. The twin knows how humans in general behave. It does not know how this particular human behaves.
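To make the gap concrete, here's a minimal sketch with fabricated data: a simulated twin that has absorbed only the typical behavior in each game and knows nothing about individuals. This is an illustration of how an aggregate number can look strong while individual prediction is absent, not a reconstruction of Stanford's exact pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

# Fabricated data: rows = 200 participants, columns = the five games;
# values = share of the endowment given, sent, or contributed.
rng = np.random.default_rng(0)
game_means = np.array([0.30, 0.50, 0.40, 0.55, 0.20])  # typical behavior per game

# Real humans: the game-level pattern plus large individual differences.
humans = np.clip(game_means + rng.normal(0, 0.15, size=(200, 5)), 0, 1)
# A twin that only knows the game-level pattern, nothing about individuals.
twins = np.clip(game_means + rng.normal(0, 0.05, size=(200, 5)), 0, 1)

# Aggregate, across-persons view: pool every (person, game) decision and
# correlate twin vs. human. Between-game differences alone make this look strong.
r_aggregate, _ = pearsonr(humans.ravel(), twins.ravel())

# Individual view: within a single game, correlate twin vs. human across
# people, then average over the five games. This rewards knowing how a
# particular person differs from everyone else, which this twin doesn't.
per_game_r = [pearsonr(humans[:, g], twins[:, g])[0] for g in range(5)]
r_individual = float(np.mean(per_game_r))

print(f"aggregate r = {r_aggregate:.2f}, average per-game r = {r_individual:.2f}")
```

In this toy setup the aggregate correlation lands well above 0.5 while the per-game average hovers near zero, even though the simulated twin was built to know nothing about any individual.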
For a research buyer, this is the difference between a tool that replicates population-level findings from published studies and a tool that predicts what a specific person on your panel will do. The first is useful for teaching. The second is what vendors are pitching. Stanford's own appendix shows the second isn't happening.
The part of the paper I find most revealing is the mechanism analysis, and I don't think Stanford realizes how much it gives away. They tested where the twin's accuracy actually comes from. They classified each GSS question as either directly retrievable from the interview transcript (the person literally said something that answers the question) or inferable from adjacent facts (the person said something nearby that implies the answer). When Stanford removed the 100 questions most likely to be directly retrievable, twin accuracy dropped from 83% to 80%. When they removed the 100 most likely to be inferable, it dropped from 83% to 77%. The demographic and persona agents stayed flat in both conditions, which is what you'd expect because they don't have interview data to retrieve from.
Translation: the twin's entire advantage over a plain demographic baseline is explained by the model looking up answers in the interview transcript or making adjacent inferences from what the transcript says. There is no deeper individual modeling happening. The model is not building a theory of the person. It is doing a very fast, very fluent ctrl+F across 6,500 words of transcript and guessing the rest from context.
Their fine-tuning experiment closes the loop on this interpretation. Stanford fine-tuned GPT-4o on 500 agents' actual GSS answers and tested it on the remaining 552. If the model were learning something transferable about individual-level prediction, fine-tuning on that task should improve performance. It didn't. The fine-tuned model performed slightly worse than the non-fine-tuned one with the original prompt. This is what you'd expect if the model had already extracted everything useful from in-context retrieval and further training was adding noise.
I think Park et al. are genuine about the methodological advance. I don't think they're hiding anything. But the paper is selective about which findings make the abstract, and the findings that make the abstract are the ones that support the headline number.
What Columbia Actually Reported
Toubia et al. v5 ran 19 pre-registered studies across 164 different outcomes. Each user digital twin was built from roughly 128,000 characters of that person's prior responses. That's about 25,000 words per participant, or the equivalent of a short novella of personal data grounding the model. Substantially more than Stanford's interview transcripts. If individual grounding scales, Columbia's twins should be the best-performing twins anyone has produced. They weren't.
Average correlation between twin and human responses across all 164 outcomes: r=0.20.
For readers who haven't worked with correlation numbers in a while, here's the translation. A correlation of 1.0 means perfect agreement. A correlation of 0 means no relationship at all. Correlations in psychology are considered small at 0.10, medium at 0.30, and large at 0.50. An r of 0.20 is in the small-to-medium range. It means there is a real but weak relationship between what the twin says and what the actual person says. The twin is doing better than guessing. It is not doing anywhere close to well enough to stand in for the person.
The second finding is the one that should stop practitioners in their tracks, and it's the one I can't believe more people aren't talking about.
Columbia compared the standard deviation of twin responses to the standard deviation of human responses across all 164 outcomes. The twin responses were less variable than the human responses in 154 out of 164 cases. That's 93.9%. The difference was statistically significant in 146 of them. Twins give answers that cluster more tightly than the humans they are supposed to be simulating. They do this almost universally, and they do it to a degree that is statistically detectable in almost every single outcome Columbia tested.
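Mechanically, the check is simple. Here's a minimal sketch on fabricated data; the 1-to-7 scale and the choice of Levene's test for the significance comparison are my assumptions for illustration, not a reconstruction of Columbia's pipeline.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)
n_people, n_outcomes = 500, 12   # toy scale; Columbia had 2,058 people and 164 outcomes

# Fabricated responses on a 1-7 scale: the humans are noisier than their twins.
humans = np.clip(rng.normal(4.0, 1.6, size=(n_people, n_outcomes)), 1, 7)
twins  = np.clip(rng.normal(4.0, 0.9, size=(n_people, n_outcomes)), 1, 7)

under_dispersed, significant = 0, 0
for j in range(n_outcomes):
    if twins[:, j].std() < humans[:, j].std():
        under_dispersed += 1
        if levene(twins[:, j], humans[:, j]).pvalue < 0.05:
            significant += 1

print(f"{under_dispersed}/{n_outcomes} outcomes where twins vary less than humans "
      f"({significant} statistically significant)")
```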
Here's why this matters. Most product decisions are not made on averages. They are made on the shape of the distribution. You care about the 15% of your users who are at risk of churning. You care about the 8% of power users who drive revenue. You care about the vocal minority who will complain publicly about a change you're about to ship. A simulation that systematically under-represents variance cannot see those tails. It gives you a population that looks like your users on average and is nothing like your users at the edges, which is where all the interesting product decisions happen.
Stanford does not report this metric. Columbia does. That absence in Stanford's paper is itself informative. When two research teams measure the same task family and only one of them measures whether the distribution matches, ask why.
Columbia organized their findings into five distortions. Each one has a specific mechanism behind it, and each one will show up in different ways depending on what your team uses twins for.
Stereotyping. Columbia compared full twins (trained on all 500 questions of individual data) against twins trained on demographics only. The full twins produced answers closer to the demographic twins than to the actual humans they were modeling. All that individual grounding — the personality tests, the economic games, the cognitive ability tests, the 279 personality items — did not pull the twin's outputs meaningfully away from what a demographic stereotype of that person would say. The rich grounding was not rich. The model was still defaulting to the archetype.
Insufficient individuation. The 93.9% under-dispersion finding. Twins cluster around the model's sense of what a person matching this description would typically say. They do not branch out into the idiosyncratic, the contradictory, or the unexpected ways real humans answer questions.
Representation bias. Columbia broke performance down by demographic subgroup. Twins were more accurate for participants who were more educated, higher income, more ideologically moderate, and more moderate in religious attendance. This is not a coincidence. These are the populations that are maximally represented in the text data LLMs are trained on. If your product serves a population that skews younger, lower-income, more politically extreme, or more religiously devout than the LLM's training distribution, your twins will be less accurate for your actual users than the headline numbers suggest. The 182-study review found this pattern across dozens of studies. Columbia is the first to measure it cleanly on a single large panel.
Ideological bias. Twins displayed coherent value structures that real humans do not share. The clearest example Columbia gives is that twins appeared to be simultaneously pro-technology and pro-human, which is a combination that real humans tend to resolve one way or the other. Twins resolve ideological tensions more neatly than people do because the model has been trained to produce internally consistent text.
Hyper-rationality. Twins behaved more rationally and displayed more knowledge than the humans they were simulating. This one is worth pausing on because it explains a lot of the other findings. Large language models are trained in two stages. First, on raw internet text. Second, on a process called RLHF, which stands for Reinforcement Learning from Human Feedback. RLHF is the step where human raters teach the model which kinds of responses are preferred. It's how LLMs learned to be helpful, polite, and rational. It's also why they cannot simulate the parts of human behavior that are rude, petty, inconsistent, or confused — which, if you've ever actually run a user interview, is most of the interesting parts. The twins are not failing to simulate real humans because the researchers did something wrong. They are failing because the underlying model has been trained specifically against the behaviors that make humans human.
Columbia's overall conclusion is that twin performance is only modestly better than a base LLM given no individual information at all. The rich grounding delivers a small lift. It does not deliver a transformation.
Where the Two Papers Actually Agree
Once you get past the adjectives, four findings survive in both papers.
Rich individual grounding gives a small absolute lift over cheap baselines. Stanford reports a 9-point gap between interview agents and demographic agents. Columbia reports that twins are "only modestly more accurate" than the base LLM. Both are describing the same gap from different framings. A 9-point gap on a task where the demographic-only baseline already scores 74% is not the same as individual-level simulation. It's a real effect and a small one.
The individual-level signal is driven by retrieval and inference, not by psychological modeling. Park et al.'s own mechanism analysis shows this. Columbia's stereotyping and insufficient-individuation findings show it from the other direction. The model is looking things up and making adjacent inferences. It is not building a theory of the person.
Twins compress variance. Columbia measures this directly. Park et al. never report it, and that silence is informative.
Economic games are where twins fail hardest. Columbia's hyper-rationality finding predicts exactly what Park et al.'s supplementary table shows: per-game correlations near zero. The aggregate correlation in the main text captures that twins know how the average human behaves in a Dictator Game, because that behavior is documented in hundreds of papers in the model's training data. What they don't capture is how specific individuals diverge from that average, because that information is not in the training data in any form the model can address.
When two research teams with different incentives report the same numbers, the numbers are the story.
What This Means for the Research Discipline
This is where I think the discussion needs to move, because most of the commentary on these papers is stuck arguing about whether twins work. That's the wrong question. The right question is what twins are actually good for and what they quietly corrode when misused.
I'll say what I think is genuinely unlocked first.
Heterogeneity exploration on a panel you already have. If you have a group of research participants you've interviewed or surveyed in depth, and you want to pre-test a new instrument against them, a twin-based approach will probably tell you something useful about which questions are likely to produce variance and which ones will flatten. Not because the twins are accurate, but because even their limited variance reveals where the real humans are likely to split. Both papers support this use. Columbia explicitly frames it as one of the few surviving applications.
Piloting and instrument design. If your survey is going to produce mostly the same answer from everyone, the twins will show you that before you field it. This is not prediction. This is diagnostic. A twin that gives the same answer for every simulated participant is telling you something about your question, not about your population.
Cold-start exploration when you have literally nothing. This is weaker ground. The risk is that the model produces plausible-looking text that substitutes for the actual exploration a junior researcher would otherwise do. If you're a mature research team, this is mostly a waste. If you're a team that would otherwise do no research at all, it might beat zero. Maybe. I'm not confident in that claim and I keep going back and forth on it.
Now the part where I think twins actively hurt the discipline.
Anything involving incentivized decisions. Economic games, willingness-to-pay, churn, conversion, switching behavior. The twins are hyper-rational. The humans are not. If you're using twin predictions to estimate how people will respond to a price change or a dark pattern, you will systematically underestimate the emotional, inconsistent, petty, distracted behavior that actually drives those decisions. I've seen teams do this kind of modeling with synthetic users and present the outputs to a pricing committee. Nobody in the room had the domain knowledge to push back. The pricing decision got made on the model's output.
Anything involving minority populations. Columbia's representation bias finding is specific and measurable. If your product serves a population that is not maximally represented in LLM training data, twins will be systematically less accurate for that population. You will get cleaner-looking data that is worse. I cannot stress enough how much this is the worst possible failure mode for a research tool.
Anything involving novel stimuli. The 182-study review covered this in depth. Park et al. and Toubia et al. confirm it from different angles. Twins work best on variations of well-documented stimuli and fail on anything genuinely new. If your research is about something new, which is most research that actually matters, twins are exactly wrong for it.
Anything that depends on capturing the tails. The under-dispersion finding is the one I think practitioners need to internalize most. Twins smooth. The smoothing looks like cleaner data. The smoothing is the failure.
What a Research Team Should Actually Do
If you're on a team where leadership is pushing digital twins as a cost-reduction play, you have a narrow set of arguments that will land.
Ask for the dispersion comparison. Not the accuracy number. Not the correlation number. The standard deviation of the twin predictions versus the standard deviation of the human responses, measured on the same task. If the vendor cannot produce this, they are not measuring the thing that matters for any decision your team will actually make.
Ask for performance on private, unpublished data. I've written about this before and I'll keep writing about it. If the demo uses a public survey, the demo is measuring retrieval. Internal survey data from the last two years, fielded on your actual product, with questions that don't exist in any training corpus, is the only meaningful test.
Ask for the economic-games or incentivized-behavior correlations specifically. If the vendor only shows attitude-question correlations, they're showing the friendliest task. Any behavioral task with real stakes will perform worse.
Ask which subgroups the system is most and least accurate for. Both papers now document representation bias. If the vendor hasn't measured it on their specific system, they either don't know or don't want to say.
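Those four asks mostly reduce to checks you can run yourself on a pilot, provided the vendor hands over paired twin and human responses. Here's a minimal sketch of the subgroup and dispersion checks; the column names are hypothetical, not any vendor's actual schema.

```python
import pandas as pd

# Hypothetical pilot data: one row per (participant, question), with the real
# answer, the twin's prediction, and a subgroup label you care about.
df = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
    "subgroup":    ["A",  "A",  "A",  "A",  "B",  "B",  "B",  "B"],
    "human":       [5, 2, 4, 1, 3, 5, 2, 4],
    "twin":        [5, 3, 4, 2, 4, 4, 4, 4],
})

# Representation-bias check: accuracy and error broken out by subgroup.
by_group = (
    df.assign(hit=(df["human"] == df["twin"]).astype(float),
              abs_err=(df["human"] - df["twin"]).abs())
      .groupby("subgroup")[["hit", "abs_err"]]
      .mean()
      .rename(columns={"hit": "accuracy", "abs_err": "mae"})
)

# Dispersion check from the first ask: twin variability vs. human variability.
sd_ratio = df["twin"].std() / df["human"].std()

print(by_group)
print(f"twin/human SD ratio = {sd_ratio:.2f}  (below 1.0 means the twins are smoothing)")
```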
If you're a research leader trying to decide where twins fit in your org, I'd say something like this. Twins are a real methodological advance for certain narrow uses, most of which are internal to the research craft rather than replacements for human data collection. They are not ready to be the front-line participant pool for substantive decisions. The commercial pitch has gotten ahead of the academic evidence by a meaningful margin, and the two papers that just updated are narrowing that gap from the academic side.
The Kinda in Kinda Works Is Doing a Lot of Work
I opened this piece saying user digital twins kinda work. I want to close on what "kinda" is actually carrying.
Kinda means: they produce outputs that correlate modestly with human responses, most consistently on topics heavily represented in training data. Kinda means: they beat a random baseline by a real but small margin. Kinda means: they give you something useful for piloting instruments and exploring heterogeneity on a panel you already have. Kinda means: the lift over simpler methods is genuine.
Kinda also means: they systematically under-represent variance. They distort minority populations. They act more rationally than humans do. They cannot predict novel stimuli. They correlate with individual behavior at r=0.20 across the broadest test anyone has run. They are better at retrieving what people like this usually say than at predicting what this specific person would say.
Both lists are true. Stanford is reading the first list and concluding twins are promising. Columbia is reading the second list and concluding twins are funhouse mirrors. The lists do not contradict each other. They are the same technology viewed through different thresholds for what counts as working.
Vendors will pick one list. Probably the first one. The 85% number will be in decks for years. The r=0.20 number will not. The damage happens in the gap between "they kinda work" and "they work," and that gap is where research teams get restructured, budgets get cut, and product decisions get made on data that was smoothed into uselessness before anyone could see the smoothing.
If you remember one thing from this, let it be that both Stanford and Columbia agree twins do something real. They just disagree about whether "something real" is enough for what you're about to use them for.
🎯 If you want the no-BS read on what's actually happening in UXR before it shows up in a vendor deck, subscribe to The Voice of User. Writing on AI-augmented research that doesn't stop at the abstract. No vendor pitches. No hype cycles. Just what the evidence actually says. Subscribe.
Studies referenced:
- Park et al., LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals (v2, April 2026): https://arxiv.org/abs/2411.10109
- Peng, Gui, Brucks, Toubia et al., Digital Twins as Funhouse Mirrors: Five Key Distortions (v5, April 19, 2026): https://arxiv.org/abs/2509.19088v5