The Largest Review of Synthetic Participants Ever Conducted Found Exactly What You'd Expect. Synthetic Users Don't Work.
A systematic literature review is usually the moment a field either validates itself or gets its autopsy. This one tries to be both, and I'm not sure the authors fully realize that.
A team at UXtweak Research and the Slovak University of Technology in Bratislava just published a preprint systematic review of 182 studies on LLM-generated synthetic participants. Preprint means it hasn't been peer-reviewed yet. The methodology is rigorous enough that it would likely survive review, but it's worth noting that the framing and discussion sections haven't had external pushback yet.
Anyway, they searched Scopus, Web of Science, IEEE Xplore, and the ACM Digital Library. Started with 4,400 publications. Ran deduplication, LLM-assisted eligibility screening validated by manual review (zero false negatives in the rejected set), backward and forward reference searches, and quality assessment that required journal rankings of Q3 or higher and conference rankings of B or higher. Preprints had to pass a 15-out-of-19-item quality checklist. What survived was 182 studies spanning psychology, HCI, social science, education, healthcare, economics, marketing, and mobility research. Every domain. Every prompting technique. Every model from GPT-2 to GPT-4o, LLaMA, Claude, Gemini, Mistral, Qwen, DeepSeek.
The findings are not ambiguous. But they are uneven. Some are strong, some less so, and that unevenness matters more than the headline.
The mind that isn't there
The review starts with the psychology. Not because psychology is the most commercially relevant domain for synthetic participants, but because it's the one that explains why the problems in every other domain keep showing up.
LLMs don't have cognitive processes. They don't reason causally. They don't have embodied experience, personal goals, subjective emotions, or memory in any meaningful sense. You already know this. The review just documents it with a level of specificity that's harder to wave away.
GPT-4 outperforms humans on cognitive reflection tests. Hagendorff et al. (2023) showed this is a natural outcome of tuning on cognitive reasoning benchmarks. GPT-3, the older model, actually showed cognitive errors at rates similar to humans, which made it paradoxically more useful for participant simulation. But even GPT-3 lacked the varied biases and patterns humans demonstrate during decision-making. No reflection effect. Failures in causal reasoning. The newer model got smarter at tests. It didn't get closer to thinking like a person. I realize that distinction sounds obvious when you say it out loud, but a lot of the synthetic participants literature seems to treat benchmark performance and human-likeness as the same thing, and I keep seeing that conflation in industry too.
Elyoseph et al. (2023) found that LLMs achieved near-perfect scores on emotional awareness assessments. Unrealistically encyclopedic. The emotional responses LLMs select can be deemed socially appropriate, but they deviate from how humans actually respond. High emotionality as the default state. Overexaggeration. Reduction or complete erasure of negative emotions. One study found that jealousy was avoided entirely. RLHF training optimizes for helpfulness and harmlessness at the expense of anything that might make the model seem unpleasant. The emotional flattening is baked in.
Personality was the single most studied psychological factor in participant synthesis. Models can be steered with prompts, but their outputs occupy a bounded region of the personality space, clustered around the model's base distribution. You can push, but you can't push far. Larger instruction-tuned models adopt an imposed personality with higher accuracy, but the resulting trait scores are unrealistically high and the profiles more factorially pure than human ones, which are multifaceted and self-contradictory. Dorner et al. (2023) showed poor model fit and warned against misinterpreting Cronbach's alpha as high reliability when the assumption of good fit is violated. LLMs gave positive answers to reverse-coded items. That's a construct validity failure that would get a human dataset thrown out of a methods class.
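The reverse-coding failure is easier to see with the mechanics in front of you. Below is a minimal sketch, mine, not Dorner et al.'s, with fabricated Likert data: flip the reverse-coded items so every item points the same direction, then compute Cronbach's alpha. A "participant" who agrees with an item and with its reverse-coded twin, which is the failure the review documents in LLMs, drags alpha down as soon as the flipping is applied.

```python
from statistics import pvariance

def cronbach_alpha(responses, reverse_coded, scale_max=5):
    """responses: one list of Likert scores (1..scale_max) per participant.
    reverse_coded: indices of items whose scoring must be flipped."""
    # Flip reverse-coded items so every item points the same direction.
    scored = [[scale_max + 1 - v if i in reverse_coded else v
               for i, v in enumerate(row)] for row in responses]
    k = len(scored[0])
    item_vars = [pvariance([row[i] for row in scored]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scored])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Fabricated data: item 1 is reverse-coded, yet this "participant pool"
# agrees with it as enthusiastically as with everything else.
llm_like = [[5, 5, 4, 5], [5, 4, 5, 5], [4, 5, 5, 4]]
print(round(cronbach_alpha(llm_like, reverse_coded={1}), 2))
```

And per Dorner et al., even a high alpha wouldn't save you: when the underlying factor model doesn't fit, alpha stops being evidence of reliability at all.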
Theory of Mind, the ability to reason about others' mental states, was another gap: LLMs remained inferior to humans even after improvements from modeling intentions and emotions. The authors attributed this to a lack of subjectivity in reasoning.
I don't know how you engineer subjectivity into a text predictor. I don't think anyone does. Maybe that's the wrong framing entirely, but the review doesn't go there, and I'm not going to pretend I have a better one.
How the distortions compound
The review's taxonomy of distortions is long. I'm going to go through it because the details are what make the case, not the summary.
The most consistent problem across studies seems to be the lack of realistic variability and diversity. The review calls it "perhaps the most universal and ubiquitous bias in synthetic participant data." LLM responses are verbose, overstructured, grammatically correct, and lexically diverse. They also have the experiential range of a single person having a good day. In a test-retest of a typicality rating task, the LLM's responses on consecutive days resembled the same person. A generated dataset may compete with a single participant, but a sample of just two or more humans already beats it. Gupta et al. (2025) found that humans generate significantly more original responses than LLMs, including errors caused by memory lapses and misunderstandings, because those errors are evidence of cognition. Humans follow a trail of reasoning instead of a rigid structure. They share details of questionable relevance. That's the stuff that makes human data worth having.
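One crude way to operationalize that test-retest comparison, and this is my sketch, not a method from the reviewed studies, which used a range of measures: treat average pairwise TF-IDF cosine similarity as a proxy for how much a "sample" keeps saying the same thing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(texts):
    """Average cosine similarity over all pairs of responses. A sample of
    distinct humans should score low; an LLM 'sample' tends to score high,
    because it is the same voice answering every time."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(texts))
    n = len(texts)
    return sum(sims[i, j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)

# Invented examples for illustration.
llm_two_days = ["I value clean, intuitive design above all else.",
                "Above all else, I value intuitive and clean design."]
two_humans   = ["honestly i just want the app to stop logging me out",
                "I like when it remembers what I was doing last time."]
print(mean_pairwise_similarity(llm_two_days))  # high: the same "person" twice
print(mean_pairwise_similarity(two_humans))    # low: an actual sample
```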
Hämäläinen et al. (2023) discovered what they called the "Journey bias." Synthetic participants asked to name an artistic video game overwhelmingly picked Journey. An over-a-decade-old critical darling of online discourse about games as art. Over-represented in training data. Human responses were significantly more varied and showed recency bias. LLMs don't have opinions. They have the averaged residue of other people's opinions, weighted by how much those opinions were discussed online. I think about this example a lot because it's so mundane. It's a question about video games. And even there, the model can't get out of its own training data.
Generated personas project high consistency, but the review argues that consistency is probably a product of machine biases: low diversity, stereotypicality, and positivity. The natural contradictions rooted in the plurality of lived human experience were missing. An LLM-generated persona of a social worker with a shopping addiction contradicted real data on financial constraints, but the deeper failure was that it lacked the grounded nuance explaining how the addiction developed in such a person and how it affects them. Real humans contain contradictions that make sense when you hear the story behind them. LLMs produce contradictions that don't, because there is no story. Maybe I'm over-indexing on this one example, but it's the kind of thing that makes the whole enterprise feel hollow.
Gao et al. (2025), published in PNAS under the title "Take caution in using LLMs as human surrogates," documented LLMs justifying an option by risk aversion, then selecting the same option regardless of risk. The reasoning was disconnected from the decision. As conversations get longer, inconsistency increases. Extended context and elapsed time enable extraneous manipulation. The assumption that each LLM conversation represents a separate individual may itself be wrong.
The stereotyping goes deeper than demographics. The training data is written overwhelmingly by white males. The outputs are USA-centric, with discrepancies that indicate missing social and cultural nuance for other countries. Venkit et al. (2025) identified exoticism, dehumanization, erasure, and benevolent bias. Instead of representing minorities within the depth of human experience, LLMs cast them into flat archetypal roles: "cultural ambassadors" and "resilient survivors." That phrasing is from the review and it's worth sitting with.
RLHF makes the problem worse in ways that look like good manners. LLMs over-lean on fairness and prosocial behavior toward game opponents. They generate favorable reviews. They produce flattering, benevolent personas. They show high tolerance against usability issues. They conflate having dislikes with pickiness while obfuscating latent reasons. They adhere to a formal conversational style that makes them useless for simulating online troll behavior, which the review describes as coming across as "milquetoast." I'll be honest, I hadn't thought much about the troll simulation use case before reading this, but the point is good: if your model can't produce behavior that's rude or irrational or petty, it can't simulate a large portion of actual human behavior.
In healthcare interviews, LLMs cited guidelines instead of reflecting beliefs. A model that has memorized what a patient should say according to best practices will produce clean, guideline-aligned output. It will not produce what a patient would actually say.
Hyperaccuracy is getting progressively worse with larger RLHF-tuned models. GPT-4 knows too much, too cleanly, too structurally. And even the Centaur model, explicitly designed to replicate human cognition, fell short.
Then there are the hallucinations. LLMs generate substantially more topics, needs, questions, and usability issues than human responses support, and these get mixed in with real patterns. In usability simulation, they obscure the most pertinent real answers. In feedback generation, users may feel pressured to incorporate hallucinated insights because the output looks authoritative. The model gets things wrong while sounding confident about it, which is a bad combination when the person reading the output doesn't have the domain expertise to push back.
The believability trap
The believability of LLM participants is excellent. Experts in one study couldn't distinguish synthetic personas from human-generated ones on surface inspection. The text is clean, structured, elaborate.
And shallow, and stereotypical, and perspectiveless, and occasionally fabricated. The review's assessment: believability "may actually be more detrimental than beneficial by lending false credibility to misleading conclusions."
Over fifteen primary studies explicitly caution against treating synthetic participants as substitutes. That's not unanimous, but it's not a fringe view either. It's the prevailing conclusion of the researchers who actually ran the studies.
Trust in LLM-generated personas can impede the ability to empathize and understand others. It reduces the depth of mental engagement. You don't lean into understanding someone when you think the machine already did it for you. Your junior researcher stops pushing because the AI output already filled the template. Your PM stops asking for the study because synthetic responses are cheaper and faster and nobody in the room can tell the difference on first glance. That part is less about the technology and more about organizational incentives, but the review touches on it.
LLMs also have a self-favoring bias. When an LLM evaluator compared conversations from baseline GPT-4 against a more sophisticated proposed framework, it preferred baseline GPT-4. Human experts and users preferred the other framework. Make of that what you will.
Contamination
If the study you're trying to replicate was published online, the LLM may have already seen it during training. Apparent fidelity might be memorization. LLMs may appear to apply Theory of Mind by memorizing correct solutions for benchmark experiments. Future models may show increasing alignment with established experiments and psychometric inventories because they've been tuned on those benchmarks.
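If you have a candidate source document, the crudest contamination probe is verbatim n-gram overlap, the same basic idea the GPT-3 paper used (with 13-grams) in its contamination analysis. A sketch, mine, not the review's:

```python
def ngram_overlap(candidate: str, source: str, n: int = 13) -> float:
    """Fraction of the candidate's n-grams that appear verbatim in the
    suspected source. High overlap suggests memorization, not simulation."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    cand = ngrams(candidate)
    return len(cand & ngrams(source)) / (len(cand) or 1)
```

It only catches verbatim regurgitation; paraphrased memorization, the more common and more insidious case, sails right through. Which is part of why contamination is so hard to rule out.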
One study partially addressed the contamination problem by using data the model genuinely hadn't seen: Romanian Gen Z travel experiences. That's probably a more reliable approach, but it also highlights how narrow the window of trustworthy evaluation is. Most of the studies in this review didn't control for contamination, a weakness the review itself acknowledges.
This also makes me wonder about something I wrote about recently. Aaru's synthetic research platform replicated an EY wealth survey with 90% correlation and the coverage treated that as a breakthrough. But the EY Global Wealth Research Report is a widely published document; prior editions have been online for years. The review documents that when LLMs appear to replicate published studies with high fidelity, it may just be memorization of training data rather than actual simulation of human behavior. On the one question whose answer wasn't already sitting in the training data, what heirs actually do with their parents' financial advisor, the model was off by 13 to 23 percentage points. Maybe that 90% correlation says more about what the model has read than what it can predict. I wrote more about the Aaru case here.
A graveyard of reasonable ideas
The review catalogs every technique researchers used to improve synthetic participants. I kept expecting one of them to show a real breakthrough. None did.
Zero-shot prompting was the most common approach. Few-shot prompting helped marginally. Chain-of-thought reasoning produced inconsistent results across models. Prompt chaining improved diversity, but not to human levels. RAG helped with alignment, but hallucinations persisted. Temperature and parameter tweaking had minimal observable effects on uniformity or thematic diversity. I could keep going but the pattern is the same: marginal improvement, fundamental problems unchanged.
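For a feel of what "minimal observable effects" means as an experiment: sample responses at several temperatures, score diversity, watch the needle barely move. Everything below is a toy of mine, the `generate` stub and its canned `POOL` are placeholders for a real model call, not anything from the reviewed protocols.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

POOL = ["Journey, for its wordless storytelling",
        "Journey, the obvious answer honestly",
        "Gris, for the watercolor art direction",
        "Outer Wilds, for its curiosity-driven loop"]

def generate(prompt: str, temperature: float) -> str:
    # Toy stand-in for an LLM call. Higher temperature widens the candidate
    # pool a little, mimicking the marginal gains the review reports.
    k = 2 if temperature < 0.8 else len(POOL)
    return random.choice(POOL[:k])

def diversity(texts):
    # 1 minus average pairwise TF-IDF cosine similarity.
    sims = cosine_similarity(TfidfVectorizer().fit_transform(texts))
    n = len(texts)
    return 1 - sum(sims[i, j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)

for t in (0.2, 0.7, 1.0, 1.3):
    sample = [generate("Name an artistic video game.", t) for _ in range(30)]
    print(t, round(diversity(sample), 2))
```

Swap the stub for a real client and you get the comparison the review describes: the diversity number drifts with temperature, but per the reviewed studies it never climbs to where a human sample sits.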
Persona modeling with demographics was the most commonly attempted fidelity strategy. Its effectiveness: "mixed to questionable." The limited instances where demographics helped could be attributed to surface-level associations. Physical health influenced by age and country. Travel habits influenced by marital status and number of children. These are associations the model memorized, not implications it understood. I'm not entirely sure that distinction matters in every case. Sometimes a memorized correlation is useful. But when you need the model to do anything beyond retrieving a correlation, it falls apart.
The studies that achieved the highest alignment did it by including the expected results in the prompt. One study achieved alignment by feeding the taxonomy and frequency of found issues into the prompt. Another required hyper-tuned directives with guidance on how to interpret specific correlations between attributes. And even giving the model access to qualitative data from real individuals didn't lead to psychometric alignment or realistic variability. So the best way to make synthetic participants match human data is to give them the human data first. I probably don't need to spell out why that defeats the purpose.
Merely naming a persona altered outputs. Framing the same response as a therapy conversation versus a blog post produced different results. Prompts tuned for one model didn't generalize to other models. Adding information could improve fidelity, but large prompts overwhelmed LLMs, causing key information to be ignored. More context didn't mean better personas. It meant erratic behavior.
Cognitive modeling was the most ambitious direction. Memory modeling used external memory streams with retrieval based on contextual weight, periodic reflection for abstract inferences, and forgetting to suppress clutter. It achieved "considerable strides" over baseline. But "significant differences from real human behavior are still evident." Modeling emotions and intent yielded "only limited improvements" to Theory of Mind reasoning. This is the direction that seems most promising to me, for what that's worth, but the results so far are a long way from where they'd need to be.
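For the curious, the memory-stream pattern looks roughly like this. It's my sketch of the general recipe popularized by the generative-agents line of work, not code from any reviewed paper; the weights, decay rate, and thresholds are arbitrary choices of mine.

```python
import time

class MemoryStream:
    """Sketch of the memory-modeling pattern: an external store with
    retrieval weighted by recency, importance, and relevance, plus
    reflection and forgetting. All parameters here are illustrative."""

    def __init__(self, half_life_hours=24.0):
        self.half_life = half_life_hours
        self.memories = []  # each: dict(time, importance, text)

    def add(self, text, importance):
        # importance in 0..1, assigned by the caller (in the reviewed
        # work, typically by an LLM judging its own observation).
        self.memories.append({"time": time.time(),
                              "importance": importance,
                              "text": text})

    def retrieve(self, query, k=3):
        """Rank by recency + importance + crude lexical relevance,
        summed with equal weights."""
        now, q = time.time(), set(query.lower().split())
        def score(m):
            recency = 0.5 ** ((now - m["time"]) / 3600 / self.half_life)
            relevance = len(q & set(m["text"].lower().split())) / (len(q) or 1)
            return recency + m["importance"] + relevance
        return sorted(self.memories, key=score, reverse=True)[:k]

    def reflect(self, summarize):
        # Periodically compress recent memories into an abstract inference.
        # `summarize` is caller-supplied; in the reviewed work, an LLM call.
        recent = [m["text"] for m in self.memories[-10:]]
        self.add(summarize(recent), importance=0.9)

    def forget(self, floor=0.2):
        # Drop low-importance memories to suppress clutter.
        self.memories = [m for m in self.memories if m["importance"] >= floor]
```

The equal-weight sum is the standard recipe; the reviewed studies vary the ingredients, but per the review, none of the variations closed the gap to real human behavior.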
The review's overall assessment: fidelity improvements remain "modest." The approaches with better human alignment all rely on context-specific empirical data. Which means the approaches that work best require you to already have the data you were trying to avoid collecting. There might be a version of this where that tradeoff makes sense for augmentation purposes, but the review doesn't make that case strongly, and the evidence doesn't either.
Bigger models, same problems
Some studies assigned higher fidelity to newer, larger models. But smaller, older models sometimes exhibited less distortion, including better representation of human biases. GPT-3 showed more human-like cognitive errors than GPT-4. That's an awkward finding for the "wait for the next model" crowd.
The review cannot recommend specific models as superior for specific simulation tasks. Prompt brittleness, stochastic unpredictability, contextual sensitivity. Any apparent advantage is highly context-specific.
Fine-tuning on context-specific data let small models like Llama-8b and DistilGPT-2 compete with or outperform GPT-4. But performance may just be mimicking the training data. And fine-tuning with real results from an identical task but a different context can hurt generalization. I could be wrong, but I don't think scaling solves this.
The metrics are lying too
The measures researchers use to evaluate synthetic participants are themselves part of the problem.
By using the same validity measures traditionally used for human data, synthetic responses can be interpreted as "surpassing" human responses. LLM outputs appear more consistent, more informative, more lexically diverse, and they contain more topics. But these measures conflate elaboration with depth. What looks like richness is elaborate language repeating the same stereotypes, hallucinated or not. What looks like topical breadth may be hallucinated usability issues that no human participant ever mentioned.
If your evaluation framework can't tell the difference between a persona that sounds human and a persona that thinks like one, the framework is a rubber stamp. I suspect most teams using synthetic participants haven't thought about this at all. I'm not even sure what a good evaluation framework would look like yet, though the review suggests creativity metrics like surprisal and semantic diversity as starting points.
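If you wanted a starting point, here's a deliberately crude version of the surprisal idea, mine, not the review's: score each response against a unigram model of the pooled sample, so a response that only repeats what everyone else said scores low. A real evaluation would compute surprisal under a proper language model, and semantic diversity would be the embedding-space analogue of the pairwise-similarity sketch earlier; this is just the shape of it.

```python
import math
from collections import Counter

def mean_surprisal(response, counts, total, vocab):
    """Average -log2 p(word) under a unigram model of the pooled
    responses, with add-one smoothing. Low surprisal means the
    response mostly repeats what the rest of the sample said."""
    words = response.lower().split()
    return sum(-math.log2((counts[w] + 1) / (total + vocab))
               for w in words) / len(words)

# Invented examples, echoing the Journey bias from earlier.
responses = ["Journey is the most artistic game",
             "Journey obviously Journey for the visuals",
             "Kentucky Route Zero stuck with me for years"]
counts = Counter(w for r in responses for w in r.lower().split())
total, vocab = sum(counts.values()), len(counts)
for r in responses:
    print(round(mean_surprisal(r, counts, total, vocab), 2), "|", r)
```

On this toy data, the Journey-echo answers score lower than the answer nobody else gave, which is the discrimination you want: a metric that rewards saying something the rest of the sample didn't.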
Where the paper pulls its punch
After documenting failure across 182 studies spanning nine domains and eight task categories, the discussion section goes searching for niches where synthetic participants might still be useful. Heuristic supplements. Hypothesis interrogation. Cold start problems. Piloting instruments before involving humans.
The paper's framing of synthetic participants as "heuristic-like" is genuinely useful. It's conceptually precise. The paper even connects this to Large Reasoning Models, noting that even LRMs designed for reasoning through self-reflection operate heuristically: they either find a correct solution to a simple problem early and then overthink it, or the problem is complex and their reasoning collapses.
But the discussion spends several pages trying to find scenarios where predicting words is close enough to understanding people. It proposes that the approaches requiring context-specific empirical data can be developed toward reusability. It suggests augmentative human-in-the-loop approaches are underexplored and promising. It carves out space for synthetic data in cold start problems and machine learning pre-training.
Some of these are defensible in narrow terms. But the framing softens the central finding, and that matters because of who will read it and how. I don't think these are gaps that get fixed with better prompts or bigger models. They seem more tied to what LLMs actually are. The paper says this explicitly in its conclusion: LLMs predict which words are most likely to come in sequence. It even reaches for a comparison: averaged human faces have more symmetrical features and are rated as more attractive. But both the average face and the average response offer, in the paper's words, "only a distorting mirror of humanity and its idiosyncrasies."
I think the commercial proximity makes it hard for the authors to fully commit to that conclusion. Which is understandable. But it's noticeable.
The UXtweak question
Two of the three authors are affiliated with UXtweak Research. The paper names the UXtweak platform in its opening paragraph, complete with a footnote linking to their website. The conflict of interest statement says "no competing interests."
I'm not accusing anyone of dishonesty. Two of the three authors also hold academic positions at Slovak University of Technology, and the methodology is rigorous. If anything, the affiliation makes the paper's negative findings more credible: these authors have commercial proximity to a space that would benefit if synthetic participants worked, they had every incentive to find a way, and the evidence wouldn't let them.
Watch the framing though. "Even the systematic review says they can be useful as supplements" is the predictable downstream misread. If this paper gets cited in a vendor pitch deck, it will be the "heuristic supplement" paragraph that survives, not the 25 pages explaining why the heuristic is unreliable.
What you're actually losing
Synthetic participants fail while looking like they succeed. The paper calls this "misleading believability." The outputs are clean and structured and elaborate. They fill the template. They feel like data.
I don't think of them as data. They're closer to text that was statistically likely to appear in that context, generated by a system that has never used your product, never been frustrated by your onboarding flow, never made a decision under the competing pressures of a real afternoon in a real life.
A heuristic is useful when you know what you're losing. Most teams reaching for synthetic participants don't. They see plausible text and assume plausible insight. They see grammatically correct, well-structured, comprehensive-looking responses and mistake elaborative surface for experiential depth.
The review documents what you're actually losing: causal reasoning, emotional authenticity, personality complexity, cultural nuance, the capacity for genuine surprise, the errors and contradictions and irrelevant details that are evidence of a mind at work.
182 studies across nine domains. LLMs can generate text that looks like participant data. Of course they can. Looking like participant data has never been the hard part. I don't know what it would take to convince the people who need convincing that the hard part matters, but at least the evidence is now in one place.
🎯 More like this at The Voice of User. No vendor pitches. No hype cycles. Just what the evidence actually says. Subscribe.