
Synthetic Users Are Dead. Long Live "Survey Prediction." Same LLM. Same Problem.


Something is happening that I want to name before it calcifies into a trend.

The synthetic users pitch is collapsing. I just wrote about a 182-study systematic review that documented the failure across nine domains. The outputs are shallow, stereotypical, sycophantic, invariable, and dangerously believable. The companies selling synthetic users know this, or they should. Some of them seem to be reading the room. And when a product claim starts failing, companies don't usually admit the product doesn't work. They rebrand.

What I'm starting to see, and I want to be clear this is still early, is a pivot. The framing is shifting from "we simulate your users" to something closer to "we predict survey outcomes." Same LLM. Same architecture. Maybe a different wrapper. But the pitch changes. Instead of generating synthetic participants who respond like your users, the new claim is: give us a survey, and we'll tell you what people would say, faster and cheaper than fielding it.

I don't have a list of companies doing this yet. It's more of a vibe shift in the demos and the positioning. But I think this is going to become a thing, and I think the problems with it are already documented. They're just not being connected yet.

The demo that always works

Here's how this goes. You're a potential buyer. The vendor takes a well-known public survey. Something widely cited, maybe government data, maybe an industry benchmark, maybe something from Gallup or Pew. They run it through their system. They show you the results next to the published human results. The correlation is 85, 90 percent. Impressive.

You sit in that meeting thinking you're looking at prediction. You're looking at recall.

I think the strongest public example of this is still the Aaru/EY case I wrote about a few weeks ago. Aaru replicated EY's 2025 Global Wealth Research Report with a median correlation of 90% across 53 questions. The coverage treated it like a breakthrough. But the EY Global Wealth Research Report has been published online for years. Prior editions, methodology, findings, all of it, sitting in the training data. A 90% correlation on a document the model has almost certainly read is not a prediction. It's a lookup.

Why surveys

The original synthetic users pitch was broader. Simulate interviews. Run synthetic usability tests. Generate qualitative data. That's where the failure was most visible, and that's what the 182-study review mostly documents. An LLM producing shallow, stereotypical, invariable output is hard to hide when that output is supposed to be a 30-minute interview about someone's lived experience with a product. The qualitative stuff fell apart fast because there was nothing in the training data specific enough to fake it convincingly.

Surveys are different. There are decades of publicly available survey data online. Gallup, Pew, YouGov, the Cooperative Election Study, Eurobarometer, government census data, industry benchmark reports. Thousands of surveys, millions of responses, all indexed and searchable and sitting in LLM training corpora. That's a massive base of prior answers the model can draw on to approximate new ones.

If you're a company whose synthetic users product isn't holding up, surveys are the obvious pivot. They're the one research format where the LLM has the deepest well of prior data to pull from. The demos will look better. The correlation numbers will be higher. And the buyer probably won't ask why.

Why it works on public surveys

The systematic review has a section on overfitting and contamination that explains the mechanism. If the study you're trying to replicate was already published online, there's a likelihood that the data was used to train the LLM. Access to this information inflates algorithmic fidelity for in-distribution inputs without effective generalization to out-of-distribution inputs. Their words, not mine, but the translation is simple: the model performs well on things it's seen and poorly on things it hasn't.

The review found that benchmark overfitting is a specific and growing problem. Models are being tuned to project higher human likeness along with better cognitive capabilities. Future models may show increasing alignment with established experiments and psychometric inventories as a result of being trained on those benchmarks. The appearance of improvement is itself a distortion.

Hämäläinen et al. (2023) captured this in miniature with what they called the "Journey bias." Synthetic participants asked to name an artistic video game overwhelmingly picked Journey. A decade-old game, massively over-represented in online discourse about games as art. Human responses were varied and showed recency bias. The model wasn't predicting taste. It was regurgitating the most statistically prominent answer from its training data.

Scale that up from a single question about video games to a 53-question wealth management survey that's been publicly available for years, and 90% correlation starts to feel less like intelligence and more like a very expensive ctrl+F.

Where I think it breaks

The review found something that I think matters a lot for this pivot. Fine-tuning with real results taken from an identical task but a different context can have a negative impact on generalization. So even if the model nails Survey A because it's seen the data, running it on Survey B about a related topic with a different population can make performance worse, not better.

The studies that used genuinely unseen data found different results. The cleanest example in the review is a study of Romanian Gen Z travel experiences, data the model hadn't encountered during training. That's a more reliable test of actual simulation capability.

And the Aaru case has its own version of this. On 52 of the 53 questions, the system matched well. On the one question where the answer almost certainly wasn't in the training data, what heirs actually do with their parents' financial advisor, the model was off by 13 to 23 percentage points. The real retention rate is somewhere between 20 and 30 percent. Aaru predicted 43. The human survey predicted 82. Aaru was closer than the survey, which EY used as a selling point, but 43 is still wrong enough to misguide a retention strategy if you trusted it.

That one question is the whole story. On the 52 questions where the answers were publicly available, the model looked great. On the one question that required actual prediction, it didn't.

Maybe I'm reading too much into a single case. But I don't think so.

There's an independent analysis that makes the same point from a different angle. Adam Kucharski, an epidemiologist and data scientist, ran an exploratory experiment using Claude Haiku to predict publicly available YouGov survey results. He open-sourced the code, which is more transparency than most vendors offer. The results are interesting. On well-covered topics like daily driving patterns and favourite romcoms, the model did well. On less-covered topics like specific advertising parents worry about or awareness of women's rights figures, accuracy roughly halved. The mean absolute error was around 5% for the easy questions and about 11% for the harder ones.
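To make that scoring concrete, here's a minimal sketch of the comparison. This is not Kucharski's code, and the questions and numbers are placeholders I made up; the point is only how simple the gap measurement is:

```python
# Minimal sketch of scoring LLM survey predictions against published results.
# NOT Kucharski's code; the topics and numbers below are illustrative placeholders.

def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average absolute gap, in percentage points, between predicted and published shares."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical well-covered topic: decades of similar public data for the model to lean on.
well_covered_pred   = [62.0, 25.0, 13.0]   # e.g. "drive daily / weekly / rarely"
well_covered_actual = [58.0, 28.0, 14.0]

# Hypothetical thinly covered topic: little prior public data to draw on.
thin_pred   = [41.0, 34.0, 25.0]
thin_actual = [22.0, 48.0, 30.0]

print(mean_absolute_error(well_covered_pred, well_covered_actual))  # ~2.7 points
print(mean_absolute_error(thin_pred, thin_actual))                  # ~12.7 points
```

The pattern Kucharski reports is exactly this shape: small errors where the topic is saturated with public data, roughly doubled errors where it isn't.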

Kucharski noted in a comment that the surveys were conducted after the model's training cutoff, so the model hadn't seen those specific results. Fair point. But the model doesn't need to have seen that particular YouGov poll to approximate the answer. Whether British people prefer tea over coffee has been documented in public data for decades. The 2025 result is predictable from the 2020 result. The 2015 result. The performance gradient still tracks with how well-covered the topic is, not with temporal novelty.

To his credit, Kucharski is transparent about this. His own conclusion is that you should ask yourself "what could a research analyst reasonably deduce from public sources?" rather than treat AI as a black box that can predict behaviour. That framing is honest. It's also a description of retrieval, not prediction. And it's exactly the question the vendors are not encouraging their buyers to ask.

The pattern across all of this is consistent: LLMs perform well on things that are well-represented in training data and fall apart on things that aren't. If the entire survey prediction pitch is built on demos using public surveys, then the pitch is built on the one scenario where contamination is most likely.

This pattern has been documented in peer-reviewed work since 2023. Sanders, Ulinich, and Schneier, published in Harvard Data Science Review, ran 56,000 synthetic responses through GPT-3.5 and compared them to the Cooperative Election Study. On well-established policy issues like abortion bans and SCOTUS approval, ideological correlation was above 85%. On the war in Ukraine, a topic that arose after the model's training data was collected, the model failed. It predicted strong liberal opposition to U.S. involvement, applying anti-interventionist patterns from the Iraq era. The actual survey showed near-uniform support across the political spectrum. The model couldn't generalize to a topic it hadn't absorbed. That was two years ago. The vendors pitching survey prediction today are still running their demos on public data.

A test no one is running

Here's what I want to see someone do. It's simple.

Take an internal survey your organization has already fielded. Something that has never been published. Not the results, not the questions, not the methodology. A private survey on a topic specific enough that the model couldn't have encountered equivalent data during training.

Run it through one of these survey prediction systems. Compare the results to your actual human data.

If the system hits 85-90% correlation on private, unpublished data with the same fidelity it shows on public benchmarks, then I'm wrong and I'll write about it. If it collapses, which is what the review's contamination findings predict, then every demo you've seen using public survey data was measuring recall, not prediction.
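If you want to score it the same way the demos do, here's a minimal sketch. The file names and column layout are assumptions for illustration, not any vendor's format; the math is just a Pearson correlation and an average error:

```python
# Minimal sketch of the private-survey test. CSV layout and file names are
# assumptions; plug in whatever format your own data actually uses.
import csv
from statistics import correlation  # Pearson correlation, Python 3.10+

def load_shares(path: str) -> dict[str, float]:
    """Read (question_id, answer share in percent) rows into a dict."""
    with open(path, newline="") as f:
        return {row["question_id"]: float(row["share_pct"]) for row in csv.DictReader(f)}

human = load_shares("private_survey_human.csv")          # your fielded, unpublished results
predicted = load_shares("private_survey_predicted.csv")  # the vendor system's output

common = sorted(set(human) & set(predicted))
h = [human[q] for q in common]
p = [predicted[q] for q in common]

r = correlation(h, p)
mae = sum(abs(a - b) for a, b in zip(h, p)) / len(common)

print(f"Pearson r = {r:.2f} over {len(common)} answer options")
print(f"Mean absolute error = {mae:.1f} percentage points")
# If r lands well below the 0.85-0.90 shown on public benchmarks,
# those demos were measuring recall, not prediction.
```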

This is not a complicated experiment. Any organization that has fielded a proprietary survey in the last two years already has everything needed to run it. The fact that nobody seems to be doing this, or at least nobody is publishing the results, is itself informative.

I understand why the vendors don't run this test. What I don't understand is why the buyers aren't demanding it.

The rebrand doesn't fix the mechanism

I want to be fair. I don't know the internal architecture of every company pivoting in this direction. Maybe someone has figured out something the 182 studies in the review haven't. Maybe there's a layer on top of the LLM that genuinely improves out-of-distribution prediction. I haven't seen evidence of it, but I'm open to being surprised.

What I'm not open to is the framing. Renaming synthetic users as survey prediction doesn't change what the underlying system can and cannot do. If it works because the LLM has seen the data, then the product is a very expensive way to look up answers that are already public. If it works on private data too, then show us. Run the test. Publish the results.

Until someone does, the 90% number is marketing. And the pivot from synthetic users to survey prediction is a costume change, not a capability change.

I keep coming back to something the systematic review said in its conclusion. LLMs predict which words are most likely to come in sequence. If the sequence was already in the training data, the prediction will look good. That's retrieval. And no amount of rebranding changes the underlying mechanism.

If you've run one of these systems against private survey data, hit me up. I genuinely want to know what happened. Especially if it worked. That would be a much more interesting article to write.

🎯 If you want the no-BS read on what's actually happening in UXR before it shows up in a vendor deck, subscribe to The Voice of User. One or two posts a week. No sponsors. No pitches.