Question: Is Aaru actually proving that synthetic research can predict human behavior and replace real user research?

I got asked some version of this a few days ago after the Wall Street Journal profile of Aaru started making the rounds. I had already been meaning to write about it, because I think this is exactly the kind of story that will get pulled into meetings, pasted into decks, and used as evidence that synthetic research has now been validated at scale. It has not. What the public evidence shows is narrower, weaker, and much more conditional than the coverage makes it sound.

Here is what happened.

The hook of the profile was that EY used Aaru's synthetic research platform to replicate its 2025 Global Wealth Research Report, work that normally takes six months of fieldwork, in a single day, reporting a median correlation of 90% across 53 single-select questions. That number is real. The conclusion people are drawing from it is not.

A 90% correlation on survey replication is not proof that a system can predict human behavior. It is proof that a system can approximately mirror what people say they would do on a questionnaire. Those are not the same thing. Behavioral research has spent decades documenting the gap between stated intentions and actual behavior. That gap is not a footnote. It is often the entire finding. EY's own writeup quietly admits as much when it says it cannot claim the system guarantees what people will actually do. That sentence is doing a lot of work and it is buried.
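
For concreteness, here is what a benchmark like that plausibly measures. EY has not published the formula, so this is a minimal sketch under one assumption: for each single-select question, you correlate the human answer-option shares with the synthetic ones, then take the median across questions. The per-question numbers below are invented to make the point.

```python
# Hypothetical sketch of a "median correlation across questions" metric.
# The answer-share data is invented; EY's actual computation is not public.
from statistics import median

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (human answer-option shares, synthetic answer-option shares) per question
questions = [
    ([0.40, 0.30, 0.20, 0.10], [0.38, 0.32, 0.19, 0.11]),  # close match
    ([0.30, 0.25, 0.25, 0.20], [0.05, 0.15, 0.30, 0.50]),  # badly wrong
    ([0.50, 0.30, 0.15, 0.05], [0.45, 0.33, 0.16, 0.06]),  # close match
]

scores = [round(pearson(h, s), 2) for h, s in questions]
print(scores)          # [0.99, -0.94, 0.99]
print(median(scores))  # 0.99 -- the median hides the failure
```

The point is narrow: a metric like this is a statement about agreement on stated answer distributions, and the median can sit near 1.0 while individual items are badly wrong. It says nothing about whether anyone will behave the way they answered.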

To be fair, EY presents two cases where the simulation outperformed the survey. I am going to walk through both.

The first is the inheritance case. The human survey said 82% of heirs would keep their parents' financial advisor. Aaru predicted 43%. Actual industry retention is somewhere between 20% and 30%. EY frames this as the simulation winning because it was closer to reality than the survey. And sure, 43 is closer to 25, the midpoint of that range, than 82 is. But the model was still off by 13 to 23 percentage points, depending on where in that range reality sits, on a question that drives real client retention strategy.
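
The gap is easy to state precisely. Using 25 as the midpoint of the 20 to 30 percent retention range EY cites:

```python
survey, model = 82, 43       # % who keep the advisor: stated vs simulated
low, high, mid = 20, 30, 25  # EY's actual-retention range and its midpoint

print(abs(survey - mid))   # 57 points off for the survey
print(abs(model - mid))    # 18 points off for the model
print(abs(model - high), abs(model - low))  # 13 to 23, depending on where
                                            # in the range reality sits
```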

The second case is cleaner. On whether high-net-worth individuals prefer a single financial provider, the survey said 69%. Aaru predicted 37%. Reality is 33%. That one Aaru essentially nails. Credit where it is due.

But here is the problem. Two cherry-picked examples from one study do not constitute validation of a general-purpose behavioral prediction system. One case where the model was still materially wrong, one case where it got close, that is a start, not a proof point. You do not ship a strategy based on two highlighted examples from a white paper written by a financially involved partner. EY is now using Aaru to replace traditional surveys for some of its clients. That is a significant operational commitment to rest on a benchmark this thin.

The mechanism deserves more scrutiny than it gets in either piece. EY describes synthetic agents built from demographic, behavioral, and sentiment inputs (census data, transaction records, social media sentiment) that model decision-making behavior in simulated environments. On the public evidence, that is a sophisticated persona-construction engine. It generates agents that look plausible and behave coherently because they are built from text and data representing how people are described. Coherent and predictive are not synonyms. A system can produce outputs that feel right without being validated against what your actual users do in your actual context. That failure mode is not hypothetical. It is the default risk of any system built this way.
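
EY's description maps onto a familiar pattern. What follows is not Aaru's architecture, which is not public; it is a minimal sketch of the generic persona-construction approach the public description is consistent with, with a stubbed-out model call standing in for whatever LLM actually sits underneath.

```python
# A generic persona-construction loop, NOT Aaru's actual system.
# `call_model` is a hypothetical stand-in for an LLM API; it is stubbed
# here so the sketch runs without any external service.
from dataclasses import dataclass

@dataclass
class Persona:
    age: int
    region: str
    net_worth_band: str
    recent_behavior: str  # e.g. summarized from transaction records
    sentiment: str        # e.g. summarized from social media data

def persona_prompt(p: Persona, question: str) -> str:
    return (
        f"You are a {p.age}-year-old in {p.region}, net worth {p.net_worth_band}. "
        f"Recent behavior: {p.recent_behavior}. Current mood: {p.sentiment}. "
        f"Answer as this person would: {question}"
    )

def call_model(prompt: str) -> str:
    # Stub. A real system would call an LLM here. The key property: the
    # answer is generated from a *description* of a person, so it will be
    # coherent with that description whether or not it matches what any
    # real person in that segment would actually do.
    return "Keep my parents' advisor."

panel = [
    Persona(34, "US Northeast", "$1-5M", "sold employer stock", "anxious about rates"),
    Persona(52, "US West", "$5-10M", "rebalanced into bonds", "optimistic"),
]

question = "Will you keep your parents' financial advisor after inheriting?"
answers = [call_model(persona_prompt(p, question)) for p in panel]
print(answers)
```

Nothing in that loop touches ground truth. The agent's answer is downstream of a description, which is exactly why coherence comes cheap and predictive validity has to be earned by checking outputs against what real users actually do.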

The WSJ piece deserves its own moment here. Before the EY result even lands, the article spends considerable real estate on the rage room, the co-founder's bedroom doubling as a conference room, Diplo investing over Swiss beers at Davos, bar mitzvah money funding a crayon company, and one founder spending a single night at Dartmouth before dropping out. That is founder mythology. It primes the reader to root for these kids before a single methodological question gets asked. By the time the EY validation claim arrives, you are already emotionally inside the story. That framing does real work.

The WSJ piece is part of the problem. Not because it is dishonest, but because it supplies credibility without supplying pressure. It gives readers a polished founder story, cites EY as third-party validation, briefly acknowledges skepticism from Coca-Cola, and moves on without asking what the benchmark standard actually proves, whether survey replication generalizes to behavioral prediction, or how wide the gap is between two highlighted examples and the claim that AI can replace primary research. That structure is how weak validation becomes conventional wisdom. A startup makes a large claim. A financially involved validator endorses it with impressive-sounding numbers. A major publication frames it as a breakthrough. The article gets screenshotted into a deck. Three months later someone in a planning meeting says "well, EY validated this."

None of that means Aaru is useless. A fast synthetic system that mirrors survey response patterns could be genuinely valuable for directional testing, stimulus screening, or early-stage scenario generation. That is a real use case. It is also a much more modest description than "predicts human behavior better than humans can."

The problem is not the tool. The problem is the claim. And the bigger problem is that the claim is now traveling through the industry with WSJ and EY logos attached to it, which means a lot of research teams are about to have a very uncomfortable conversation about why they still need real users.

They do. Two good examples from one survey replication study do not change that.

🎯 If this is the kind of thing you want to read once or twice a week, subscribe to The Voice of User. No fluff. No PR. Just the things the rest of the field is too polite to say.