12 Jun 2026 25 min read UX Research

So You Want to Build Digital Twins/Synthetic Users From Your Own Data? Here’s the In-House Blueprint

Before my inbox explodes with "well, actually" emails: a quick note on what this article deliberately is, and isn't

Yes, this is deliberately the entry-level version. Stripped down, though, not dumbed down. The goal is to get you from zero to a working, validated pipeline. Every architectural choice below has a more sophisticated variant, and the five simplifications that actually matter, along with the upgrades that fix them, are catalogued in a dedicated section near the end. So kindly check there before drafting that email.

And if you want to go deeper into the more advanced ways of doing this, I'm genuinely happy to discuss. Just email me. Once you've actually built one. I will ask.

I don't usually write technical articles. But over the past couple of months I've had the same conversation more times than I can count: someone has read the headlines about digital twins, they're curious, maybe a little skeptical, and they want to try it themselves. And every one of those conversations ends the same way: "okay, but how do I actually build one?"

So here it is. A complete, in-house blueprint, written for someone starting from zero. What the architecture is, what data to use, how to write the code, what questions you can legitimately ask, how to measure whether the answers mean anything, and, just as important, which parts of this are genuinely unsolved and open to your own experimentation. I've built these myself, and the most useful things I know came from the parts that broke, so I'll flag the traps as we go.

One paragraph of context and then we get to work. Two serious academic efforts landed recently. I wrote about these before; you can read the article here. A Stanford team built twins from AI-conducted interviews with over a thousand people and reported that the twins predict individuals' survey answers about 85% as accurately as those people replicate their own answers two weeks later. A Columbia-led mega-study built twins from 500 survey questions per person, threw 19 pre-registered studies at them, and found an average human-twin correlation of 0.20, memorably calling the technology a funhouse mirror. Both results are real, they're measuring different things (we'll get to exactly why in the metrics section, because it's the single most important interpretive point in this entire space), and, crucially for you, both teams used the same underlying architecture. That architecture is what we're going to build.

The architecture, from zero

Strip away the branding and a digital twin is one thing: a system prompt.

You take everything you know about a person, render it as text, place it in a large language model's context window, and instruct the model to answer new questions as that person would. That's the whole trick. There is no model training. There is no neural copy of anyone's mind. Both academic teams tested fine-tuning as an alternative, and both found it didn't help; in one case the fine-tuned model performed worse than plain prompting. As of today, nothing more sophisticated than prompting has actually been validated by published research. Which means the barrier to entry isn't a research lab: it's an API key, some Python, and a few days of data plumbing to get something semi-working. The part that takes real time, as you'll see, is proving that the thing you built deserves to be trusted.

The full system has six components, and the rest of this article walks through each:

Grounding data. The raw material about each person: what they've said, answered, or done.
A profile builder. Code that translates each person's data into a plain-English portrait.
A system prompt. The instructions that wrap the portrait and tell the model how to behave.
A scenario runner. Code that sends the same question to every twin in your panel and collects the answers.
A results table. One row per twin per question, carrying segment labels so you can slice the answers.
A validation harness. The part everyone skips and the part that determines whether you've built a research instrument or a very confident random number generator.

To make this concrete, I'll use a running example throughout: imagine you work on a B2B SaaS product, something like a project management or analytics tool, and you want a panel of synthetic users grounded in your product's usage data. Everything generalizes; swap in your own domain as you read.

The first hard question: what data do you feed the twins?

Before you write a line of code, you face the question that the field has not answered, and that your own experimentation will partly be about: what goes in the context window?

There are three families of grounding data, and they have very different properties.

Option A: In-depth interviews. This is the Stanford approach: a long, semi-structured interview covering someone's life, work, habits, and views, transcribed and injected wholesale. Interviews are dense; a single open answer like "since my back injury I can only work part-time" simultaneously encodes health, employment, and outlook, things that would take a dozen closed survey items to capture. The Stanford team even built an AI interviewer to conduct these at scale, and found AI-conducted interviews produced twins as good as human-conducted ones. The downside is obvious: collecting two hours of interview per person is a research program in itself.

Option B: Structured survey batteries. The Columbia approach: hundreds of closed-form questions per person covering demographics, personality inventories, preferences, and cognitive measures, with the full question-and-answer log placed in context. Cheaper per data point, standardized, comparable across people. The evidence says survey-grounded twins perform roughly on par with interview-grounded ones on held-out survey prediction, and combining both sources adds a little more.

Option C: Behavioral and telemetry data. This is the option neither paper studied, and the one most in-house teams will actually use, because you already have it: product analytics, usage logs, transaction history, support records. In our SaaS example: how often each user logs in, which features they touch, their plan tier, their team size, how their usage has trended, whether they've contacted support. Behavior has a great property and a dangerous one. The great property: it's what people did, not what they claim. The dangerous one: it contains no attitudes at all, which means every opinion your twin expresses is the model's inference from behavior, not anything the person ever said. A twin built this way is a plausible person, not this person, and you need to hold that distinction in your head every time you read its output.

So which do you choose? The honest answer is that nobody knows the optimal recipe, and this is the first genuinely open experimental question. A few findings from the literature should calibrate your instincts, though:

More data shows sharply diminishing returns. The Stanford team randomly deleted 80% of each interview transcript and accuracy barely moved (from 0.85 to 0.79 on their normalized metric). Columbia found that a short statistical summary of each person performed almost identically to the full 30,000-token log.
The information matters more than the voice. Converting raw transcripts into dry bullet-point summaries cost almost nothing in accuracy. You don't need the person's prose style; you need their facts.
There may be a ceiling in the data itself. Columbia ran a sobering comparison: a traditional machine-learning model trained directly on hundreds of real humans' answers to the target question, using the same rich persona data as features, still topped out below r = 0.29. If the richest grounding dataset publicly available can't push a supervised model past that, the bottleneck isn't only the LLM. Some portion of human response is simply not predictable from any reasonable file about a person, and your expectations should price that in.

My practical recommendation for an in-house experiment: start with whatever you already have (usually behavioral data), get the pipeline working end to end, and treat enrichment as a controlled experiment. Add one new data source at a time (a past survey response, a support transcript, an onboarding questionnaire) and measure whether validation scores actually move. "Is more data better?" is not a question to settle by intuition. It's an A/B test you run on your own pipeline, and the published evidence suggests you should expect the answer to be "less than you'd think."

Step 1: Assemble and clean the grounding dataset

Pick fields using three principles:

Discriminative over descriptive. A field is useful if it separates people. A field where every user has the same value teaches the model nothing. In the SaaS example, good candidates are sessions per week, features adopted, seat count growth, days since last active session, plan tier, and tenure.

State and trajectory. A snapshot says what someone is; a trend says where they're going. "Logs in twice a week" and "logged in twice a week, down from daily three months ago" describe two very different people, and the second is the one your retention team cares about.

Intent and action, separately. Wherever your data lets you distinguish what people explored from what they committed to (pages viewed vs. features adopted, docs read vs. configurations saved, trials started vs. plans purchased), keep both. The gap between intent and action is a personality trait, and twins grounded in both produce noticeably more textured responses than twins grounded in either alone.
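The first two principles can be checked mechanically before a field goes anywhere near a profile. What follows is a minimal sketch, assuming per-user values are already loaded into plain Python lists. The function names and the 0.95 cutoff are illustrative assumptions, not something from either paper.

```python
from collections import Counter

def is_discriminative(values, max_share=0.95):
    """True if no single value covers more than max_share of users.
    The 0.95 cutoff is an arbitrary starting point; tune it on your data."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    top_count = Counter(present).most_common(1)[0][1]
    return top_count / len(present) <= max_share

def trend_pct(recent, prior):
    """Percent change from a prior window to a recent window of equal length.
    Returns None when the prior window is missing or zero (no baseline)."""
    if not prior:
        return None
    return round(100.0 * (recent - prior) / prior, 1)

plan_currency = ["USD"] * 100                  # identical for everyone
sessions_per_week = [0, 1, 2, 2, 5, 7, 9, 12]  # spreads users apart

print(is_discriminative(plan_currency))      # False
print(is_discriminative(sessions_per_week))  # True
print(trend_pct(recent=12, prior=20))        # -40.0
```

A field that fails the first check can simply be dropped from the profile. The second helper turns two snapshots into the "down 40% versus three months ago" kind of statement that the trajectory principle asks for.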

And one unglamorous warning before any of that data touches your code: warehouse exports lie. Null values arrive as sentinel strings, columns are mislabeled, units are wrong. Route every field through a defensive coercion function:

def coerce(val, typ=float, default=None):
    """Defensive conversion. Warehouse exports encode nulls as
    sentinel strings, and column names lie about units. Trust nothing."""
    if val in ("", "NULL", r"\N", "NaN", None):
        return default
    try:
        return typ(val)
    except (ValueError, TypeError):
        return default

This looks trivial. It is not optional. One mislabeled unit and your twin of a perfectly normal user will sincerely believe something absurd about their own life, and every downstream answer will be quietly poisoned.

Step 2: Compute segment labels, and keep them away from the model

Compute a handful of summary labels per user, things like engagement level, account maturity, or expansion potential:

def engagement_level(sessions_7d, sessions_30d, sessions_365d) -> str:
    if coerce(sessions_7d, int, 0) >= 5:
        return "super-user"
    if coerce(sessions_30d, int, 0) >= 8:
        return "regular"
    if coerce(sessions_365d, int, 0) >= 20:
        return "occasional"
    return "dormant"

Here's the design decision that matters: these labels are never sent to the model. They exist purely as columns in your output table, so that after a run you can ask "did dormant users answer differently than super-users?" If you put your own segment labels in the prompt, you're no longer simulating a user; you're asking the model to role-play your segmentation deck, and it will obligingly confirm whatever the deck already believes. The model gets raw behavior and forms its own impression. You keep the labels for slicing.
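One way to enforce that separation is in the data structure itself. The sketch below assumes each twin is a plain dict with the keys "id", "profile", and "segments", which are the keys the scenario runner in Step 5 reads. The helper names are hypothetical, and the leak check is a crude substring match that will sometimes flag innocent words.

```python
def make_twin(user_id, profile_text, segments):
    """Bundle one twin. Only 'profile' ever reaches the model;
    'segments' rides along to the results table for slicing."""
    return {"id": user_id, "profile": profile_text, "segments": dict(segments)}

def leaked_labels(twin):
    """Return segment label values that appear verbatim in the profile text.
    Crude on purpose: a substring match, so expect occasional false alarms."""
    text = twin["profile"].lower()
    return [v for v in twin["segments"].values() if str(v).lower() in text]

twin = make_twin(
    "t_042",
    "3 sessions in the last 7 days, 14 in the last 30; last session 2 days ago.",
    {"engagement_level": "regular", "account_trend": "growing"},
)
print(leaked_labels(twin))  # []
```

Running the leak check over the whole panel before a scenario run is cheap insurance against a label slipping into a profile template during a later refactor.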

Step 3: Translate each person's data into a plain-English profile

The model doesn't want your CSV row. It wants a portrait. Write a function that renders each user's data as natural language:

Usage: 3 sessions last 7d, 14 last 30d; active for 26 months;
  last session 2 days ago
Feature adoption: uses dashboards and scheduled reports weekly; has
  never opened the API or automation features
Trajectory: weekly sessions down ~40% versus three months ago
Account: mid-tier plan, 12 seats, seat count flat for a year
Support: 2 tickets in the past quarter, both about export limits
Tenure context: joined during a promotion period; has never upgraded

Notice what this is doing: it's not dumping numbers, it's narrating a person. The trajectory line and the "has never opened" line carry as much characterization as all the counts combined.
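A profile builder along these lines might look like the following. It is a minimal sketch: every field name in the input dict is hypothetical and should be mapped to your own schema, and it covers only four of the lines in the example portrait.

```python
def build_profile(u):
    """Render one user's cleaned fields as a short plain-English portrait.
    All keys read from `u` are assumed names, not a fixed schema."""
    lines = [
        f"Usage: {u['sessions_7d']} sessions last 7d, "
        f"{u['sessions_30d']} last 30d; active for {u['tenure_months']} months; "
        f"last session {u['days_since_last']} days ago"
    ]
    used = ", ".join(u.get("features_used", [])) or "no features"
    adoption = f"Feature adoption: uses {used}"
    unused = ", ".join(u.get("features_never_opened", []))
    if unused:
        adoption += f"; has never opened {unused}"  # absence is characterization too
    lines.append(adoption)
    change = u.get("sessions_change_pct")
    if change:  # skip the line when the trend is missing or exactly zero
        direction = "down" if change < 0 else "up"
        lines.append(f"Trajectory: weekly sessions {direction} ~{abs(change):.0f}% "
                     "versus three months ago")
    lines.append(f"Account: {u['plan']} plan, {u['seats']} seats")
    return "\n".join(lines)

user = {
    "sessions_7d": 3, "sessions_30d": 14, "tenure_months": 26, "days_since_last": 2,
    "features_used": ["dashboards", "scheduled reports"],
    "features_never_opened": ["the API", "automation"],
    "sessions_change_pct": -40.0, "plan": "mid-tier", "seats": 12,
}
print(build_profile(user))
```

The function deliberately takes raw counts and never a segment label, so the portrait stays a description of behavior rather than a restatement of your segmentation.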

One technique worth experimenting with, which I haven't seen in the published work: relative positioning. After the individual profile, append a short block telling the model where this person sits compared to similar users:

Context (compared to other mid-tier accounts of similar tenure):
  - Sessions/month: you 14 vs. cohort average 22 (below average)
  - Features adopted: you 4 vs. average 7 (below average)
  - Support tickets: you 2/quarter vs. average 0.5 (above average)

The reasoning: an LLM's prior about "a typical user" comes from its training data, not from your product, and that prior pulls every twin toward the same generic middle (this homogenization is one of the best-documented failure modes in the literature). Telling the model explicitly that this user is below their cohort on engagement and above it on friction gives it real grounds for calibrating how satisfied or at-risk the twin should sound, instead of guessing. It costs you a two-pass build (compute everyone's stats first, then cohort averages, then write profiles), and in my experience it noticeably changes the texture of the responses. Whether it improves accuracy is exactly the kind of thing you should test rather than assume.
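The two-pass build can be sketched as follows, assuming users are dicts with a cohort key and numeric metrics. The metric names, the choice of plan tier as the only cohort dimension, and the 10% band for "about average" are all assumptions made for illustration. Note that each user is included in their own cohort average here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric fields mapped to the labels shown to the model.
METRICS = {"sessions_30d": "Sessions/month", "features_adopted": "Features adopted"}

def cohort_averages(users, cohort_key="plan"):
    """Pass 1: average each metric within each cohort."""
    groups = defaultdict(list)
    for u in users:
        groups[u[cohort_key]].append(u)
    return {c: {m: mean(x[m] for x in members) for m in METRICS}
            for c, members in groups.items()}

def context_block(u, averages, cohort_key="plan", tolerance=0.10):
    """Pass 2: describe where one user sits relative to their cohort."""
    avg = averages[u[cohort_key]]
    out = [f"Context (compared to other {u[cohort_key]} accounts):"]
    for m, label in METRICS.items():
        if u[m] > avg[m] * (1 + tolerance):
            where = "above average"
        elif u[m] < avg[m] * (1 - tolerance):
            where = "below average"
        else:
            where = "about average"
        out.append(f"  - {label}: you {u[m]} vs. cohort average {avg[m]:.1f} ({where})")
    return "\n".join(out)

users = [
    {"id": "t_042", "plan": "mid-tier", "sessions_30d": 14, "features_adopted": 4},
    {"id": "t_017", "plan": "mid-tier", "sessions_30d": 30, "features_adopted": 10},
]
averages = cohort_averages(users)
print(context_block(users[0], averages))
```

Append the returned block to the individual profile before it goes into the system prompt. With very small cohorts the averages are noisy, so a minimum cohort size is worth adding before relying on the comparison.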

Step 4: The system prompt

The template that wraps every profile:

SYSTEM_PROMPT = """You are a digital twin of a real user of a software
product, simulating how that user would respond to questions and
product decisions based on their actual usage data. Do not break
character. Respond as a real person would: in first person, in
natural language. Your opinions and reactions must be consistent
with the profile below.

## Your Profile

{profile}

## Instructions

- Respond as this specific user, informed by the data above.
- Be realistic: if the data shows you barely use the product, you
  would not be enthusiastic about paying more for it.
- If asked about something your profile doesn't cover, give a
  plausible answer consistent with the overall picture, and it is
  fine to be uncertain or indifferent.
- Keep responses concise and natural. You are a user, not an analyst."""

Three lines in there earn their keep disproportionately. "Be realistic" plus a concrete example anchors the model against its default agreeableness. "It is fine to be uncertain or indifferent" fights the model's instinct to always have a well-formed opinion (real users frequently don't care, and a panel where everyone cares is already wrong). And "you are a user, not an analyst" is load-bearing: without it, every twin answers like a consultant who happens to use your product, complete with bullet points and a recommendations section.

One flag to plant now and return to later: this prompt asks the model to role-play the user. There is an alternative framing, asking the model to predict the user from the outside, and the difference between the two is more consequential than it looks. We'll take it apart properly in the upgrades section near the end.

Step 5: The scenario runner

A scenario run fans the same question out to every twin in parallel and collects the answers into one table:

from concurrent.futures import ThreadPoolExecutor, as_completed
import csv

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def ask_twin(twin: dict, scenario: str) -> dict:
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500,
            system=SYSTEM_PROMPT.format(profile=twin["profile"]),
            messages=[{"role": "user", "content": scenario}],
        )
        response, error = msg.content[0].text, ""
    except Exception as e:
        response, error = "", str(e)
    return {"twin_id": twin["id"], **twin["segments"],
            "response": response, "error": error}

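def run_scenario(twins: list, scenario: str, outfile: str, workers: int = 5):
    if not twins:
        raise ValueError("run_scenario needs at least one twin")
    rows = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(ask_twin, t, scenario) for t in twins]
        for f in as_completed(futures):    # rows arrive in completion order
            rows.append(f.result())
    rows.sort(key=lambda r: r["twin_id"])  # keep the output file reproducible
    with open(outfile, "w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        w.writeheader()
        w.writerows(rows)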

A panel of fifty to a hundred twins answers in a couple of minutes. Here's what actually comes back, so you know what you're building toward. Say the scenario is qualitative: "We're considering raising the price of your plan by 15%. How would you react?" The results file looks like this (responses trimmed for the page):

twin_id	engagement_level	account_trend	tenure	response
t_017	super-user	growing	31 mo	"Honestly? I wouldn't love it, but I'd pay. This thing runs half my week at this point. I'd grumble in the renewal email and then renew."
t_042	regular	flat	14 mo	"Depends what I'm getting for it. If it's 15% more for the same product, that's annoying. I'd probably start paying attention to what else is out there, even if I don't switch right away."
t_008	occasional	declining	26 mo	"That would probably be the push I need to finally look at whether I'm using this enough to justify it. I've been on the fence for a while, honestly. A price bump decides it for me, and not in your favor."
t_033	dormant	declining	9 mo	"I'd cancel. I barely open it as it is. I keep meaning to set it up properly and never do, so paying more for something I don't use makes no sense."
t_051	regular	flat	22 mo	"" (error: timeout after 120s)

A few things to notice in even this small slice. The responses are differentiated, and differentiated in the right direction: the growing super-user grumbles and stays, the declining accounts treat the price change as a decision trigger. That's the behavioral grounding doing its job. The language is natural and harvestable ("the push I need to finally look at whether I'm using this enough" is exactly the kind of phrasing you'd want in a churn survey or a retention email test). And one row failed with a timeout, which is why the runner records errors per row instead of crashing: at panel scale, some calls will always fail, and you want to know which twins are missing from an analysis rather than silently losing them.
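Because failures are recorded per row, re-asking only the missing twins is straightforward. The sketch below assumes rows shaped like the runner's output and takes the asking function as a parameter, so it can be exercised without an API call. The function name, the attempt count, and the stand-in `fake_ask` are hypothetical.

```python
import time

def retry_failed(rows, twins_by_id, ask, scenario, attempts=2, pause_s=0.0):
    """Re-ask only the twins whose row carries an error.
    `ask` must accept (twin, scenario) and return a row dict, like ask_twin.
    Returns the merged rows plus the ids that still failed."""
    rows = list(rows)
    for _ in range(attempts):
        failed = [i for i, r in enumerate(rows) if r["error"]]
        if not failed:
            break
        time.sleep(pause_s)  # give a rate limit or an outage a moment to clear
        for i in failed:
            rows[i] = ask(twins_by_id[rows[i]["twin_id"]], scenario)
    still_failed = [r["twin_id"] for r in rows if r["error"]]
    return rows, still_failed

# Stand-in for the real API call, used only to demonstrate the flow.
def fake_ask(twin, scenario):
    return {"twin_id": twin["id"], "response": "I'd think about it.", "error": ""}

rows = [
    {"twin_id": "t_017", "response": "I'd pay.", "error": ""},
    {"twin_id": "t_051", "response": "", "error": "timeout after 120s"},
]
twins_by_id = {"t_017": {"id": "t_017"}, "t_051": {"id": "t_051"}}
merged, missing = retry_failed(rows, twins_by_id, fake_ask, "price +15%?")
print(missing)  # []
```

Whatever is left in the second return value should be reported alongside the analysis, so that a segment mean quietly computed on fewer twins than expected does not go unnoticed.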

Now the same panel on a quantitative version of the question: "On a scale of 1 to 5, how acceptable would a 15% price increase be to you? Give me only the number, then one sentence."

twin_id	engagement_level	account_trend	response
t_017	super-user	growing	"4. It would sting a little but the tool earns it."
t_042	regular	flat	"3. Tolerable, but it would make me comparison shop for the first time."
t_008	occasional	declining	"2. I'm not using it enough to absorb an increase."
t_033	dormant	declining	"1. I'd cancel rather than pay more for something I don't open."

Notice how cleanly the second run parses compared to the first: because the prompt demanded "only the number, then one sentence," the rating extracts with one regex and the explanation is still there when you want to read it. That discipline is the first of two craft rules covered just below. And because every output row carries the segment labels from Step 2, analysis is a one-liner:

import pandas as pd
df = pd.read_csv("results.csv")

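# Quantitative question: take the leading 1-5 rating, compare across segments.
# Anchoring at the start avoids grabbing the "15" from a stray "15% more".
df["rating"] = df["response"].str.extract(r"^\s*([1-5])\b", expand=False).astype(float)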
print(df.groupby("engagement_level")["rating"].agg(["mean", "std", "count"]))

Which, on a full panel, prints something like:

                  mean   std  count
engagement_level
dormant           1.62  0.74     13
occasional        2.41  0.85     17
regular           3.05  0.78     11
super-user        3.89  0.62      9

And there's your first synthetic finding: acceptability of the increase climbs cleanly with engagement, the dormant segment is a cancellation risk almost to a twin, and the standard deviations are suspiciously tidy, which should already have you thinking about the dispersion check coming in the validation section. One run, two minutes, and you have a directional read plus a reminder of why you don't stop here.

Two craft rules for the questions themselves. For quantitative questions, be brutally explicit about format: "give me only a number from 1 to 5, then one sentence of explanation." The model will comply, and your parsing stays trivial; leave it open and you'll be regexing through three paragraphs of hedging. For qualitative questions, ask exactly what you'd ask a human ("walk me through how you decide whether a new feature is worth your time") and analyze the responses the way you'd analyze open-ends: theme them, slice them by segment, and pay attention to the language, because harvesting the vocabulary and objections of a segment before you write a survey or a campaign is one of the highest-value uses of the whole exercise.

The second hard question: what can you legitimately ask?

This is the other genuinely open problem, and it deserves more thought than it usually gets. A twin's answer is only as licensed as its grounding data. Everything beyond the data is the base model improvising in character. So before fielding any question, run it through a simple coverage rubric:

Green: directly evidenced. The question maps onto fields the twin actually has. "How would you feel if the export limit you've hit twice this quarter were doubled?" Our example twin has direct evidence here. Answers in this zone are the twin's strongest output.

Yellow: plausibly inferable. The question requires reasoning one step beyond the data. "Would you pay 20% more for the plan you're on?" The twin knows its usage intensity and trajectory, and a declining, low-adoption user should balk. The model's inference will usually be directionally sensible, but you are now reading reasoning the person never expressed. Treat these answers as hypotheses, not findings.

Red: unevidenced. The question touches attitudes, identity, emotion, or context the data simply doesn't contain. "How does your team's culture affect how you use the tool?" "Do you trust our company with your data?" The twin will answer fluently, because the model always answers fluently, and the answer will be a statistically plausible opinion for someone with that profile, drawn from the model's prior. It is not information about your user. The fluency is the trap: red-zone answers read exactly as convincingly as green-zone answers, and nothing in the output marks the difference. You have to carry the rubric in your own head.

This rubric also answers "what data should I add next?" Work backwards from the questions your stakeholders actually want answered. If the questions are mostly red against your current data, that's your collection roadmap: a short attitudinal survey, an NPS verbatim, an onboarding questionnaire, all of it shifts questions from red toward green.
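The bookkeeping half of the rubric can be written down, even though the judgment half cannot. In this sketch you declare, per question, which fields would evidence the answer directly and which would support a one-step inference. The function only compares those lists against what the twins actually have. All field names and the three example questions' mappings are illustrative assumptions.

```python
def coverage_zone(needs, infer_from, available):
    """Classify one question against the fields a twin actually has.
    needs: fields that would evidence the answer directly.
    infer_from: fields that would let the model reason one step to an answer.
    Both lists are your own judgment calls; this only does the bookkeeping."""
    available = set(available)
    if needs and set(needs) <= available:
        return "green"
    if infer_from and set(infer_from) <= available:
        return "yellow"
    return "red"

TWIN_FIELDS = {"sessions_30d", "sessions_change_pct", "plan", "seats",
               "support_ticket_topics", "features_adopted"}

QUESTIONS = [
    ("Reaction to doubling the export limit", ["support_ticket_topics"], []),
    ("Would you pay 20% more for your plan?", ["stated_willingness_to_pay"],
     ["sessions_30d", "sessions_change_pct", "plan"]),
    ("Do you trust our company with your data?", ["stated_trust"], []),
]

for text, needs, infer_from in QUESTIONS:
    print(coverage_zone(needs, infer_from, TWIN_FIELDS), "-", text)
```

Storing the zone as a column next to each question in the results table keeps the distinction visible at analysis time, when a fluent red-zone answer is otherwise indistinguishable from a green one.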

Step 6: Validation, or you're doing improv with extra steps

Everything up to here takes a few days. This section is the actual project. Without it you have a fluent system with unknown error, which is worse than no system, because it generates confidence. I'll give you three levels, cheapest first.

Level 1: Internal consistency

The minimum bar: does each twin's answer at least point in the same direction as its own data? You encode your expectations as explicit rules:

def expected_stance(u) -> str:
    """What should this user plausibly feel about a price increase?"""
    if u["engagement"] == "dormant":
        return "negative"            # barely uses it, won't pay more
    if u["engagement"] == "super-user" and u["trend"] == "growing":
        return "tolerant"            # getting clear value
    if u["trend"] == "declining":
        return "negative"
    return "ambivalent"

Then score each twin's actual response (a keyword classifier, or better, a second LLM call asking "is this response positive, negative, or ambivalent about the price change?"), compare against expectation, and compute a weighted accuracy where adjacent stances get half credit:

def consistency(actual, expected) -> float:
    if actual == expected:
        return 1.0
    adjacent = {("negative", "ambivalent"), ("ambivalent", "negative"),
                ("tolerant", "ambivalent"), ("ambivalent", "tolerant")}
    return 0.5 if (actual, expected) in adjacent else 0.0

# pairs: list of (actual_stance, expected_stance) tuples, one per twin
score = sum(consistency(a, e) for a, e in pairs) / len(pairs)

What you're looking for at this level is the absence of hard contradictions: no dormant, disengaged twin professing love for the product, no declining account volunteering to pay more. A well-built pipeline should achieve this almost completely, which is exactly why passing Level 1 means very little on its own. It's a floor, not a result.

Two traps at this level that you will hit, so hear them now. First: your evaluation harness needs evaluating too. A keyword sentiment classifier built for prose will score the bare answer "3" as nothing at all, and a perfectly well-behaved numeric run will grade out at 0%. When a validation run returns 0% or 100%, suspect the ruler before the thing being measured. Second: if you compare twin responses against existing research findings using naive keyword matching, you will manufacture false confirmations at industrial scale, because when the question is about pricing, the word "pricing" appears in every response and matches every claim. Validation theater is worse than no validation; it launders the model's prior into the costume of evidence.
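The first trap has a cheap guard. The sketch below maps a "number, then one sentence" answer onto a stance so numeric runs are not fed to a prose classifier, and flags results that look more like a broken harness than a finding. The cut points (1-2 negative, 3 ambivalent, 4-5 tolerant) and both function names are assumptions.

```python
import re

def stance_from_rating(response):
    """Map a '1-5, then one sentence' answer to a stance label.
    Cut points are an assumption: 1-2 negative, 3 ambivalent, 4-5 tolerant.
    Returns None when the response does not start with a single 1-5 digit."""
    m = re.match(r"\s*([1-5])\b", response)
    if not m:
        return None
    n = int(m.group(1))
    if n <= 2:
        return "negative"
    return "ambivalent" if n == 3 else "tolerant"

def ruler_warnings(score, n_scored, n_total):
    """Flag results that more likely indicate a broken harness than a finding."""
    if n_total == 0 or n_scored == 0:
        return ["nothing was scored: check the classifier and the parser"]
    warnings = []
    if score in (0.0, 1.0):
        warnings.append(f"score is exactly {score:.0%}: suspect the ruler first")
    if n_scored < n_total:
        warnings.append(f"{n_total - n_scored} of {n_total} responses could not be scored")
    return warnings

print(stance_from_rating("3. Tolerable, but I'd comparison shop."))  # ambivalent
print(stance_from_rating("I'd cancel."))                             # None
print(ruler_warnings(score=0.0, n_scored=50, n_total=50))
```

Counting unscored responses separately matters as much as the score itself: a classifier that silently returns nothing for half the panel produces a number that looks like a result and is not one.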

Level 2: Ground truth from real humans

Internal consistency is necessary, not sufficient. A twin can be directionally coherent and still systematically wrong. Level 2 compares twin answers against something real people actually said. The cheapest version for an in-house team: take a past survey you've already run, one with results broken down by segment, put the identical questions (same wording, same scale) to your twins, and compare the distributions segment by segment:

# real_pct and twin_pct: % choosing top-2-box per segment, aligned
mae = (real_pct - twin_pct).abs().mean()

As working thresholds: a mean absolute error under roughly 10 percentage points means the twins are reproducing real attitudinal patterns in your population; over roughly 25 points means your grounding data isn't carrying attitudinal weight and needs enrichment before the panel is trustworthy on questions of that kind. In between, you have a directional instrument: trust the ordering of segments, not the levels.
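Put together, the comparison and the thresholds fit in one small function. This is a sketch assuming both inputs are pandas Series of top-2-box percentages indexed by segment. The percentages in the example are invented purely to show the mechanics, and the rank-agreement figure is an added convenience for the "trust the ordering" case rather than part of the thresholds above.

```python
import pandas as pd

def level2_report(real_pct, twin_pct, good=10.0, bad=25.0):
    """Compare top-2-box percentages per segment (both Series indexed by segment)."""
    real_pct, twin_pct = real_pct.align(twin_pct, join="inner")  # shared segments only
    mae = float((real_pct - twin_pct).abs().mean())
    # Rank agreement: Pearson correlation of the ranks, i.e. Spearman's rho.
    rank_r = float(real_pct.rank().corr(twin_pct.rank()))
    if mae < good:
        verdict = "reproducing"
    elif mae > bad:
        verdict = "needs enrichment"
    else:
        verdict = "directional"
    return {"mae": mae, "rank_agreement": rank_r, "verdict": verdict}

# Made-up numbers for illustration only.
real = pd.Series({"dormant": 12.0, "occasional": 28.0, "regular": 47.0, "super-user": 71.0})
twin = pd.Series({"dormant": 22.0, "occasional": 40.0, "regular": 61.0, "super-user": 91.0})
print(level2_report(real, twin))
```

In this toy case the levels are off by 14 points on average while the segment ordering matches perfectly, which is the "directional instrument" situation: usable for ranking segments, not for quoting percentages.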

If you have no past survey, run a small one. Fifty real respondents answering five questions is a modest cost, and it converts your entire twin program from faith to measurement. It is the single best research investment in this whole blueprint.

Level 3: The metrics from the literature, and how to read them without fooling yourself

Whatever else you take from the academic work, take these four habits. They are cheap, and each one catches a specific way of lying to yourself.

1. Always run the baseline ladder. Score your twins, then score three dumber alternatives on the same questions: (a) random responses, (b) an "empty persona," meaning the bare model with no profile at all, and (c) a demographics-only persona (just age, role, region, plan tier). This ladder is devastatingly clarifying. In the Columbia study, random responses scored 0.63 on their accuracy metric, the empty persona 0.73, demographics-only 0.75, and the full data-rich twins 0.75. Read that again: on individual-level accuracy, all the rich personal data added approximately nothing over a generic stereotype. The full twins did meaningfully beat demographics on correlation (0.20 vs. 0.15), which is a different and more defensible claim. The ladder is what separates "our twins are 75% accurate" (meaningless) from "our twins beat a demographic stereotype by X on metric Y" (a finding). If your full twin doesn't clearly beat the demographics-only persona, you have built an expensive stereotype generator, and you should know that before anyone makes a decision with it.

2. Know which correlation you're looking at. This is the gotcha that reconciles the 85% paper with the funhouse-mirror paper, and it's the question to ask of every twin validation you ever see. You can correlate across questions within a person ("is this a coherent portrait of this individual?") or across people within a question ("can the twins tell me who in my population leans which way?"). The first produces flattering numbers. The second is the one that matters for almost every applied question you have (who will churn, who will upgrade, which segment will revolt), and it's the one that currently comes out around r = 0.20 on novel questions. When someone shows you a correlation, ask: across what?

3. Check the dispersion. Compare the standard deviation of twin answers to the standard deviation of human answers on the same question (or, lacking humans, just look at whether your panel ever disagrees with itself). The documented pattern is severe under-dispersion: in the mega-study, twins were less varied than humans on 154 of 164 outcomes. The model's prior pulls every twin toward the same reasonable middle. Real humans have stronger feelings and weirder distributions. A quick smell test on any run: if a hundred twins produce ninety-five ambivalent answers, you don't have a panel, you have one model wearing a hundred hats.

4. Use novel stimuli. Language models have read the literature. The mega-study found twins faithfully reproduced a famous, textbook behavioral effect but completely missed a lightly modified version of the same effect; their human participants did the opposite. If you validate on well-known questions, famous scale items, or scenarios that resemble published studies, you are partly grading the model's recall, not its simulation. Write fresh scenarios, fresh wordings, fresh stimuli.
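The "across what?" question in habit 2 is easiest to see on a toy matrix. The sketch below assumes answers are stored as a people-by-questions array for humans and another for twins. The numbers are invented so that the questions have very different typical answers, which is the situation where the two correlations diverge. The function names are hypothetical.

```python
import numpy as np

def within_person_r(human, twin):
    """Mean correlation across questions, computed separately per person (rows)."""
    return float(np.mean([np.corrcoef(h, t)[0, 1] for h, t in zip(human, twin)]))

def across_people_r(human, twin):
    """Mean correlation across people, computed separately per question (columns)."""
    return float(np.mean([np.corrcoef(h, t)[0, 1] for h, t in zip(human.T, twin.T)]))

# Rows are people, columns are questions. Toy data: question 1 is generally
# rated low, question 3 generally high, and the twins know that much.
human = np.array([[1, 3, 5],
                  [2, 3, 4],
                  [1, 4, 5],
                  [2, 2, 4]])
twin = np.array([[2, 3, 4],
                 [1, 3, 5],
                 [2, 2, 5],
                 [1, 4, 4]])

print(round(within_person_r(human, twin), 2))
print(round(across_people_r(human, twin), 2))
```

On this data the within-person figure is high because the twins get the general shape of the questionnaire right, while the across-people figure is not even positive because they cannot tell which person leans which way on any single question. Both functions return NaN-contaminated results if any row or column has zero variance, so check for constant answers first.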

Interpreting the results: what twins are for

After the building and the validating, here is the honest scorecard the evidence supports.

Twins are credible for relative, directional questions. Which segment will react worst to a change. Which of six concepts dies first. What objections a declining-usage cohort raises, in what words, before you draft the messaging. Whether a survey question is confusing before you field it to real people. Whether an experimental design produces any signal at all before you spend the budget. In all of these you're using the panel as a fast, cheap sorting and pressure-testing instrument, which is precisely where the published evidence finds real signal.

Twins are not credible for absolute numbers. Adoption rates, willingness to pay, satisfaction levels: the means will be off, the variance will be compressed, and the errors are systematic rather than random, which is worse. The documented tilts are worth keeping as a literal checklist, because each comes with a one-line test you can run on your own panel:

Under-dispersion: twin answers cluster toward the middle. Test: SD ratio versus humans.
Stereotyping: when twins deviate from the model's default, they deviate toward demographic caricature rather than toward the individual. Test: are full-twin answers closer to demographics-only answers than to real humans' answers?
Representation bias: twins are more accurate for some groups than others (in the published work: more accurate for higher-education, higher-income, politically moderate people). Test: break your accuracy scores out by subgroup. If you're using twins to "hear from" underrepresented users, understand that those are precisely the least faithful twins in your panel.
Ideological tilt: divergences from humans run in consistent directions (more trusting, more pro-technology, less privacy-concerned). Test: check whether twin-human gaps across your outcomes all point the same way.
Hyper-rationality: twins know things and reason in ways your users don't. In one study, twins scored 99.9% on factual knowledge questions where the matched humans scored 52%. Test: include questions with objectively correct answers; if your twins ace them, remember that your customers won't. For any research question where confusion, ignorance, or irrationality is the finding, a twin is structurally incapable of delivering it.
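
Four of the one-line tests in the checklist above can be sketched in plain Python. Treat this as a minimal illustration, not a finished harness: the function names, the toy answer lists, and the subgroup labels are all made up for the example, and it assumes you already have twin answers, demographics-only baseline answers, and real human answers aligned person by person on the same scale.

```python
from statistics import mean, pstdev

def sd_ratio(twin, human):
    """Under-dispersion: twin SD over human SD. Well below 1.0 means compressed variance."""
    return pstdev(twin) / pstdev(human)

def mean_abs_gap(a, b):
    """Average absolute difference between two aligned answer lists."""
    return mean(abs(x - y) for x, y in zip(a, b))

def stereotyping_flag(full_twin, demo_only, human):
    """True when full-twin answers sit closer to the demographics-only answers than to real humans."""
    return mean_abs_gap(full_twin, demo_only) < mean_abs_gap(full_twin, human)

def accuracy_by_subgroup(groups, twin, human):
    """Representation bias: exact-match rate broken out by subgroup label."""
    hits, counts = {}, {}
    for g, t, h in zip(groups, twin, human):
        counts[g] = counts.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + (t == h)
    return {g: hits[g] / counts[g] for g in counts}

def signed_gaps(twin_means, human_means):
    """Ideological tilt: twin mean minus human mean per outcome. Look for a consistent pattern."""
    return {k: twin_means[k] - human_means[k] for k in twin_means}

# Toy 1-5 answers, invented purely to exercise the functions.
human     = [1, 5, 2, 4, 3, 5, 1, 4]
full_twin = [3, 4, 3, 3, 3, 4, 2, 3]
demo_only = [3, 4, 3, 3, 3, 3, 3, 3]
groups    = ["A", "A", "A", "A", "B", "B", "B", "B"]

print("SD ratio:", round(sd_ratio(full_twin, human), 2))
print("Stereotyping flag:", stereotyping_flag(full_twin, demo_only, human))
print("Accuracy by subgroup:", accuracy_by_subgroup(groups, full_twin, human))
```

The hyper-rationality test needs no code beyond a mean: score the objectively answerable questions for twins and for humans and compare. Note that signed_gaps only reports direction per outcome; whether the gaps "point the same way" depends on how each scale is coded, so read the signs against the scale definitions rather than counting pluses and minuses.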

The reframe I'd urge you to internalize, borrowed from the mega-study authors: stop thinking clone, start thinking well-informed advisor with a decent memory of someone's file. An advisor like that is genuinely useful for pressure-testing your thinking and prioritizing your real research. You just wouldn't seat them in a focus group and call them your customer.

Where this blueprint is deliberately simple, and the upgrades that earn their cost

Before the open questions, an honesty check on the blueprint itself. Some of its simplicity is the point: in-context prompting is the validated architecture (fine-tuning lost in both papers), survey-style elicitation doesn't need memory streams or retrieval, and pandas one-liners are the right tool for a first build. But five of the simplifications are substantive, and you should know which ones you're making and what the upgrade looks like.

1. Role-play versus prediction. The system prompt in Step 4 asks the model to be the user: first person, stay in character. The strongest published results came from a different framing: show the model the person's data in third person and ask it to predict how that person would respond. These are not the same task, and the difference isn't cosmetic. Role-play activates the model's improv instincts; it performs a character, and characters have dramatic arcs and consistent personalities in ways real survey respondents don't. Prediction activates something closer to calibration: the model reasons about a person rather than inhabiting one. Role-play gives you richer qualitative texture and language you can harvest; prediction gives you better-behaved quantitative answers. The upgrade: run your quantitative questions through a prediction-framed prompt ("Based on the profile below, predict how this user would answer the following question") and keep role-play for the open-ended work. Better still, run both on the same questions and compare against your Level 2 ground truth. The framing comparison is itself a worthwhile experiment, and almost nobody is running it.
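
Here is one way the two framings might sit side by side in code. It is a sketch under assumptions: the role-play wording is a stand-in for the Step 4 system prompt rather than a copy of it, build_messages is a hypothetical helper, and the list-of-dicts message shape assumes a chat-style API that you may need to adapt to your client.

```python
# Stand-in for the Step 4 role-play prompt (wording is illustrative).
ROLE_PLAY_SYSTEM = (
    "You are the user described in the profile below. "
    "Answer in the first person and stay in character.\n\n{profile}"
)

# Prediction framing: the model reasons about the person in third person.
PREDICTION_SYSTEM = (
    "Based on the profile below, predict how this user would answer "
    "the following question. Refer to the user in the third person.\n\n{profile}"
)

TEMPLATES = {"role_play": ROLE_PLAY_SYSTEM, "prediction": PREDICTION_SYSTEM}

def build_messages(profile, question, framing):
    """Return chat messages for one twin, one question, one framing."""
    if framing not in TEMPLATES:
        raise ValueError(f"unknown framing: {framing!r}")
    return [
        {"role": "system", "content": TEMPLATES[framing].format(profile=profile)},
        {"role": "user", "content": question},
    ]

# Same profile and question, both framings, ready to send and compare.
profile = "Segment: declining usage. Plan: Pro. Last login: 41 days ago."
question = "On a scale of 1 to 5, how likely is this user to renew?"
for framing in TEMPLATES:
    messages = build_messages(profile, question, framing)
    print(framing, "->", messages[0]["content"][:60])
```

Keeping the framing as a parameter means the comparison experiment is one extra loop over the same question set, scored against the same Level 2 ground truth.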

2. The missing reasoning layer. Our pipeline goes straight from profile to answer. The published architectures insert a reasoning step in between, and it carries real weight. One version is forced chain-of-thought at answer time: before committing, the model must describe what kind of person would choose each option, reason about why this person might choose each, and only then answer. Another is pre-computed reflection: before any questions are asked, prompt the model to read the profile through several analytical lenses (in the academic work, four social-science expert personas) and generate explicit inferences about the person, then include the relevant inferences alongside the profile at question time. Notably, when one team fine-tuned a model and had to strip out the reasoning step to do it, performance dropped. The upgrade costs you one extra prompt template and some tokens. If your validation scores plateau, this is the first lever to pull.
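
The answer-time version of that reasoning step could look something like the sketch below. The template wording, the ANSWER: convention, and both function names are my own assumptions for illustration, not the prompts used in the published work.

```python
import re

# Hypothetical template: forces reasoning about each option before the final answer.
REASONING_TEMPLATE = """Profile:
{profile}

Question: {question}
Options: {options}

Before answering:
1. For each option, describe what kind of person would choose it.
2. For each option, reason about why this person might choose it.
3. Then give the final choice on its own last line, formatted as: ANSWER: <option>"""

def build_reasoning_prompt(profile, question, options):
    """Fill the template for one twin and one closed-ended question."""
    return REASONING_TEMPLATE.format(
        profile=profile, question=question, options=", ".join(options)
    )

def parse_answer(reply, options):
    """Return the last ANSWER: line if it names a valid option, else None."""
    found = re.findall(r"^ANSWER:[ \t]*(\S.*?)[ \t]*$", reply, flags=re.MULTILINE)
    if found and found[-1] in options:
        return found[-1]
    return None

options = ["1", "2", "3", "4", "5"]
prompt = build_reasoning_prompt("Heavy mobile user, rarely exports.", "Rate the new export flow.", options)
fake_reply = "A 5 suits power users...\nThis person rarely exports...\nANSWER: 2\n"
print(parse_answer(fake_reply, options))
```

Returning None for anything that does not parse cleanly is deliberate: a silent fallback to a default answer would pull the panel toward the middle and hide the failure.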

3. Level 1 validation is partly circular. The consistency check in Step 6 derives expected stances from the same data we fed the twin. So what it actually verifies is that the model read and respected the prompt, not that the twin resembles a human being. That's still worth checking (a pipeline that fails it is broken), but be precise about the claim it licenses: Level 1 measures prompt adherence. Only Level 2, comparison against real human responses, measures fidelity. If you find yourself reporting Level 1 scores as "accuracy" to stakeholders, you've crossed from validation into theater, and the circularity is exactly the objection a sharp reviewer will raise.

4. The missing denominator: human self-consistency. Our Level 2 compares twins against humans as if humans were a fixed standard. They aren't. People asked the same questions two weeks apart agree with themselves only about 80% of the time on attitude items, and that figure is the true ceiling for any twin. The most important metric idea in the literature is to normalize for it: twin accuracy divided by the human's own test-retest consistency, so that 1.0 means "the twin predicts this person as well as the person predicts themselves." Without the denominator, you'll judge your twins against an implied 100% that no human meets, and conclude they're worse than they are (or, when comparing instruments of different reliability, draw exactly backwards conclusions). The in-house upgrade is cheap: when you run the small ground-truth survey from Level 2, re-survey the same people two weeks later. Thirty respondents answering twice gives you a working denominator.
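
The arithmetic is simple enough to show in full. This sketch pools all answers into one agreement rate, which is a simplification (you may prefer to normalize per person and then average), and the answer lists are toy values invented for the example.

```python
def agreement(a, b):
    """Share of positions where two aligned answer lists match exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("answer lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def normalized_accuracy(twin, human_wave1, human_wave2):
    """Twin accuracy divided by the humans' own test-retest consistency."""
    raw = agreement(twin, human_wave1)             # twin vs. first human wave
    ceiling = agreement(human_wave1, human_wave2)  # humans vs. themselves two weeks later
    if ceiling == 0:
        raise ValueError("humans never agreed with themselves; no usable denominator")
    return raw / ceiling

# Toy data: humans repeat 8 of 10 answers, the twin matches 6 of 10.
human_wave1 = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
human_wave2 = [1, 2, 3, 4, 5, 1, 2, 3, 5, 4]
twin        = [1, 2, 3, 4, 5, 1, 3, 4, 5, 1]

print("raw accuracy:", agreement(twin, human_wave1))
print("human ceiling:", agreement(human_wave1, human_wave2))
print("normalized:", round(normalized_accuracy(twin, human_wave1, human_wave2), 2))
```

In the toy case a raw 60% looks mediocre against an implied 100%, but against a human ceiling of 80% it is three quarters of what the people themselves manage.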

5. No twin stability check. We never tested whether a twin agrees with itself. Ask the same twin the same question twice, and a paraphrased version a third time. If it gives a 3 today and a 5 to a reworded version tomorrow, every cross-segment comparison you run is partly noise, and no amount of human ground truth fixes that. This is the cheapest test in the entire article (a loop and a correlation), it has no human-data requirement at all, and a twin that fails it disqualifies itself before any fidelity question even arises. Run it first.
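
The loop and the correlation, sketched. Here ask stands for whatever function in your pipeline sends one question to one twin and returns a numeric answer; fake_ask is a deterministic stub so the example runs without an API key, and its behavior is invented purely to show the shape of the output.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (nan if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def stability_report(ask, twin_ids, question, paraphrase):
    """Ask each twin the question twice, then a paraphrase once, and compare."""
    first = [ask(t, question) for t in twin_ids]
    repeat = [ask(t, question) for t in twin_ids]
    reworded = [ask(t, paraphrase) for t in twin_ids]
    return {
        "repeat_r": pearson(first, repeat),
        "paraphrase_r": pearson(first, reworded),
        "exact_repeat_rate": sum(a == b for a, b in zip(first, repeat)) / len(first),
    }

QUESTION = "How satisfied are you with onboarding? (1-5)"
PARAPHRASE = "On a 1-5 scale, how happy were you with getting started?"

def fake_ask(twin_id, question):
    """Stub twin: stable on repeats, drifts by one point on the paraphrase for some twins."""
    answer = twin_id % 5 + 1
    if question == PARAPHRASE and twin_id % 3 == 0:
        answer = min(5, answer + 1)
    return answer

report = stability_report(fake_ask, list(range(10)), QUESTION, PARAPHRASE)
print(report)
```

What counts as "stable enough" is a judgment call I won't pretend to settle here. A reasonable starting point is to compare these numbers against the human test-retest figure from your own re-survey, since a twin less stable than the people it models is adding noise of its own.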

None of these upgrades changes the skeleton. They slot into the pipeline you've already built: a second prompt template, an extra pre-processing call, a re-survey, a loop. The reason to know about them now rather than later is that they determine which claims your results can support, and the gap between "my twins are consistent with their own prompts" and "my twins predict my users at 70% of those users' own reliability" is the gap between a demo and an instrument.

The open questions, honestly labeled

I promised to flag what's unsolved, because the gap between this technology's marketing and its evidence lives exactly here, and because these are the questions your own in-house experimentation can actually contribute to.

What data to feed the twins. Interviews, surveys, and behavior have each shown partial signal; the optimal mixture is unknown, and behavioral grounding (the kind most companies have) is the least studied of the three. Run additive experiments: one new data source at a time, measured against your validation suite.
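
An additive experiment is just a ladder: add one source, rerun validation, record the marginal gain. A rough sketch follows, in which score_fn is a hypothetical hook for your own validation suite and the stub's numbers are placeholders with no empirical meaning.

```python
def additive_ladder(sources, score_fn):
    """Add one data source at a time; record the score and marginal gain at each rung."""
    used, rows, previous = [], [], None
    for source in sources:
        used.append(source)
        score = score_fn(tuple(used))  # rebuild twins from these sources, rerun validation
        gain = None if previous is None else score - previous
        rows.append({"sources": tuple(used), "score": score, "gain": gain})
        previous = score
    return rows

# Placeholder scores keyed by number of sources. Made up; replace with real validation runs.
PLACEHOLDER_SCORES = {1: 0.50, 2: 0.55, 3: 0.56}

def fake_score(sources):
    """Stub standing in for a full build-and-validate cycle."""
    return PLACEHOLDER_SCORES[len(sources)]

for row in additive_ladder(["telemetry", "survey", "interviews"], fake_score):
    print(row)
```

The order you add sources in matters, because gains are not independent. If a source looks useless when added last, try it first before writing it off.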

Is more data better? The evidence says: much less than intuition suggests, with fast saturation and a possible ceiling in the predictability of people themselves. The interesting frontier isn't volume, it's kind: a handful of attitudinal answers may shift more red-zone questions to green than another year of telemetry.

What questions are askable? The coverage rubric above is a heuristic, not a science. Nobody has rigorously mapped which question types degrade how fast as they move away from the grounding data. Your Level 2 validation, run across question types, is a genuine contribution to that map.

Model and configuration. The mega-study's sweep found a mostly flat landscape: the best configuration was a frontier model at temperature 0, newer and fancier models didn't reliably help, and fine-tuning didn't either. Spend your budget on validation and data quality, not on model shopping.

Refresh cadence. A twin is a snapshot, and people drift. Nobody knows how fast twin fidelity decays as the grounding data ages. If your product changes quickly, assume months, not years, and rebuild accordingly.

And the question that isn't technical at all: consent and governance. You are building simulacra of real people from their data. The academic teams spent months with review boards on exactly this. In-house, at minimum: build from aggregated or pseudonymized records, never construct a twin of an identifiable individual without explicit thought (and probably legal review), keep the panel access-controlled, and be honest in every readout that the artifact is a simulation. The fastest way to discredit this method inside your organization is to be casual about whose data it's made of.

Closing

The build is genuinely small: some data plumbing, a profile builder, a prompt template, a parallel runner, a few hundred lines of Python. Anyone reading this can have a semi-working panel in a few days. A panel that deserves anyone's trust takes considerably longer, and that asymmetry is exactly the point. The build is not the project. The validation harness, the baseline ladder, the coverage rubric, the dispersion check, the willingness to distrust a 0% and a 100% equally: that's the project. That discipline is what separates a research instrument from a confidence machine, and right now most of the people selling and buying this technology are skipping it.

So build one. It's the fastest way I know to develop calibrated intuition about where synthetic methods work and where they quietly fail. Then test it like it's lying to you.

Because parts of it are.
