Towards an AI Maturity Ladder for UXR Teams

After I published my piece on micro research, I had a lot of conversations with researchers at different companies about how they are actually using AI in their work. And what became clear very quickly is that "using AI" means wildly different things depending on who you are talking to.

For some teams it means pasting transcripts into ChatGPT. For others it means purpose-built platforms with evidence trails. For a few it means rigorous calibration and error tracking. But almost everyone describes it the same way: "Oh yeah, we use it, but we always have a human in the loop."

Cool. What does your loop actually look like?

Because "human in the loop" can mean anything from rigorous evidence review to glancing at AI output and thinking "yeah, that sounds about right" before pasting it into a deck. And in my experience, it is usually closer to the second one.

That is not a human in the loop. That is a human next to the loop. Watching the loop. Cheering for the loop. Occasionally giving the loop a thumbs up as it runs past.

We need a shared vocabulary for what responsible AI use actually looks like in UXR. Not "just make sure a human checks the output." That advice is everywhere and it tells you nothing. We need something more concrete. A ladder.

This is a first attempt at building one. It will evolve. But even an imperfect framework beats the current state, which is everyone saying "human in the loop" and meaning completely different things.

The AI Maturity Ladder for UXR Teams runs from Level 0 to Level 4. Each level above the baseline is a real, achievable step. And I need you to be honest about where you actually are, not where you tell people you are in interviews.

Level 0: No AI

You do everything manually. Transcription, coding, synthesis, write-up. The whole thing is handcrafted, artisanal research.

This is fine. This is also increasingly rare and increasingly slow relative to what organizations expect. I am not going to romanticize it. Manual does not mean rigorous. Plenty of sloppy research was produced long before AI showed up. I have seen enough pre-AI decks with "most users" based on three people to know that fabrication is not an AI-specific problem.

Some well-known companies with established research teams are still operating fully at this level. No judgment on the researchers themselves. But the reality is that a UXR team that stays at Level 0 in 2026 is going to struggle. Not because manual research is bad, but because the speed gap between what you can deliver and what your organization expects will keep growing. And when research cannot match the tempo of product decisions, teams stop waiting for you. They make the call without you and backfill the rationale later. Your impact shrinks not because your work got worse but because it stopped arriving when it mattered.

The other problem is that Level 0 has no natural defense against the teams around you adopting AI badly. If engineering, design, and product are all using AI to generate their own "insights" from customer data, and you are the one function that cannot move fast enough to offer a credible alternative, you do not win by being pure. You lose by being absent.

But if you are here and it is working, nobody is forcing you to move. Just know that the rest of this article is for people who have already started climbing and need to figure out which rung they are actually standing on.

Level 0.5: The Vibes Level

This is not officially on the ladder but it is where most teams actually live, so we are going to talk about it.

You paste transcripts into ChatGPT or Claude. You ask it to "find the themes" or "summarize the key insights." It gives you five clean bullet points. You put them in a deck. Maybe you change "users expressed frustration" to "participants expressed frustration" because you are a professional. You present it as findings.

There is no audit trail. There is no evidence linking. There is no way for anyone, including you, to trace a claim back to a specific participant or moment. If someone asks "where did this insight come from?" the honest answer is "the AI said so and it sounded right."

At an organizational level, this is where UXR starts losing credibility without realizing it. The decks look great. Leadership is happy with the speed. But you are one "where did that come from?" away from a trust collapse. And when it happens, it does not just discredit that one finding. It raises questions about everything your team has shipped. The worst part is that you might never even get that question. Leadership might just quietly stop acting on your recommendations and you will wonder why your influence is shrinking.

This level is popular for a reason. It is fast. It feels productive. The output looks legitimate. And here is the truly dangerous part: it is right often enough that you start trusting it by default.

But "often enough" is not a research standard. "Often enough" is how you end up with phantom personas that read perfectly but are not grounded in actual evidence. It is how you end up with quotes in decks that nobody actually said because the AI "cleaned them up" into something more articulate. It is how you end up with "most users" when it was two people. Two.

The danger at this level is not that the AI is wrong sometimes. It is that you have no mechanism for knowing when it is wrong. You are flying blind and the instrument panel is showing you a beautiful sunset.

If this is you, do not feel bad about it. But stop telling people you have a human in the loop. You have a human adjacent to the loop. And those are very different things.

Level 1: AI for Formatting and Summarization Only

This is your safe starting point. The first real rung. The AI does not make new claims. It makes your existing claims prettier.

What this looks like in practice: You do your analysis. You identify your themes. You write your findings. Then you use AI to clean up the writing, format the deck, summarize a long transcript so you can navigate it faster, or restructure your notes into something more readable.

The key constraint is directionality. The AI is working downstream of your thinking, not upstream of it. It never sees raw data and produces conclusions. It sees your conclusions and helps you communicate them. This distinction sounds subtle. It is not. It is the entire difference between a research assistant and a research replacement.

"But Constantine, isn't that just using AI as a fancy text editor?"

Yes. That is exactly what it is. And that is the point.

This level sounds boring. Good. Boring is trustworthy. You should spend a meaningful amount of time here before moving up. Not because I enjoy gatekeeping, but because every level above this requires infrastructure that most teams have not built yet. And infrastructure at this level means one thing: getting your evidence house in order.

What you should build at this level:

A source-of-truth folder structure. Recordings, transcripts, survey exports, event logs. All in one place. Easy to retrieve in minutes, not hours. This matters more than any AI tool you will ever adopt, because without it you cannot verify anything regardless of who or what produced it.

Provenance metadata for everything. Participant, date, product version, task, condition. I know. It is tedious. Future you will be grateful. Current you will be annoyed. Do it anyway. You are building the foundation that makes every level above this possible.
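If it helps to see what that foundation can look like, here is a minimal sketch in Python. It assumes a hypothetical layout where each recording, transcript, or export has a small JSON sidecar file carrying its provenance fields; the field names and the missing_provenance helper are illustrative, not a standard any tool enforces.

```python
import json
from pathlib import Path

# Hypothetical layout: one folder per study, one "*.meta.json" sidecar next to each
# recording, transcript, or export. The field names are illustrative, not a standard.
REQUIRED_FIELDS = {"participant_id", "date", "product_version", "task", "condition"}

def missing_provenance(study_dir: str) -> dict[str, set[str]]:
    """Return, for each metadata sidecar, the provenance fields it is missing."""
    gaps: dict[str, set[str]] = {}
    for sidecar in Path(study_dir).rglob("*.meta.json"):
        metadata = json.loads(sidecar.read_text())
        missing = REQUIRED_FIELDS - metadata.keys()
        if missing:
            gaps[str(sidecar)] = missing
    return gaps

# Example: flag every artifact that future you will not be able to trace.
# print(missing_provenance("studies/2026-01-onboarding"))
```

However you store it, the point is the same: an artifact without provenance is an artifact you cannot verify later.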

If you skip this infrastructure and jump straight to Level 2, you will end up right back at Level 0.5 with extra steps.

Organizationally, this is where UXR starts rebuilding or maintaining trust on solid ground. Your speed improves because the AI handles the tedious parts, but every claim is still yours. When someone challenges a finding, you can back it up. That sounds basic. In 2026 it is becoming a competitive advantage. Teams that can say "here is exactly where this came from" are the teams that keep their seat at the decision table.

Level 2: AI Suggests, Humans Decide With Evidence Cards

Now we are getting somewhere interesting.

At this level, the AI has opinions. You have veto power. And critically, every acceptance and every rejection comes with evidence. Not vibes. Not "that sounds about right." Evidence.

What this looks like in practice: You use AI to suggest codes from your transcripts. You use AI to propose thematic groupings. You use AI to draft a codebook based on your research questions. But nothing gets accepted without a human reviewing it against the source material and attaching an evidence card.

Evidence cards are the thing that separates Level 2 from Level 0.5 wearing a lab coat. For every theme, whether the AI proposed it or you did, you create a card with four fields:

The claim. What are you asserting?

The supporting evidence. Direct excerpts from actual transcripts. Not paraphrases. Not summaries. The actual words that actual people said.

The counterexamples. What did you see that contradicts or complicates this theme? If the answer is "nothing," you probably did not look hard enough.

The scope notes. How many people? Which segment? What context? What do you explicitly not know?

If you cannot fill out the card, the theme does not exist yet. It is a hypothesis. Label it that way. There is nothing wrong with hypotheses. There is a lot wrong with hypotheses that got promoted to findings without earning it.
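If it helps to treat the card as a structured record rather than a paragraph in a doc, here is a minimal sketch. The EvidenceCard name and the two helper methods are my own illustration, not something a specific tool provides; the four fields are the ones described above.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceCard:
    """One card per theme, whether the AI proposed it or you did."""
    claim: str                                                    # what you are asserting
    supporting_evidence: list[str] = field(default_factory=list)  # verbatim excerpts, not paraphrases
    counterexamples: list[str] = field(default_factory=list)      # what contradicts or complicates it
    scope_notes: str = ""                                         # how many people, which segment, what context

    def is_finding(self) -> bool:
        # A card you cannot fill out is a hypothesis, not a finding.
        return bool(self.claim and self.supporting_evidence and self.scope_notes)

    def looked_hard_enough(self) -> bool:
        # Empty counterexamples usually means you did not look hard enough.
        return bool(self.counterexamples)
```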

This is the level where purpose-built research tools start earning their place. Platforms like Dovetail and Looppanel are designed for AI-assisted coding and tagging against actual transcripts. They keep the evidence connected to the output in ways that pasting into ChatGPT never will. The difference matters. When the tool is built around the research workflow rather than bolted onto it, evidence linking becomes a feature instead of a discipline you have to enforce manually.

What you should build at this level:

Spot checks. Double-code a subset of your data manually. Compare the AI's coding to yours. Where does it agree? Where does it diverge? Why? This is not busywork. This is calibration. You are learning the shape of the AI's blind spots so you know where to pay extra attention. There is a sketch below of what that comparison can look like.

A disconfirming evidence requirement. Make it a mandatory section in your analysis doc. Not optional. Not "if time allows." Required. The AI will not hunt for disconfirming evidence because it is optimized to build coherent stories, not to poke holes in them. That is your job. It is one of the most important parts of your job. Do not outsource it.

Session boundaries for AI. If the AI is present during live sessions (real-time transcription, live notes), keep it in observer mode. It watches. It does not decide. It does not summarize in real time for stakeholders who are watching along. That path leads to premature synthesis faster than anything I have seen. And premature synthesis is just a polite way of saying "we decided what the findings were before we finished the research."
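Here is the spot-check comparison mentioned above as a minimal sketch. It assumes you and the AI have each coded the same excerpts and that the codes live in plain dictionaries keyed by excerpt ID; the agreement_report function and that data shape are assumptions for illustration. It computes raw agreement and Cohen's kappa and lists the disagreements worth a closer look.

```python
from collections import Counter

def agreement_report(human: dict[str, str], ai: dict[str, str]) -> dict:
    """Compare human and AI codes on a double-coded subset, keyed by excerpt ID."""
    shared = sorted(human.keys() & ai.keys())
    if not shared:
        raise ValueError("No excerpts were coded by both human and AI.")
    pairs = [(human[x], ai[x]) for x in shared]
    n = len(pairs)
    p_observed = sum(h == a for h, a in pairs) / n
    # Chance agreement from each coder's label distribution (Cohen's kappa).
    h_dist = Counter(h for h, _ in pairs)
    a_dist = Counter(a for _, a in pairs)
    p_chance = sum(h_dist[c] * a_dist[c] for c in h_dist) / (n * n)
    kappa = 1.0 if p_chance == 1 else (p_observed - p_chance) / (1 - p_chance)
    # Disagreements are where you go looking for the AI's blind spots.
    disagreements = [(x, human[x], ai[x]) for x in shared if human[x] != ai[x]]
    return {"n": n, "agreement": p_observed, "kappa": kappa, "disagreements": disagreements}
```

The number matters less than the pattern: where the disagreements cluster is where you double-check the AI from now on.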

This is the level where UXR starts to scale without multiplying headcount. You can handle more questions per quarter because the AI is doing real analytical work, not just formatting. But because every output goes through evidence cards, your team's credibility actually increases as throughput goes up. That is rare. Most functions trade quality for speed. At Level 2, you are trading process overhead for speed and keeping quality constant. That is the pitch your leadership wants to hear.

This is the level most teams say they are at when you ask them. This is the level almost nobody is actually at.

Level 3: AI Drafts, Nothing Ships Without Trace Links

Level 3 is powerful. At this level, the AI drafts full findings sections. It writes the narrative. It proposes recommendations. It might even structure the deck. It does real work. And you let it, because you have earned the right to let it by building the infrastructure in Levels 1 and 2.

But here is the non-negotiable that makes Level 3 Level 3 and not Level 0.5 with better prompts: nothing ships without trace links for every claim.

Think of it like code review. A pull request does not merge without review. A finding does not ship without evidence links. Not some findings. Not the important ones. All of them.

This is also where end-to-end AI research platforms like Outset.ai start making serious sense. These tools can moderate, capture, and synthesize responses from real participants at scale. The critical word there is "real." These are not synthetic users or simulated personas. These are actual people answering actual questions. The AI handles moderation and initial synthesis, but the evidence trail stays intact because the platform is built to preserve it. That is fundamentally different from pasting a transcript into a general-purpose LLM and hoping for the best.

The Deck Gate:

Every key claim has at least two direct evidence anchors. Quotes, clips, screenshots, logs, survey items. Pick two minimum. One is weak. Zero is not a finding. It is fan fiction.

Every prevalence statement includes an explicit base. N, segment, study type. "Many users" is not a finding. It is a guess wearing a trench coat. Replace "most" with "5 of 7 participants in the rider segment during the US study." Yes, it is less elegant. It is also not made up.

No verbatim quote goes in a deck unless a human verifies it against the recording or transcript. Zero exceptions. I mean zero. The AI will "clean up" quotes in ways that change meaning. Sometimes subtly. Sometimes not. A participant saying "I guess it was fine" and the AI rendering that as "the experience met expectations" is not a minor edit. It is a different finding.

What you ship at this level:

Two layers. The executive narrative for the meeting and an evidence appendix that can be audited in minutes. If someone questions a decision your research informed, tracing it back should take thirty seconds, not a scavenger hunt through old decks.

If any item in the deck gate fails, the finding stays in working notes. It does not go in the readout. Period. I do not care how good the story is. If it cannot show receipts, it stays in draft.
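Treated as a check rather than a vibe, the gate fits in a few lines. This is a minimal sketch, assuming a hypothetical Finding record whose fields mirror the gate above: evidence anchors, an explicit base, and a flag for human-verified quotes.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    claim: str
    evidence_anchors: list[str] = field(default_factory=list)  # links to quotes, clips, screenshots, logs
    base: str = ""                                              # e.g. "5 of 7 participants, rider segment, US study"
    quotes_verified: bool = False                               # every verbatim checked against the recording

def passes_deck_gate(finding: Finding) -> bool:
    """Ship only what clears every gate condition; everything else stays in working notes."""
    return (len(finding.evidence_anchors) >= 2   # two direct anchors minimum
            and bool(finding.base)               # explicit N, segment, study type
            and finding.quotes_verified)         # no unverified quotes in the deck

# readout = [f for f in findings if passes_deck_gate(f)]
```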

At this level, UXR becomes genuinely hard to ignore in the organization. You are fast. You have receipts. Research stops being the function people route around when they are in a hurry and starts being the function people pull into decisions because it actually makes them better and faster.

Level 4: AI as a Calibrated Research Instrument

This is the top of the ladder. Almost nobody is here. But it is worth describing because it shows where all of this is heading.

At Level 4, you stop treating AI as a novelty or a shortcut and start treating it like what it actually is: a research instrument. And like any research instrument, you calibrate it, you measure its reliability, and you track its error rates over time.

What this looks like in practice:

You maintain a log of AI-generated outputs that were verified, modified, or rejected. You look for patterns. Does the AI consistently inflate certain types of findings? Does it struggle with specific participant populations? Does it miss nuance in particular product domains? Does it have a tendency to round ambiguity into certainty in specific, predictable ways? There is a sketch below of how to summarize such a log.

You run periodic audits. Pull a random sample of shipped findings. Trace them back to raw evidence. Not because you suspect a problem, but because that is what quality control looks like. Every research instrument drifts. AI is not an exception.

You develop explicit criteria for when AI assistance is appropriate and when it is not. And those criteria are based on empirical evidence from your own practice. Not from vibes. Not from vendor marketing. Not from a blog post some guy wrote. From your data about your process.

At this level, you are evaluating the tools themselves. How accurate is Dovetail's auto-coding in your domain? How does Outset's AI moderation compare to a human moderator for your population? Where does Looppanel's thematic analysis diverge from manual coding, and is the divergence systematic or random? You treat every tool the way you would treat a new survey scale. You do not assume validity. You test it.
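To make the verification log and the periodic audits concrete, here is a minimal sketch. The CSV columns (output_type, verdict) and both helper functions are assumptions about how you might keep such a log, not a format any tool prescribes: one summarizes verified, modified, and rejected rates per output type, the other pulls a random sample of shipped findings to trace back to raw evidence.

```python
import csv
import random

def verdict_rates(log_path: str) -> dict[str, dict[str, float]]:
    """Share of verified / modified / rejected outputs per output type."""
    counts: dict[str, dict[str, int]] = {}
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            by_type = counts.setdefault(row["output_type"], {})
            by_type[row["verdict"]] = by_type.get(row["verdict"], 0) + 1
    return {t: {v: n / sum(c.values()) for v, n in c.items()} for t, c in counts.items()}

def audit_sample(shipped_findings: list[str], k: int = 10, seed: int = 0) -> list[str]:
    """Pull a random sample of shipped findings to trace back to raw evidence."""
    return random.Random(seed).sample(shipped_findings, min(k, len(shipped_findings)))
```

A rejection rate that climbs for one output type is exactly the kind of drift this level exists to catch.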

This is where AI in UXR becomes genuinely transformative. Not because the AI got smarter. Because you understand exactly where it is reliable and where it is not. You have research on your research tools. You have data about your data process. And that meta-knowledge is what lets you push the speed dial further without the whole thing falling apart.

Most teams will not get here for a while. That is okay. But knowing this level exists changes how you think about every level below it. You are not just "being careful." You are building toward a system.

This is where UXR transforms from a service function into a strategic capability. You do not just produce insights. You produce insights with known reliability, and you can tell leadership exactly how much to trust them. That is a fundamentally different conversation than most research teams have ever had. You are not asking for a seat at the table. You are the reason the table makes better decisions and everyone knows it.

So Where Are You? (For Real This Time)

Here is the uncomfortable exercise. Forget what you put on your LinkedIn post about responsible AI. Forget what you said in your last job interview. Look at your last three projects.

Could you trace every claim in the final deck back to a specific piece of evidence in under five minutes? If yes, you might actually be at Level 3.

Did you create evidence cards with counterexamples? If yes, you are at Level 2.

Did the AI only touch your work after you had already done the analysis? Level 1.

Did you paste a transcript into an LLM and use what came back as a starting point for your findings without systematic verification? Level 0.5. And that is where most people are. It is not a moral failing. It is a process gap. But it is a process gap that will eventually produce a finding that influences a decision that ships a thing that did not need to be shipped. And by the time anyone traces it back, the damage is done.

The Part Where I Tell You Not to Panic

If you are reading this and doing uncomfortable math about insights you already shipped, here is what to do.

Identify the blast radius. Which decks, roadmaps, and decisions were informed by AI-assisted analysis that did not go through the checks above?

Pull the top ten claims that influenced actual decisions. Re-verify them from raw evidence.

Re-label anything that cannot be traced as "unverified." Remove it from active decision inputs. This will feel terrible. Do it anyway. A temporarily painful conversation now is better than a permanently wrong product decision later.

Then patch the process. Fixing one deck does not fix the system that produced it.

Why This Matters More Than You Think

Here is the thing people miss when they hear "be more careful with AI."

AI increases your throughput. Which means research discipline becomes more valuable, not less. The competitive edge for UXR is not autonomous research. Nobody serious wants that. Autonomous research is just a fancy way of saying "we stopped checking and we are fine with it."

The edge is faster research that can show its work. Every level of this ladder makes you faster than doing things manually. Level 1 is faster. Level 2 is faster. Level 3 is significantly faster. This is not about slowing down. It is about knowing what you know and being able to prove it when someone asks.

The researcher who can move fast and show receipts is the one who becomes indispensable. The one who moves fast and cannot explain where the insights came from is building on sand and calling it a foundation.

Look at the ladder. Be honest. Move up one level. Not four. Just one.

Once you have the infrastructure from Levels 1 through 3, something interesting happens. You stop needing two-week cycles for every question. You can scope tighter, move in 24 to 72 hours, and still have auditable evidence behind every claim. That is a different operating mode for research. Not traditional research done faster. Something else entirely. I have been writing about that separately and will share more soon.

And if your workflow cannot show receipts quickly, fix that before you fix anything else. Because without receipts, everything you ship is one audit away from being reclassified as fiction.

If any of this resonates and you want to swap notes on how it plays out in practice, reach out. The more complex the org, the more interesting the conversation.

🎯 AI does not kill research credibility. Skipping the evidence does. If you want unfiltered writing on how UXR actually works (and why faster does not mean better unless you earn it), subscribe.