Anchoring Effects in Self-Reported Experience Ratings: A Proposed AI-forward Calibration Instrument for CSAT
You have asked this question. I have asked this question. Somewhere right now a researcher is asking this question to a participant who is about to say 7, and both of them know the 7 means nothing, and the session continues anyway.
The experience scale is the most broken instrument in our field that everyone keeps using. Not because the question is bad. Because nobody knows what the numbers mean. Ask ten people what an 8 feels like and you get ten different answers, none of which match the 8 in your spreadsheet. Then we average them, call it CSAT, and tie someone's bonus to it.
Medicine figured this out decades ago. The pain scale has little faces on it. The face at 10 is crying. The face at 1 is having a normal day. You point at a face and the nurse knows what to do with you. Blood alcohol charts tell you exactly what 0.08 looks like, which is how we all learned that "slightly impaired judgment" is a legal category.
Experience has no faces. Experience has no chart. Experience has a 7.
So I built my own anchors. I have been using these for years. You are welcome to steal them, your participants will thank you, your dashboards will not.
The Calibrated Experience Scale
1. Mild food poisoning. The floor. Not severe food poisoning, that would be cheating. Mild. You can still attend the meeting, you just wish you couldn't. This is what a 1 feels like, and once your participants know this, they stop giving 1s to checkout flows that are merely confusing. Confusing is not food poisoning. Confusing is a 3.
2. Holding the door for someone slightly too far away. You made eye contact. You committed. They are now doing the little jog, and you are standing there holding a door for nine full seconds, performing patience. Nobody is having a bad time exactly. Nobody is having a good time either. That is a 2.
3. The jury duty orientation video. Produced in 2004. A judge explains civic responsibility over stock footage of a flag. You are not in pain. You are experiencing the absence of every other thing you could be doing, rendered in 480p. A surprising amount of enterprise software lives here permanently.
4. Watching a stranger parallel park while a line of cars forms. There is tension. There are stakes. You are invested against your will. Technically this is engagement, which should worry you about how your team measures engagement.
5. Getting the shopping cart that doesn't pull left. The midpoint of human experience. Things are working as intended. Nothing is delighting you. The cart goes where you push it. Half of all software aspires to this and does not reach it.
6. Peeling the plastic film off new electronics. Now we are getting somewhere. A small, complete, frictionless pleasure. It costs nothing, it lasts four seconds, and people will film it. If your product has one moment like this, your participants will mention it unprompted in every session. You will then build a roadmap around it and ruin it.
7. A meeting you dreaded gets cancelled. This is what 7 actually means, and it explains everything. A 7 is relief, not satisfaction. A 7 is the absence of a bad thing. When your product scores a 7, your users are telling you it did not hurt them. Put that in the QBR deck and see how the room feels.
8. Your flight is delayed, but you're already home and it's someone else's flight. Pure, undeserved, slightly shameful pleasure. The first number on this scale where the participant smiles while answering. If your product lives here, you do not need an experience metric, you need a waitlist.
9. The neighbor who reported your fence to the HOA gets a parking ticket. Justice. Narrative resolution. A 9 is not an experience, it is a story the participant will tell at dinner. You cannot design for a 9. You can only build the conditions and wait.
10. The dryer returns all your socks and there's twenty dollars in the lint trap. Has never been recorded in captivity. By definition, anyone experiencing a 10 cannot be reached for survey administration. The only confirmed 10s in the literature occurred in childhood, and that sample is unreachable for follow-up.
Methodology Notes
Participants
N = whoever answered the intercept. Recruitment consisted of a survey that appeared while participants were trying to do something else. Of those exposed, 4% responded, and of those, an unknown number were trying to make it go away. We refer to this as a convenience sample, although it was not convenient for anyone.
No screener was used. The screener is the intercept surviving long enough to be answered.
Apparatus
The instrument is a single item: "From 1 to 10, how good was that experience?" administered with the calibration anchors described above. Participants were shown the anchors before responding. Several asked if the food poisoning was hypothetical. We told them yes. We are no longer certain this improved data quality.
Procedure
Participants experienced a product, were interrupted, and were asked to convert the experience into an integer. The conversion happened in roughly 1.8 seconds, which is also approximately how long the experience itself was remembered. No incentive was offered beyond the survey's promise that it would "only take a minute," a claim that has never once been audited.
Validation
Test-retest reliability was established by asking twice. Both answers were 7. We consider the matter closed.
Internal consistency was excellent, because participants stopped reading after item 3 and selected a straight column of 7s. Cronbach's alpha came back at .96. Cronbach would like his alpha back.
Construct validity remains an open question, in the sense that nobody can say what the construct is. The item asks about the experience. The participant answers about their day. The dashboard records loyalty. Three constructs enter, one number leaves.
Analysis
The scale is ordinal, not interval. The distance between food poisoning and the door hold is not the distance between the HOA ticket and the lint trap money. Your stats package does not know this. It will compute a mean of 6.4 anyway, the 6.4 will go in a deck with an arrow pointing up from last quarter's 6.1, and a PM will ask you to decompose the 0.3. You cannot decompose the 0.3. The 0.3 is noise wearing a blazer.
Limitations
Nobody answers 5. Respondents pass directly from 4 to 6 like an elevator skipping the 13th floor. A 5 means telling you, to your face, that your product is a functioning shopping cart. People would rather lie.
The top of the scale cannot be measured at all. A true 9 or 10 is the state in which people stop being respondents. The diary entries go missing. The session overruns. The participant forgets you exist, which is the strongest positive signal in all of research and the one thing this instrument structurally cannot record. The 10 is inferred from the absence of data, like a black hole.
Finally, the instrument has not been validated outside the author's household, and the author's household has asked him to stop.
Ethics statement
No participants were harmed during calibration. One participant experienced stimulus 1, but it was mild.
Future Work
A 100-point version, for teams who believe the problem is resolution.
An NPS adaptation, in which the anchors stay the same but anyone below
the parallel parker counts as a detractor. A version that goes to eleven, for stakeholders who have seen Spinal Tap and will mention it.
Conclusion
The anchors will not fix your experience metric. Nothing will fix your experience metric. But the next time the number comes back as a 7, you will know exactly what your users are telling you.
P.S. The term "AI-forward" in the title refers to nothing. I added it for engagement after the original title underperformed in pilot testing (N = me, staring at it). This is the only part of the methodology that was validated.
🎯 New here? Subscribe. Most posts are more serious than this one. Not all of them.