Generative Agent Simulations of 1,000 People
A paper that thoroughly executes a parity study between Synthetic and Organic users.
How do we know we are right? How do we know our Synthetic Users are as real as organic users? We compare.
At Synthetic Users we have a page stuck to our wall that simply reads: The best possible user interviews using LLMs. That’s great, but how do we ensure they are indeed the best? How do we know we are right?
We use various services to recruit users: Prolific.com, Usertesting.com, UserInterviews.com. All we can say is that it’s painful and that participants don’t always show up… but this whole process is absolutely necessary for us. It’s the only way we know we’re getting better Synthetic Users.
We used Lookback to run the interviews, which were then transcribed.
In this case, the interviews aim at exploring how primary and secondary school teachers in the UK incorporate technology into their classrooms.
Here is a zoomed-out version. At the end of every interview we append the topics mentioned in that interview (effectively starting our codebook: a collection of themes/topics we can then compare with the Synthetic side).
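To make the idea concrete, here is a minimal sketch (in Python, with hypothetical topic lists rather than our actual tooling) of how a codebook like this can be accumulated, mapping each theme to the interviews that mention it:

```python
from collections import defaultdict

# Hypothetical topics appended at the end of each organic interview transcript.
organic_interview_topics = {
    "interview_01": ["interactive whiteboards", "screen time", "training gaps"],
    "interview_02": ["screen time", "parental concerns", "budget constraints"],
    "interview_03": ["training gaps", "budget constraints", "student engagement"],
}

# Build the codebook: each theme/topic maps to the interviews that mention it.
codebook = defaultdict(list)
for interview_id, topics in organic_interview_topics.items():
    for topic in topics:
        codebook[topic].append(interview_id)

for topic, sources in sorted(codebook.items()):
    print(f"{topic}: mentioned in {len(sources)} interview(s)")
```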
First, we identify the themes and topics that are common to both sets of interviews.
Qualitative Analysis: We manually review the flow of conversation, use of idiomatic expressions, and overall readability. We call this glanceability, and with it we assess whether the responses are relevant and appropriate for the questions asked, mirroring the real interview's context-responsiveness. To be honest, for the purposes of this comparison study we are initially more interested in content overlap, because we can always drill deeper with Synthetic Interviews if we feel an interview stays too shallow.

Context Analysis: We determine how well the participant responds to the context provided in the interview questions. We do this through topic consistency, comparing the number of topics brought up, first in these organic interviews, later in the synthetic interviews.
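As a rough illustration of that topic-consistency check, assuming each interview has already been reduced to a set of topics (the sets below are hypothetical, not data from this study):

```python
# Hypothetical topic sets, one per side, after each interview has been coded.
organic_topics = {"screen time", "training gaps", "budget constraints",
                  "student engagement", "parental concerns"}
synthetic_topics = {"screen time", "training gaps", "budget constraints",
                    "student engagement", "assessment tools"}

shared = organic_topics & synthetic_topics
only_organic = organic_topics - synthetic_topics
only_synthetic = synthetic_topics - organic_topics

print(f"Topics in organic interviews:   {len(organic_topics)}")
print(f"Topics in synthetic interviews: {len(synthetic_topics)}")
print(f"Shared topics ({len(shared)}): {sorted(shared)}")
print(f"Only organic:   {sorted(only_organic)}")
print(f"Only synthetic: {sorted(only_synthetic)}")
```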
Here’s the manual process from a previous comparison study.
To make this more measurable we also run a Word and Phrase Frequency Analysis plus an N-gram Analysis, where we compare the frequency of certain words or phrases to identify linguistic patterns. There is software out there, like NVivo, that can help you do this quicker. To perform a quantitative textual analysis we extract and count the occurrences of individual words and phrases that are relevant to the themes of both sets of interviews. In this case we use a simple GPT. Given the length of the interviews, we focus on key terms related to the teachers’ challenges, emotions, and solutions. It’s important to narrow down the criteria for parity, otherwise the target is too large.
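A minimal sketch of that kind of counting, in plain Python rather than NVivo or a GPT, with hypothetical transcript snippets standing in for the real interviews:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Build overlapping n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical snippets; in practice these would be the full transcripts.
organic = "The interactive whiteboard is great but the training was too short."
synthetic = "The interactive whiteboard helps, although training time is too short."

for label, text in [("organic", organic), ("synthetic", synthetic)]:
    tokens = tokenize(text)
    word_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    print(label, word_counts.most_common(3), bigram_counts.most_common(2))
```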
If you run organic interviews on a regular basis, you are familiar with the process up to here.
You can see the 8 Synthetic Interviews on the right.
Based on these common themes, we quantify the level of overlap. Because these themes are fundamental to both sets of interviews, they contribute significantly to the similarity score; if they were the only factors considered, the score might be quite high (around 90+).
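One simple, illustrative way to turn shared themes into a number is a Jaccard-style overlap (shared themes divided by all distinct themes observed). This is only a stand-in for our scoring, using hypothetical theme sets:

```python
def thematic_overlap(themes_a, themes_b):
    """Jaccard overlap: shared themes divided by all distinct themes observed."""
    if not (themes_a or themes_b):
        return 0.0
    return len(themes_a & themes_b) / len(themes_a | themes_b)

organic_themes = {"screen time", "training gaps", "budget constraints",
                  "student engagement", "parental concerns"}
synthetic_themes = {"screen time", "training gaps", "budget constraints",
                    "student engagement", "assessment tools"}

print(f"Thematic overlap: {thematic_overlap(organic_themes, synthetic_themes):.0%}")
```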
Quantitatively calculating a parity number in this context involves subjective judgment, since we're dealing with qualitative data. Here is a simple framework we use (a rough scoring sketch follows these criteria):
Thematic overlap (30%)
Depth and specificity of insights (30%)
Comprehensiveness of coverage (20%)
Qualitative alignment (20%)
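A minimal sketch of how those weights can be combined into a single number, assuming each criterion has already been scored on a 0–100 scale (the component scores below are hypothetical, not the results of this study):

```python
# Weights taken from the framework above.
WEIGHTS = {
    "thematic_overlap": 0.30,
    "depth_and_specificity": 0.30,
    "comprehensiveness": 0.20,
    "qualitative_alignment": 0.20,
}

# Hypothetical component scores on a 0-100 scale.
component_scores = {
    "thematic_overlap": 92,
    "depth_and_specificity": 85,
    "comprehensiveness": 80,
    "qualitative_alignment": 88,
}

parity_score = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
print(f"Overall parity score: {parity_score:.1f} / 100")  # 86.7 with these inputs
```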
In the case of this Comparison study, both sets of interviews score highly across all these criteria, with particularly strong performance in thematic overlap and qualitative alignment.
c. Be specific and get specificity back. Specifying users’ qualities will provide more specific answers. Be as specific as you can be. This is especially relevant if you don’t want your Synthetic Users to default to broader, more encompassing levels of concern, e.g. prioritising policy over personal budget.
This is what we tell all our users when they are starting out: try it, compare it with an existing study, put the Synthetic interviews next to the organic ones, and see how comfortable you feel about it.