Rashmi is a content strategy, information architecture, and data science practitioner with deep experience in technical documentation, content operations, and AI-assisted systems. She works at the intersection of content, IA, analytics, and AI, focusing on making complex information easier to find, use, and trust—for both people and AI. Over the past year, she has led the evaluation and production readiness of an AI-powered documentation chatbot, combining human review with automated analysis to understand performance, build confidence, and continuously improve the system.
Beyond CSAT: How a Golden Question Set Helped Us Build Trust in Our AI Agent
Most teams wait until after launch to ask whether their chatbot or AI agent is actually working, relying on CSAT, case deflection, or anecdotal feedback. We took a different approach: we focused on building trust in the AI before real users ever touched it. In this case study, I'll share how we used search data to mine intents and create a "golden set" of user questions. This golden question set became the foundation for an early accuracy baseline, structured human evaluation, internal rollout testing, and continuous automated self-evaluation after each update. I'll also show how a zero-cost chatbot data analysis agent helped surface key intents and actionable insights, moving accuracy from 45% to 80% in three months and enabling a confident production rollout in under six months. This session focuses on building confidence in AI through intent, human judgment, and scalable evaluation, not just metrics.
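To make the golden-set idea concrete, here is a minimal sketch of how frequent queries might be pulled from a search log to seed such a set. The file name ("search_log.csv") and the "intent"/"query" column names are illustrative assumptions, not the production pipeline described in the session.

```python
# Minimal sketch: seed a golden question set from search-log frequency.
# ASSUMPTIONS: the CSV path "search_log.csv" and its "intent"/"query"
# columns are illustrative, not the talk's production pipeline.
import csv
from collections import Counter, defaultdict


def build_golden_set(log_path: str, per_intent: int = 10) -> dict[str, list[str]]:
    """Return the most frequent queries per intent from a search log."""
    counts: defaultdict[str, Counter] = defaultdict(Counter)
    with open(log_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            intent = row["intent"].strip().lower()
            query = row["query"].strip()
            if intent and query:
                counts[intent][query] += 1
    # Keep the top-N queries per intent as golden questions.
    return {
        intent: [q for q, _ in counter.most_common(per_intent)]
        for intent, counter in counts.items()
    }


if __name__ == "__main__":
    for intent, questions in build_golden_set("search_log.csv", per_intent=5).items():
        print(f"{intent}: {questions}")
```

Grouping by intent before sampling keeps the set representative of what users actually ask, rather than over-sampling one popular topic.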
In this session, attendees will learn how to:
- Use search data to create a golden set of user questions before launching an AI agent.
- Apply AI-assisted intent analysis to identify query patterns.
- Involve writers in human-in-the-loop evaluation to build shared ownership and trust.
- Design a structured evaluation scale that goes beyond “good/bad” answers (a minimal sketch follows this list).
- Move from manual testing to AI-based self-evaluation for rapid iteration.
- Define meaningful performance indicators tied to user intent—not just CSAT.
- Use continuous evaluation to drive measurable accuracy gains (45% → 80%) and a confident production rollout.
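As a concrete illustration of a structured evaluation scale and the accuracy metric it feeds, here is a minimal sketch assuming a four-point rubric. The labels, the passing threshold, and the sample judgments are illustrative assumptions, not the rubric used in the project.

```python
# Minimal sketch: a structured rating scale plus an accuracy rollup.
# ASSUMPTIONS: the four-point rubric, the passing threshold, and the
# sample judgments are illustrative, not the project's actual rubric.
from dataclasses import dataclass
from enum import IntEnum


class Rating(IntEnum):
    WRONG = 0         # incorrect or misleading answer
    PARTIAL = 1       # on topic but missing key facts
    MOSTLY_RIGHT = 2  # accurate but incomplete coverage
    CORRECT = 3       # accurate and complete


@dataclass
class Judgment:
    question: str
    rating: Rating


def accuracy(judgments: list[Judgment],
             passing: Rating = Rating.MOSTLY_RIGHT) -> float:
    """Share of golden-set answers rated at or above the passing bar."""
    if not judgments:
        return 0.0
    return sum(j.rating >= passing for j in judgments) / len(judgments)


if __name__ == "__main__":
    sample = [
        Judgment("How do I reset my API key?", Rating.CORRECT),
        Judgment("Which plans include SSO?", Rating.PARTIAL),
        Judgment("Where are audit logs stored?", Rating.MOSTLY_RIGHT),
    ]
    print(f"Golden-set accuracy: {accuracy(sample):.0%}")
```

A graded scale like this lets the same rubric serve both stages: human reviewers apply it during structured evaluation, and an automated judge can emit the same labels later, so manual and self-evaluation scores remain comparable across updates.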


