Prototyped an analytics tool that helps product teams review AI outputs at scale and surfaces performance patterns with actionable fixes.
As a PM working on AI products, I kept running into the same bottleneck: diagnosing poor outputs. Every week my team and I spent hours manually reviewing hundreds of cases, trying to work out whether failures came from missing internal knowledge, prompt guardrails loose enough to allow hallucination, or edge-case inputs that broke downstream logic.
The process was slow and rarely produced clear next steps. This motivated me to prototype an analytics tool that could review AI outputs at scale, cluster failure patterns, and recommend targeted fixes.
I started by mapping the exact steps of my weekly case review: connecting to the data source, defining the judgement criteria, scanning outputs to diagnose vague or wrong responses, categorizing failure reasons, and deciding what to change. That workflow became the backbone of the tool.
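To make that backbone concrete, here is a minimal sketch of those stages in Python. Everything in it is illustrative rather than the prototype's real code: it assumes cases arrive as a JSONL export with id, input, and output fields, and a trivial length check stands in for the actual judgement criteria.

```python
# Illustrative skeleton of the weekly review workflow; names, file path, and the
# length-based heuristic are assumptions, not the prototype's implementation.
import json
from dataclasses import dataclass

@dataclass
class ReviewCase:
    case_id: str
    user_input: str
    model_output: str
    verdict: str = "unreviewed"   # e.g. "pass", "vague", "wrong"
    failure_reason: str = ""      # e.g. "missing knowledge", "weak guardrails"

def load_cases(path: str) -> list[ReviewCase]:
    """Step 1: connect to the data source (here, a hypothetical JSONL export)."""
    with open(path) as f:
        return [ReviewCase(r["id"], r["input"], r["output"]) for r in map(json.loads, f)]

def judge(case: ReviewCase, min_length: int = 40) -> ReviewCase:
    """Step 2: apply judgement criteria; a crude length check stands in for the real rules."""
    if len(case.model_output) < min_length:
        case.verdict, case.failure_reason = "vague", "response too short to be actionable"
    else:
        case.verdict = "pass"
    return case

def categorize(cases: list[ReviewCase]) -> dict[str, list[ReviewCase]]:
    """Steps 3-4: scan flagged outputs and bucket them by failure reason."""
    buckets: dict[str, list[ReviewCase]] = {}
    for c in cases:
        if c.verdict != "pass":
            buckets.setdefault(c.failure_reason, []).append(c)
    return buckets

if __name__ == "__main__":
    failures = categorize([judge(c) for c in load_cases("weekly_cases.jsonl")])
    for reason, cases in failures.items():
        print(f"{reason}: {len(cases)} cases")  # Step 5: decide what to change per bucket
```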
The biggest bottleneck was manually scanning every case. I scoped the tool to prioritize surfacing failures that were recurring and high-impact, allowing teams to focus on the most valuable problems instead of reviewing every example.
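That scoping decision boils down to a simple ranking idea: score each failure category by how often it recurs and how costly it is, then surface only the top offenders. A hedged sketch of that scoring, with made-up impact weights purely for illustration:

```python
# Rank failure categories by frequency * impact; the weights are illustrative assumptions.
from collections import Counter

IMPACT_WEIGHT = {"hallucination": 3.0, "missing knowledge": 2.0, "edge-case input": 1.0}

def prioritize(failure_reasons: list[str], top_n: int = 3) -> list[tuple[str, float]]:
    """Score each failure reason by recurrence times impact and return the top offenders."""
    counts = Counter(failure_reasons)
    scored = {reason: n * IMPACT_WEIGHT.get(reason, 1.0) for reason, n in counts.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(prioritize(["hallucination", "hallucination", "edge-case input", "missing knowledge"]))
# [('hallucination', 6.0), ('missing knowledge', 2.0), ('edge-case input', 1.0)]
```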
I then built a working prototype around these requirements to automate the workflow:
Automated clustering of cases that share the same failure pattern
LLM-powered recommendations for prompt improvements and other targeted fixes (a rough sketch of both pieces follows below)
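The sketch below shows how those two pieces could fit together, under assumptions that are mine rather than the prototype's: TF-IDF plus k-means stands in for whatever clustering the real tool uses, and the OpenAI client with an illustrative model name ("gpt-4o-mini") stands in for the recommendation call.

```python
# Illustrative pairing of failure clustering with LLM-generated fix suggestions;
# the libraries, model name, and prompt wording are assumptions, not the real stack.
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cluster_failures(failure_notes: list[str], n_clusters: int = 5) -> dict[int, list[str]]:
    """Group cases whose failure descriptions look alike."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(failure_notes)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    clusters: dict[int, list[str]] = {}
    for label, note in zip(labels, failure_notes):
        clusters.setdefault(int(label), []).append(note)
    return clusters

def recommend_fix(examples: list[str]) -> str:
    """Ask an LLM for one targeted prompt or guardrail change per cluster."""
    prompt = (
        "These AI outputs were flagged as failures for a similar reason:\n"
        + "\n".join(f"- {e}" for e in examples[:5])
        + "\n\nSuggest one concrete prompt or guardrail change that addresses this pattern."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Example usage: cluster this week's flagged cases, then get one recommendation per cluster.
# for label, notes in cluster_failures(flagged_notes).items():
#     print(label, recommend_fix(notes))
```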
Once I had a working prototype, I shared it with peers working on AI products. What resonated most with them was the narrative layer. Instead of producing raw scores or pass/fail labels, the tool explained why outputs failed and pointed to concrete next steps.
By turning debugging into a more structured process, the tool helped me and my peers move efficiently from vague “bad outputs” to clear and actionable fixes.