Deep Research: Comparing Model Performance
Multi-step AI workflow for comparing models head-to-head on a real dataset.
Overview
Run multiple models in parallel against the same inputs from your log set, compare their outputs side by side, and pick the best one. The same pattern works on any input column: agent traces, chat prompts, support tickets, or document corpora. This walkthrough uses the public financial-news-articles dataset as a concrete example, but you can substitute the user-message column from your own chat logs or the prompt column from your agent traces.
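The fan-out pattern behind this workflow can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: `MODELS` and the `summarize` stub are hypothetical placeholders for whatever completion call your setup uses, and the stub just echoes its inputs so the sketch runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["claude-haiku", "claude-sonnet", "gpt-5-mini"]

def summarize(model: str, text: str) -> str:
    # Stand-in for a real completion call; returns a stub summary so the
    # sketch is runnable without API access.
    return f"[{model}] summary of: {text[:40]}"

def compare_on_inputs(inputs):
    # Fan each input out to every model in parallel, then collect the
    # results side by side, one dict of {model: summary} per input row.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        rows = []
        for text in inputs:
            futures = {m: pool.submit(summarize, m, text) for m in MODELS}
            rows.append({m: f.result() for m, f in futures.items()})
    return rows

articles = ["Fed holds rates steady amid inflation concerns."]
results = compare_on_inputs(articles)
for model, summary in results[0].items():
    print(model, "->", summary)
```

Because every model sees the identical input, any difference in the collected summaries is attributable to the model rather than the prompt.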

Steps
- Load the dataset
- Give the model a research prompt
  Use chat to request: "Summarize each news article using claude-haiku, claude-sonnet, and gpt-5-mini. Then compare their summaries and explain how the models perform differently. Which model do you recommend for best quality?"
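The "load the dataset" step boils down to selecting the one input column the models will see. As a rough sketch (the `rows` sample and `input_column` helper below are illustrative, not part of the platform), swapping the column name is all it takes to point the same workflow at chat logs or agent traces instead of news articles:

```python
# Tiny in-memory stand-in for the dataset; in practice you would load the
# real financial-news-articles dataset, or your own logs.
rows = [
    {"title": "Fed holds rates", "text": "The Federal Reserve kept rates steady..."},
    {"title": "Tech rally", "text": "Chipmakers led a broad market rally..."},
]

def input_column(rows, column="text"):
    # Pull out the single column the models will summarize; substitute
    # e.g. "prompt" for agent traces or "user_message" for chat logs.
    return [r[column] for r in rows]

articles = input_column(rows)
print(articles)
```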
Expected Results
- Summary Comparison: An analysis of the strengths and weaknesses of each model's summaries
- Model Recommendation: A recommendation for which model produces the best summaries
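In chat, the model itself weighs the summaries and makes the recommendation. Purely to illustrate the shape of that final step, here is a toy programmatic judge: `recommend` is a hypothetical helper that prefers the model whose summaries average closest to a target length. A real comparison would use an LLM judge or human ratings rather than a length heuristic.

```python
def recommend(rows, target_len=60):
    # rows: one dict per article mapping model name -> its summary.
    # Toy heuristic: pick the model whose average summary length is
    # closest to target_len. This stands in for a real quality judge.
    scores = {}
    for model in rows[0]:
        lengths = [len(row[model]) for row in rows]
        avg = sum(lengths) / len(lengths)
        scores[model] = abs(avg - target_len)
    return min(scores, key=scores.get)

rows = [
    {
        "claude-haiku": "Short take.",
        "claude-sonnet": "A sixty-character-ish summary of the article in question here.",
        "gpt-5-mini": "A much longer, more rambling summary that restates nearly the entire article text.",
    },
]
print(recommend(rows))
```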
Other Use Cases
- Dataset Discovery - Use natural language to search and discover datasets
- Data Transformation - Categorize and derive insights from unstructured text
- Patient Data Workflow - Extract, filter, and export structured medical data
- Quality Filtering - Remove low-quality responses from datasets