Deep Research: Comparing Model Performance

Multi-step AI workflow for comparing models head-to-head on a real dataset.

Overview

Run multiple models in parallel against the same inputs from your log set, compare their outputs side by side, and pick the best one. The same pattern works on any input column: agent traces, chat prompts, support tickets, or document corpora. This walkthrough uses the public financial-news-articles dataset as a concrete example, but you can substitute the user-message column from your own chat logs or the prompt column from your agent traces.

Demo showing deep research in Hyperparam

Steps

  1. Load the dataset

    Open financial-news-articles

  2. Give the model a research prompt

    Use chat to request: "Summarize each news article using claude-haiku, claude-sonnet, and gpt-5-mini. Then compare their summaries and explain how the models perform differently. Which model do you recommend for best quality?"
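Under the hood, this step fans each article out to all three models at once. The sketch below shows that fan-out pattern in plain Python using a thread pool. `call_model` is a hypothetical stub standing in for a real chat-completion call (Hyperparam's internal API is not public here); swap in your provider's SDK.

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["claude-haiku", "claude-sonnet", "gpt-5-mini"]

def call_model(model: str, article: str) -> str:
    # Hypothetical stand-in for a real chat-completion call.
    # Replace with your provider's client, e.g. an Anthropic-
    # or OpenAI-compatible SDK call with a "summarize" prompt.
    return f"[{model}] summary of: {article[:40]}"

def summarize_all(articles: list[str]) -> list[dict[str, str]]:
    """Fan each article out to every model in parallel and
    collect the three summaries side by side per row."""
    def summarize_one(article: str) -> dict[str, str]:
        with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
            futures = {m: pool.submit(call_model, m, article) for m in MODELS}
            return {m: f.result() for m, f in futures.items()}
    return [summarize_one(a) for a in articles]

rows = summarize_all(["Fed holds rates steady...", "Tech stocks rally..."])
```

Each element of `rows` maps model name to that model's summary of one article, which is exactly the side-by-side shape needed for comparison.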

Expected Results

  • Summary Comparison: An analysis of the strengths and weaknesses of each model's summaries
  • Model Recommendation: A recommendation for which model to use for best summary results
