Eval Overview

Eval is BenchGen’s benchmarking module. It gives you a structured, reproducible way to measure how well a model performs on a task — before you ship it inside an agent or invest in fine-tuning.

What Eval Does

Upload or connect a model, point it at a benchmark, and Eval runs the model against every test case. It scores each response, aggregates the results into a report, and flags the failure patterns that matter most. You get:
  • A per-question pass/fail breakdown
  • Aggregate accuracy, latency, and cost metrics
  • Exportable failure cases ready for fine-tuning
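The run-score-aggregate loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BenchGen's implementation or API: `TestCase`, `CaseResult`, `run_benchmark`, and `summarize` are hypothetical names, the model is assumed to be any callable that maps a prompt to a string, and scoring is assumed to be a case-insensitive exact match. Cost is left out because it depends on provider pricing.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    question: str
    expected: str


@dataclass
class CaseResult:
    question: str
    passed: bool
    latency_s: float


def run_benchmark(model: Callable[[str], str], cases: list[TestCase]) -> list[CaseResult]:
    """Run the model on every test case and score each response."""
    results = []
    for case in cases:
        start = time.perf_counter()
        answer = model(case.question)
        latency = time.perf_counter() - start
        # Assumed scoring rule: case-insensitive exact match.
        passed = answer.strip().lower() == case.expected.strip().lower()
        results.append(CaseResult(case.question, passed, latency))
    return results


def summarize(results: list[CaseResult]) -> dict:
    """Aggregate per-question results into a small report."""
    total = len(results)
    return {
        "accuracy": sum(r.passed for r in results) / total if total else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / total if total else 0.0,
        "failures": [r.question for r in results if not r.passed],
    }


# Toy stand-in for a real model: answers one question correctly.
def toy_model(question: str) -> str:
    return "4" if "2 + 2" in question else "London"


cases = [TestCase("What is 2 + 2?", "4"), TestCase("Capital of France?", "Paris")]
report = summarize(run_benchmark(toy_model, cases))
```

Here `report["failures"]` holds the questions the toy model got wrong, which is the kind of list the failure export builds on.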

When to Use Eval

Situation | What to do
You have a new base model and want a baseline | Run a benchmark before any fine-tuning
You’ve just finished a training run | Run the same benchmark again and compare
Your agent is returning bad answers | Export failing chat examples as a benchmark
You want to compare two models | Run both against the same benchmark and diff the results
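As a rough illustration of what comparing two runs on the same benchmark involves, the sketch below compares per-question pass/fail results. It assumes each run is available as a plain dictionary mapping a question ID to a pass flag; `diff_runs` and that data shape are hypothetical and not part of BenchGen's API.

```python
def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two runs of the same benchmark, question by question."""
    # Only questions present in both runs can be compared.
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "fixed": [q for q in shared if not baseline[q] and candidate[q]],
        "regressed": [q for q in shared if baseline[q] and not candidate[q]],
        "unchanged": [q for q in shared if baseline[q] == candidate[q]],
    }


before = {"q1": True, "q2": False, "q3": True}
after = {"q1": True, "q2": True, "q3": False}
delta = diff_runs(before, after)
```

Tracking regressions separately from fixes matters because two runs can have the same aggregate accuracy while failing on different questions.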

What Eval Hands Off

  • → Train: export failing examples as a labeled dataset to kick off a fine-tune.
  • → Agents: benchmark results tell you which model to connect to your agent.
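To make the Train hand-off concrete, here is a minimal sketch of turning failing cases into a labeled dataset. The JSONL layout with `prompt` and `completion` fields is an assumption chosen for illustration, as are the `failures_to_jsonl` helper and the input record shape. The actual export format is not specified on this page.

```python
import json


def failures_to_jsonl(results: list[dict]) -> str:
    """Serialize failing cases as one JSON object per line."""
    lines = []
    for r in results:
        if r["passed"]:
            continue  # keep only the cases the model got wrong
        # Assumed record shape: the question plus its expected answer as the label.
        lines.append(json.dumps({"prompt": r["question"], "completion": r["expected"]}))
    return "\n".join(lines)


results = [
    {"question": "What is 2 + 2?", "expected": "4", "passed": True},
    {"question": "Capital of France?", "expected": "Paris", "passed": False},
]
jsonl = failures_to_jsonl(results)
```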

Next Steps