Eval Overview

Eval is BenchGen’s benchmarking module. It gives you a structured, reproducible way to measure how well a model performs on a task — before you ship it inside an agent or invest in fine-tuning.

What Eval Does

Upload or connect a model, point it at a benchmark, and Eval runs the model against every test case. It scores each response, aggregates the results into a report, and flags the failure patterns that matter most. You get:

A per-question pass/fail breakdown
Aggregate accuracy, latency, and cost metrics
Exportable failure cases ready for fine-tuning
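
The run, score, and aggregate flow described above can be sketched in plain Python. This is only an illustration of the idea, not BenchGen's API: `TestCase`, `CaseResult`, `run_benchmark`, `summarize`, the exact-match scoring rule, and the flat `cost_per_call` figure are all hypothetical names and assumptions chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    question: str
    expected: str


@dataclass
class CaseResult:
    question: str
    expected: str
    response: str
    passed: bool
    latency_s: float


def run_benchmark(model: Callable[[str], str], cases: list[TestCase]) -> list[CaseResult]:
    """Run the model on every test case and score each response."""
    results = []
    for case in cases:
        start = time.perf_counter()
        response = model(case.question)
        latency = time.perf_counter() - start
        # Assumed scoring rule: case-insensitive exact match.
        passed = response.strip().lower() == case.expected.strip().lower()
        results.append(CaseResult(case.question, case.expected, response, passed, latency))
    return results


def summarize(results: list[CaseResult], cost_per_call: float = 0.0) -> dict:
    """Aggregate per-question results into accuracy, latency, and cost."""
    total = len(results)
    passed = sum(r.passed for r in results)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / total if total else 0.0,
        "total_cost": cost_per_call * total,  # assumed flat price per call
    }


def toy_model(question: str) -> str:
    """Stand-in for a real model: answers from a tiny lookup table."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")


cases = [
    TestCase("2 + 2", "4"),
    TestCase("capital of France", "paris"),
    TestCase("largest planet", "Jupiter"),
]
results = run_benchmark(toy_model, cases)
report = summarize(results, cost_per_call=0.001)
failures = [r for r in results if not r.passed]
```

Here `results` plays the role of the per-question breakdown, `report` the aggregate metrics, and `failures` the cases you would export. A real benchmark would usually need a richer scoring rule than exact match.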

When to Use Eval

Situation	What to do
You have a new base model and want a baseline	Run a benchmark before any fine-tuning
You’ve just finished a training run	Run the same benchmark again and compare
Your agent is returning bad answers	Export failing chat examples as a benchmark
You want to compare two models	Run both against the same benchmark and diff the results
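
Comparing two runs against the same benchmark, whether before and after training or across two models, comes down to a per-question diff. The sketch below assumes each run has been reduced to a mapping from question ID to pass/fail. The function name `diff_results` and the bucket names are hypothetical, not part of BenchGen.

```python
def diff_results(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket the questions present in both runs by how their outcome changed."""
    shared = baseline.keys() & candidate.keys()
    return {
        "fixed": sorted(q for q in shared if not baseline[q] and candidate[q]),
        "regressed": sorted(q for q in shared if baseline[q] and not candidate[q]),
        "still_failing": sorted(q for q in shared if not baseline[q] and not candidate[q]),
    }


# Illustrative pass/fail maps for two runs of the same benchmark.
before = {"q1": True, "q2": False, "q3": False, "q4": True}
after = {"q1": True, "q2": True, "q3": False, "q4": False}
delta = diff_results(before, after)
```

Looking at `regressed` alongside `fixed` matters because two runs can have identical aggregate accuracy while failing on different questions.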

What Eval Hands Off

→ Train: export failing examples as a labeled dataset to kick off a fine-tune.
→ Agents: benchmark results tell you which model to connect to your agent.
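
As a rough picture of the Train handoff, failing cases can be written out as labeled records, one JSON object per line. The record shape (`prompt` / `completion`), the input dictionaries, and the function name `export_failures` are assumptions made for this sketch; the format BenchGen actually exports may differ.

```python
import io
import json
from typing import TextIO


def export_failures(results: list[dict], out: TextIO) -> int:
    """Write each failing case as one JSON line; return how many were written."""
    count = 0
    for r in results:
        if r["passed"]:
            continue
        # Label the failing prompt with the expected answer.
        record = {"prompt": r["question"], "completion": r["expected"]}
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


# Illustrative results; in practice these would come from a benchmark run.
results = [
    {"question": "2 + 2", "expected": "4", "passed": True},
    {"question": "largest planet", "expected": "Jupiter", "passed": False},
]
buffer = io.StringIO()  # stands in for an open file
written = export_failures(results, buffer)
lines = buffer.getvalue().splitlines()
```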
