Eval Overview
Eval is BenchGen’s benchmarking module. It gives you a structured, reproducible way to measure how well a model performs on a task before you ship it inside an agent or invest in fine-tuning.
What Eval Does
Upload or connect a model, point it at a benchmark, and Eval runs the model against every test case. It scores each response, aggregates the results into a report, and flags the failure patterns that matter most. You get:
- A per-question pass/fail breakdown
- Aggregate accuracy, latency, and cost metrics
- Exportable failure cases ready for fine-tuning
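
To make the run, score, and aggregate flow concrete, here is a minimal sketch in plain Python. It is not BenchGen's API: `BenchmarkCase`, `run_benchmark`, `summarize`, and `toy_model` are hypothetical names, scoring is assumed to be a case-insensitive exact match, and cost tracking is left out because it depends on how a given model is billed.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkCase:
    """One test case: a question and the answer we expect (hypothetical shape)."""
    question: str
    expected: str


@dataclass
class CaseResult:
    """Per-question outcome: the model's answer, pass/fail, and latency."""
    question: str
    answer: str
    passed: bool
    latency_s: float


def run_benchmark(model: Callable[[str], str], cases: List[BenchmarkCase]) -> List[CaseResult]:
    """Run the model on every case and score each response by exact match (an assumption)."""
    results = []
    for case in cases:
        start = time.perf_counter()
        answer = model(case.question)
        latency = time.perf_counter() - start
        passed = answer.strip().lower() == case.expected.strip().lower()
        results.append(CaseResult(case.question, answer, passed, latency))
    return results


def summarize(results: List[CaseResult]) -> dict:
    """Aggregate per-case results into report-level metrics."""
    total = len(results)
    passed = sum(r.passed for r in results)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / total if total else 0.0,
        "failures": [r for r in results if not r.passed],
    }


def toy_model(question: str) -> str:
    """Stand-in for a real model: it only knows one answer."""
    return "4" if question == "2 + 2" else "unknown"


cases = [BenchmarkCase("2 + 2", "4"), BenchmarkCase("capital of France", "Paris")]
report = summarize(run_benchmark(toy_model, cases))
print(report["accuracy"])  # 0.5
```

Keeping the per-case results separate from the aggregate report is what makes the failure cases easy to pull out later.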
When to Use Eval
| Situation | What to do |
|---|---|
| You have a new base model and want a baseline | Run a benchmark before any fine-tuning |
| You’ve just finished a training run | Run the same benchmark again and compare |
| Your agent is returning bad answers | Export failing chat examples as a benchmark |
| You want to compare two models | Run both against the same benchmark and diff the results |
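
The "run again and compare" and "diff the results" rows come down to comparing per-question outcomes from two runs of the same benchmark. A rough sketch, assuming each run is available as a mapping from question ID to pass/fail (the `diff_runs` and `accuracy` helpers and the data are hypothetical, not an Eval export format):

```python
from typing import Dict, List


def accuracy(run: Dict[str, bool]) -> float:
    """Fraction of questions that passed in a run."""
    return sum(run.values()) / len(run) if run else 0.0


def diff_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify questions present in both runs by how their outcome changed."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [q for q in shared if not before[q] and after[q]],
        "regressed": [q for q in shared if before[q] and not after[q]],
        "still_failing": [q for q in shared if not before[q] and not after[q]],
    }


# Illustrative data only: two runs over the same four questions.
baseline = {"q1": True, "q2": False, "q3": False, "q4": True}
candidate = {"q1": True, "q2": True, "q3": False, "q4": False}

delta = diff_runs(baseline, candidate)
print(delta)
# {'fixed': ['q2'], 'regressed': ['q4'], 'still_failing': ['q3']}
print(accuracy(candidate) - accuracy(baseline))  # 0.0
```

In this toy data the aggregate accuracy is unchanged even though one question was fixed and another regressed, which is why a per-question diff is worth reading alongside the headline number.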
What Eval Hands Off
- → Train: export failing examples as a labeled dataset to kick off a fine-tune.
- → Agents: benchmark results tell you which model to connect to your agent.
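
As an illustration of the Train hand-off, failing cases can be turned into labeled records, with the expected answer as the label. The sketch below assumes a JSON Lines layout with `prompt` and `completion` fields; the field names, the input record shape, and `failures_to_jsonl` are hypothetical, so check them against the format your fine-tuning step actually expects.

```python
import json
from typing import Dict, Iterable


def failures_to_jsonl(results: Iterable[Dict]) -> str:
    """Keep only failing cases and emit one labeled JSON record per line."""
    lines = []
    for r in results:
        if r["passed"]:
            continue
        # Label each failing prompt with the expected answer, not the model's answer.
        record = {"prompt": r["question"], "completion": r["expected"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Illustrative per-question results (assumed shape).
results = [
    {"question": "2 + 2", "expected": "4", "model_answer": "4", "passed": True},
    {"question": "capital of France", "expected": "Paris", "model_answer": "unknown", "passed": False},
]

dataset = failures_to_jsonl(results)
print(dataset)
# {"prompt": "capital of France", "completion": "Paris"}
```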