Read Results

After a benchmark run completes, Eval generates a structured results report. This page explains what each section means and how to use it.

Results Report Structure

Summary metrics

Metric	What it means
Accuracy	Percentage of test cases where the model’s response matched the expected answer
Avg latency	Mean response time per question in milliseconds
Avg cost	Mean token cost per question (API models only)
Pass / Fail	Count of passed and failed cases
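To make the metric definitions concrete, here is a minimal sketch of how the summary values could be recomputed from per-case records. The record fields (`passed`, `latency_ms`, `cost_usd`) and the `summarize` helper are hypothetical names chosen for illustration, not Eval's actual export schema; a cost of `None` stands in for a non-API model.

```python
from statistics import mean

# Hypothetical per-case records; field names are assumptions, not Eval's schema.
cases = [
    {"passed": True, "latency_ms": 420.0, "cost_usd": 0.0012},
    {"passed": False, "latency_ms": 610.0, "cost_usd": 0.0018},
    {"passed": True, "latency_ms": 380.0, "cost_usd": 0.0010},
    {"passed": True, "latency_ms": 590.0, "cost_usd": 0.0020},
]


def summarize(cases):
    """Recompute the four summary metrics from a list of per-case records."""
    passed = sum(1 for c in cases if c["passed"])
    # Cost applies to API models only, so skip records without a cost.
    costs = [c["cost_usd"] for c in cases if c["cost_usd"] is not None]
    return {
        "accuracy_pct": 100.0 * passed / len(cases),
        "avg_latency_ms": mean(c["latency_ms"] for c in cases),
        "avg_cost_usd": mean(costs) if costs else None,
        "passed": passed,
        "failed": len(cases) - passed,
    }


summary = summarize(cases)
print(summary)
```

With these sample records, three of four cases pass, so accuracy is 75% and the mean latency is 500 ms.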

Per-question breakdown

Each test case shows:

The input prompt
The model’s response
The expected answer
Pass / Fail status
Latency and token usage
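If you work with the per-question data outside the report, one way to model a test case is a small record type holding the five items listed above. This is only a sketch: the `CaseResult` class and its field names are assumptions for illustration and may not match how Eval stores or exports results.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Field names are illustrative assumptions, not Eval's actual schema.
    prompt: str         # the input prompt
    response: str       # the model's response
    expected: str       # the expected answer
    passed: bool        # Pass / Fail status
    latency_ms: float   # response time for this case
    tokens_used: int    # token usage for this case


results = [
    CaseResult("What is 2 + 2?", "4", "4", True, 310.0, 12),
    CaseResult("Capital of France?", "Lyon", "Paris", False, 450.0, 15),
    CaseResult("Spell 'cat' backwards.", "tac", "tac", True, 290.0, 14),
]

# Pull out the failing cases and show response vs. expected side by side.
failures = [r for r in results if not r.passed]
for r in failures:
    print(f"FAIL: {r.prompt!r} -> got {r.response!r}, expected {r.expected!r}")
```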

Failure analysis

Eval groups failing cases by error pattern (wrong format, factual error, refusal, hallucination) so you can see which patterns account for the most failures and decide what to fix first.
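The grouping idea can be sketched in a few lines, assuming each failing case already carries a pattern label. How Eval assigns those labels is not described here; the `pattern` field and the sample data below are hypothetical.

```python
from collections import Counter

# Hypothetical failing cases, each already labeled with an error pattern.
failing = [
    {"id": "q03", "pattern": "wrong format"},
    {"id": "q07", "pattern": "hallucination"},
    {"id": "q11", "pattern": "wrong format"},
    {"id": "q12", "pattern": "refusal"},
    {"id": "q18", "pattern": "wrong format"},
    {"id": "q21", "pattern": "hallucination"},
]

# Count cases per pattern and rank the patterns from most to least frequent.
counts = Counter(case["pattern"] for case in failing)
ranked = counts.most_common()

for pattern, n in ranked:
    share = 100.0 * n / len(failing)
    print(f"{pattern}: {n} cases ({share:.0f}% of failures)")
```

Ranking by frequency is only one way to judge impact; you may prefer to weight patterns by how costly each kind of failure is for your use case.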

Comparing Runs

Select two or more runs from the run history to view a side-by-side diff. This is useful for measuring improvement after a fine-tune.
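As a rough illustration of what a two-run comparison measures, the sketch below diffs pass/fail status per test case. It assumes each run is available as a mapping from a case ID to a pass flag; the `diff_runs` helper, the IDs, and the data are all hypothetical and do not reflect Eval's own diff view.

```python
# Hypothetical pass/fail results keyed by test case ID.
baseline = {"q1": True, "q2": False, "q3": False, "q4": True}
fine_tuned = {"q1": True, "q2": True, "q3": True, "q4": False}


def diff_runs(before, after):
    """Compare two runs over the test cases they have in common."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("runs share no test cases")

    def accuracy(run):
        return 100.0 * sum(run[q] for q in shared) / len(shared)

    return {
        # Failed before, passes now.
        "fixed": [q for q in shared if not before[q] and after[q]],
        # Passed before, fails now.
        "regressed": [q for q in shared if before[q] and not after[q]],
        "accuracy_delta_pct": accuracy(after) - accuracy(before),
    }


delta = diff_runs(baseline, fine_tuned)
print(delta)
```

Listing regressions separately matters because a net accuracy gain can hide cases that the fine-tune broke.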

Next Steps

Export failing cases to Train
