Read Results

After a benchmark run completes, Eval generates a structured results report. This page explains what each section means and how to use it.

Results Report Structure

Summary metrics

Metric        What it means
Accuracy      Percentage of test cases where the model’s response matched the expected answer
Avg latency   Mean response time per question in milliseconds
Avg cost      Mean token cost per question (API models only)
Pass / Fail   Count of passed and failed cases
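
To show how the summary metrics relate to the per-question results, the sketch below recomputes them from a list of case records. The record fields (`passed`, `latency_ms`, `cost_usd`) and the `summarize` function are hypothetical names chosen for illustration, not Eval's documented export format. Cost is treated as optional because it is reported for API models only.

```python
from statistics import mean
from typing import Optional, TypedDict


class CaseResult(TypedDict):
    # Hypothetical record shape; Eval's actual export format may differ.
    passed: bool
    latency_ms: float
    cost_usd: Optional[float]  # None when no cost is reported (non-API models)


def summarize(cases: list[CaseResult]) -> dict:
    """Recompute the summary metrics from per-question results."""
    if not cases:
        raise ValueError("no cases to summarize")
    passed = sum(1 for c in cases if c["passed"])
    # Average cost only over cases that actually report a cost.
    costs = [c["cost_usd"] for c in cases if c["cost_usd"] is not None]
    return {
        "accuracy_pct": 100.0 * passed / len(cases),
        "avg_latency_ms": mean(c["latency_ms"] for c in cases),
        "avg_cost_usd": mean(costs) if costs else None,
        "passed": passed,
        "failed": len(cases) - passed,
    }


# Made-up sample data, for illustration only.
cases: list[CaseResult] = [
    {"passed": True, "latency_ms": 400.0, "cost_usd": 0.002},
    {"passed": False, "latency_ms": 600.0, "cost_usd": 0.004},
    {"passed": True, "latency_ms": 500.0, "cost_usd": 0.003},
    {"passed": True, "latency_ms": 500.0, "cost_usd": 0.003},
]
summary = summarize(cases)
print(summary)
```

Note that accuracy and the Pass / Fail counts carry the same information in different forms: accuracy is the passed count divided by the total number of cases.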

Per-question breakdown

Each test case shows:
  • The input prompt
  • The model’s response
  • The expected answer
  • Pass / Fail status
  • Latency and token usage

Failure analysis

Eval groups failing cases by error pattern (wrong format, factual error, refusal, hallucination) to help you identify the most impactful issues to fix.
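
The grouping described above can be approximated offline if you have the failed cases in hand. This is a minimal sketch, assuming each failed case carries an `error_pattern` label. That field name, the `rank_error_patterns` helper, and the sample records are all hypothetical, and how Eval assigns patterns internally is not covered here.

```python
from collections import Counter

# Hypothetical failed-case records; "error_pattern" is an assumed field name.
failed_cases = [
    {"id": "q1", "error_pattern": "wrong format"},
    {"id": "q4", "error_pattern": "hallucination"},
    {"id": "q7", "error_pattern": "wrong format"},
    {"id": "q9", "error_pattern": "refusal"},
    {"id": "q12", "error_pattern": "wrong format"},
]


def rank_error_patterns(cases):
    """Return (pattern, count) pairs, most frequent pattern first."""
    return Counter(c["error_pattern"] for c in cases).most_common()


ranking = rank_error_patterns(failed_cases)
for pattern, count in ranking:
    print(f"{pattern}: {count}")
```

Sorting by count puts the pattern that accounts for the most failures at the top, which is usually the first one worth fixing.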

Comparing Runs

Select two or more runs from the run history to view a side-by-side diff. This is useful for measuring improvement after a fine-tune.
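
As an illustration of what a run-to-run diff can surface, the sketch below compares pass/fail status per test case across two runs and lists the cases that changed. The `diff_runs` function, the case IDs, and the dictionary layout are assumptions made for this example, not Eval's API.

```python
def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare pass/fail status per case ID across two runs.

    Only cases present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    fixed = [cid for cid in shared if not baseline[cid] and candidate[cid]]
    regressed = [cid for cid in shared if baseline[cid] and not candidate[cid]]
    return {
        "fixed": fixed,          # failed before, pass now
        "regressed": regressed,  # passed before, fail now
        "net_change": len(fixed) - len(regressed),
    }


# Hypothetical pass/fail maps, e.g. before and after a fine-tune.
baseline = {"q1": False, "q2": True, "q3": False, "q4": True}
candidate = {"q1": True, "q2": True, "q3": True, "q4": False}

result = diff_runs(baseline, candidate)
print(result)
```

Tracking regressions separately from fixes matters because a higher overall accuracy can still hide cases that the newer run now gets wrong.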
