Read Results - BenchGen

After a benchmark run completes, Eval generates a structured results report. This page explains what each section means and how to use it.

Results Report Structure

Summary metrics

Metric	What it means
Accuracy	Percentage of test cases where the model’s response matched the expected answer
Avg latency	Mean response time per question in milliseconds
Avg cost	Mean token cost per question (API models only)
Pass / Fail	Count of passed and failed cases
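
As a rough illustration of how these summary figures relate to the individual test cases, the sketch below recomputes them from a list of per-case records. The `CaseResult` fields (`passed`, `latency_ms`, `cost`) and the `summarize` function are hypothetical names chosen for this example; they are not Eval's actual export schema.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class CaseResult:
    # Hypothetical record shape, assumed for illustration only.
    passed: bool
    latency_ms: float
    cost: Optional[float] = None  # None when no API cost applies


def summarize(cases):
    """Recompute the summary metrics from per-case results."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for c in cases if c.passed)
    # Average cost only over cases that actually report a cost.
    costs = [c.cost for c in cases if c.cost is not None]
    return {
        "accuracy_pct": 100.0 * passed / len(cases),
        "avg_latency_ms": mean(c.latency_ms for c in cases),
        "avg_cost": mean(costs) if costs else None,
        "passed": passed,
        "failed": len(cases) - passed,
    }
```

Returning `None` for the average cost when no case reports one mirrors the "API models only" note in the table, under the assumption that non-API runs simply carry no cost value.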

Per-question breakdown

Each test case shows:

The input prompt
The model’s response
The expected answer
Pass / Fail status
Latency and token usage

Failure analysis

Eval groups failing cases by error pattern (wrong format, factual error, refusal, hallucination) to help you identify the most impactful issues to fix.
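
To make the grouping idea concrete, here is a minimal sketch that ranks error patterns by how many failing cases fall under each. How Eval assigns a pattern to a case is not described on this page, so the sketch assumes each case already carries a label; the `error_pattern` and `passed` keys and the `rank_failure_patterns` function are hypothetical.

```python
from collections import Counter


def rank_failure_patterns(cases):
    """Count failing cases per error pattern, most frequent first.

    Each case is a dict with hypothetical keys "passed" (bool) and
    "error_pattern" (str); passing cases are ignored.
    """
    counts = Counter(c["error_pattern"] for c in cases if not c["passed"])
    return counts.most_common()
```

The pattern at the head of the returned list covers the most failing cases, which makes it a reasonable first candidate to fix.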

Comparing Runs

Select two or more runs from the run history to view a side-by-side diff. This is useful for measuring improvement after a fine-tune.
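
If you want to reproduce a similar comparison outside the UI, the sketch below diffs two runs by pass/fail status. It assumes each run can be represented as a mapping from a test case ID to a boolean pass flag; that representation and the `diff_runs` function are assumptions made for this example, not a documented Eval export format.

```python
def diff_runs(baseline, candidate):
    """Compare pass/fail status for cases present in both runs.

    Both arguments map a case ID to True (pass) or False (fail).
    Cases that appear in only one run are skipped.
    """
    shared = baseline.keys() & candidate.keys()
    fixed = sorted(q for q in shared if not baseline[q] and candidate[q])
    regressed = sorted(q for q in shared if baseline[q] and not candidate[q])
    return {"fixed": fixed, "regressed": regressed}
```

Listing regressions separately from fixes matters after a fine-tune: overall accuracy can rise while some previously passing cases start to fail.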

Next Steps

Export failing cases to Train
