Results Report Structure
Summary metrics
| Metric | What it means |
|---|---|
| Accuracy | Percentage of test cases where the model’s response matched the expected answer |
| Avg latency | Mean response time per question, in milliseconds |
| Avg cost | Mean token cost per question (API models only) |
| Pass / Fail | Count of passed and failed cases |
Per-question breakdown
Each test case shows:

- The input prompt
- The model’s response
- The expected answer
- Pass / Fail status
- Latency and token usage
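
As a rough illustration of how the per-question fields could roll up into the summary metrics, here is a minimal Python sketch. The `CaseResult` class, its field names, and the `summarize` function are hypothetical and not part of the tool's actual output format. It also assumes that cases without an API cost carry `cost=None` and are left out of the cost average, which the report description does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class CaseResult:
    """One row of the per-question breakdown (hypothetical field names)."""
    prompt: str
    response: str
    expected: str
    passed: bool
    latency_ms: float
    tokens: int
    cost: Optional[float] = None  # assumption: None when no API cost applies


def summarize(results: list[CaseResult]) -> dict:
    """Aggregate per-question results into the summary metrics."""
    if not results:
        raise ValueError("no test cases to summarize")
    passed = sum(r.passed for r in results)
    # Assumption: only cases that report a cost count toward the average.
    costs = [r.cost for r in results if r.cost is not None]
    return {
        "accuracy_pct": 100.0 * passed / len(results),
        "avg_latency_ms": mean(r.latency_ms for r in results),
        "avg_cost": mean(costs) if costs else None,
        "passed": passed,
        "failed": len(results) - passed,
    }


# Made-up sample data, for illustration only.
results = [
    CaseResult("2 + 2?", "4", "4", True, 120.0, 15, 0.002),
    CaseResult("Capital of France?", "Lyon", "Paris", False, 80.0, 12, 0.004),
]
summary = summarize(results)
```

With these two sample cases, `summary["accuracy_pct"]` is 50.0 and `summary["avg_latency_ms"]` is 100.0. Keeping the raw pass/fail counts next to the percentage makes it easier to judge accuracy on small test sets.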