Once a model is running as an inference endpoint, you can benchmark it against any environment in Eval. BenchGen runs the model over every test case in the environment, scores each response, and produces a results report with downloadable predictions and a breakdown for every item.
Need a running model first? This guide assumes you already have a live endpoint. If you don’t, follow Deploy an inference model and come back once its status reads running.

Prerequisites


Steps

1. Open a benchmark

Open the environment you want to evaluate against. The environment page has an Overview tab, which describes what the benchmark measures and how submissions are scored, alongside tabs for Phases, Leaderboard, Evaluations, and Evaluate. To start, click Evaluate in the top right corner, or open the Evaluate tab.
The GSM8K-TR benchmark overview page with the Evaluate button

2. Choose a model source

The Evaluate tab opens with “Select a model to evaluate.” Models are grouped by source. Pick the tab that matches where your model lives:
| Source | Use when |
| --- | --- |
| Platform Models | The model is already published or deployed on BenchGen. |
| Running | The model is a live endpoint you started with Run Inference. |
| Trained | The model is a checkpoint you fine-tuned in Train. |
| HuggingFace | You want to pull a public model from the HuggingFace Hub. |
The Evaluate tab showing the four model source tabs

3. Select your running model

Since you just deployed an endpoint, click the Running tab. It lists every model that is currently live. Find the one you deployed; it shows a green running badge.
The Running tab listing the live demoaccount-gemma4 endpoint
Click the model to select it. A checkmark appears and the Run Evaluation button becomes active.
The running model selected, with Run Evaluation now enabled

4. Run the evaluation

Click Run Evaluation. BenchGen creates an evaluation run, generates a submission for the selected environment, and starts running your model against the test cases.

5. Monitor progress

The run opens to a live log view. Status messages stream as the run progresses: it loads the benchmark data, runs the model on each item, and reports progress like Processing: 10/100 (10%).
The evaluation running with live logs and the evaluation details panel
The Evaluation details panel on the right summarizes the run:
| Field | Meaning |
| --- | --- |
| Status | Running while in progress, Completed when finished. |
| Model | The model being evaluated. |
| Environment | The benchmark the model is scored against. |
| Created | When the run started. |
| Submission | The generated submission archive for this run. |
| Run ID | Unique identifier for the run, for example #678. |
As the run nears completion, the logs show predictions being generated and the final score being computed, for example accuracy=26.00% correct=26/100.
Completed logs showing generated predictions and the computed accuracy
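If you copy the log text out of the browser, the two line shapes quoted above are easy to parse. The following is a minimal sketch that assumes the log lines look exactly like those examples; the function names are hypothetical, and the actual log format may differ between environments.

```python
import re

# Assumption: log lines match the examples quoted in this guide,
# "Processing: 10/100 (10%)" and "accuracy=26.00% correct=26/100".
PROGRESS_RE = re.compile(r"Processing:\s*(\d+)/(\d+)\s*\((\d+(?:\.\d+)?)%\)")
SCORE_RE = re.compile(r"accuracy=(\d+(?:\.\d+)?)%\s+correct=(\d+)/(\d+)")


def parse_progress(line: str):
    """Return (done, total) from a progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_score(line: str):
    """Return (accuracy_percent, correct, total) from a score line, or None."""
    match = SCORE_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


print(parse_progress("Processing: 10/100 (10%)"))      # (10, 100)
print(parse_score("accuracy=26.00% correct=26/100"))   # (26.0, 26, 100)
```

In the score example, 26 correct answers out of 100 items is what yields the reported 26.00% accuracy.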

6. Review scores and results

When the run finishes, the status changes to Completed and a Score Breakdown replaces the live logs.
The Score Breakdown and detailed results for the completed run
The headline Overall Score sits at the top, followed by the individual metrics:
| Metric | Meaning |
| --- | --- |
| Overall Score | The headline score for the run. |
| accuracy | Percentage of items answered correctly. |
| exact match | Percentage of responses that matched the expected answer exactly. |
| answer rate | Percentage of items the model produced any answer for. |
| duration | Wall-clock time for the full run. |
Below the metrics, Detailed Results shows a table with one row per test case. Each row lists the item ID, the gold (expected) answer, the model’s prediction, and whether it was correct.
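To see how the three percentage metrics relate to the per-item rows, here is a small sketch that recomputes them from a handful of made-up rows. The `ResultRow` type, its field names, and the sample data are assumptions for illustration. The environment's own scoring program defines the real metrics and may normalize answers differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultRow:
    # Hypothetical mirror of one Detailed Results row.
    item_id: str
    gold: str
    prediction: Optional[str]  # None or "" when the model gave no answer
    correct: bool              # verdict taken from the results table


def summarize(rows):
    """Recompute accuracy, exact match, and answer rate as percentages."""
    total = len(rows)
    if total == 0:
        return {"accuracy": 0.0, "exact_match": 0.0, "answer_rate": 0.0}
    correct = sum(1 for r in rows if r.correct)
    exact = sum(1 for r in rows if r.prediction is not None and r.prediction == r.gold)
    answered = sum(1 for r in rows if r.prediction not in (None, ""))
    return {
        "accuracy": 100.0 * correct / total,
        "exact_match": 100.0 * exact / total,
        "answer_rate": 100.0 * answered / total,
    }


# Illustrative rows only, not output from a real run.
rows = [
    ResultRow("q1", "42", "42", True),
    ResultRow("q2", "7", "7.0", True),    # judged correct, but not an exact string match
    ResultRow("q3", "15", "12", False),
    ResultRow("q4", "3", None, False),    # no answer produced
]
summary = summarize(rows)
print(summary)
```

With these four rows, accuracy is 50%, exact match is 25%, and answer rate is 75%, which shows how the three metrics can diverge on the same run.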

Download the artifacts

The Files section in the panel on the right lets you download everything the run produced:
| File | Contents |
| --- | --- |
| Submission File | The original archive submitted to the benchmark. |
| Prediction Output | The model’s raw predictions for every item. |
| Scoring Output | The output emitted by the environment’s scoring program. |
| Detailed Results | The full breakdown for every item. View it in the browser or download it as HTML or JSON. |
| Generated dataset | When the environment emits one, a fine-tune dataset you can open with View fine-tune dataset and carry into Train. |
Failing cases captured in the generated dataset are exactly what you feed into a fine-tune. See Export datasets to Train to turn this run’s misses into your next training set.
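If you download Detailed Results as JSON, you can also inspect the misses locally. The sketch below assumes the file is a JSON array of objects with `id`, `gold`, `prediction`, and `correct` keys. That schema is a guess based on the columns described above, not a documented format, so adjust the key names to match your actual download.

```python
import json

# Assumption: a stand-in for the downloaded Detailed Results JSON.
# The real file's structure and key names may differ.
SAMPLE_JSON = """
[
  {"id": "item-1", "gold": "42", "prediction": "42", "correct": true},
  {"id": "item-2", "gold": "15", "prediction": "12", "correct": false},
  {"id": "item-3", "gold": "3",  "prediction": "",   "correct": false}
]
"""


def failing_cases(detailed_results_json: str):
    """Return the rows that were not marked correct."""
    rows = json.loads(detailed_results_json)
    return [row for row in rows if not row.get("correct", False)]


misses = failing_cases(SAMPLE_JSON)
for row in misses:
    print(row["id"], "expected", row["gold"], "got", repr(row["prediction"]))
```

For a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the string to `failing_cases`.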

Next Steps