Eval is BenchGen’s evaluation module. It gives you a structured, reproducible way to measure how well a model performs on a task — before you ship it inside an agent or invest in fine-tuning. The core primitive in Eval is an environment: a self-contained package of test cases, an evaluation harness, and scoring rules. You pick an environment, point it at a model, and Eval does the rest.

What Is an Environment?

An environment bundles everything needed to run a reproducible evaluation:
| Component | What it provides |
| --- | --- |
| Dataset | The test cases: inputs and expected outputs |
| Harness | The execution logic: how to run the model on each case |
| Scoring rules | How a response is judged: exact match, LLM-as-judge, or a custom rubric |
Because all three components travel together, you can share, version, and re-run an environment exactly — on any model, at any point in time.
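
To make the three components concrete, here is a minimal Python sketch of how an environment could be modeled. It is not BenchGen's actual schema or API: the `Case` and `Environment` classes, the `single_turn_harness` and `exact_match` functions, and the toy model are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; not BenchGen's real schema.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Case:
    """One test case: an input and its expected output."""
    input: str
    expected: str


def single_turn_harness(model: Model, case: Case) -> str:
    """Harness: run the model once on the case input."""
    return model(case.input)


def exact_match(response: str, expected: str) -> bool:
    """Scoring rule: pass only if the trimmed strings are identical."""
    return response.strip() == expected.strip()


@dataclass
class Environment:
    """Dataset, harness, and scoring rules bundled under one name and version."""
    name: str
    version: str
    dataset: List[Case]
    harness: Callable[[Model, Case], str]
    score: Callable[[str, str], bool]

    def run(self, model: Model) -> List[bool]:
        """Return one pass/fail flag per test case, in dataset order."""
        return [
            self.score(self.harness(model, case), case.expected)
            for case in self.dataset
        ]


toy_env = Environment(
    name="toy-arithmetic",
    version="0.1.0",
    dataset=[Case("2+2", "4"), Case("3*3", "9")],
    harness=single_turn_harness,
    score=exact_match,
)


def toy_model(prompt: str) -> str:
    """Stand-in model that always answers "4"."""
    return "4"


results = toy_env.run(toy_model)
print(results)
```

Because the model is just a callable passed to `run`, the same `toy_env` object can be re-run unchanged against any other model, which is the property the paragraph above describes.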

Why Environments?

The current ecosystem for model evaluation has a few persistent problems we built BenchGen to address:
  • No shared platform for eval environments. Popular suites like lm_eval, lighteval, and HELM cover single-turn Q&A well, but lack support for agentic tasks or evaluations that require real infrastructure — think tool-use benchmarks, multi-step reasoning chains, or code execution. The result is a proliferation of independent eval repos with no shared spec.
  • Evals and RL environments share the same structure, but are treated as separate things. Both consist of a dataset, a harness, and scoring rules. Treating them differently creates duplicated work and fragmented tooling.
  • Environment implementations are hard to reuse. Most eval setups are tightly coupled to a specific framework or repo structure, making them difficult to adapt, version, or share.
The BenchGen Environments Hub addresses these problems by treating environments as first-class, versioned packages: each one ships with its own dataset, harness, and scoring rules, and can be run against any model without modification.

The Environments Hub

The Environments Hub is BenchGen’s catalog of ready-to-use evaluation environments, covering common task categories including instruction following, code generation, reasoning, tool use, and domain-specific knowledge. Browse the hub, pick the environment that fits your task, and run it against your model in one click. You can also upload a custom environment if your use case isn’t covered.

Creating Your Own Environment

If the hub doesn’t have what you need, you can package your own evaluation tasks as a .zip bundle and upload it directly. BenchGen validates the bundle and makes your environment available for runs immediately.
Ready to build one? See Create a custom environment for a step-by-step walkthrough.
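
As a rough illustration of the packaging step, the Python sketch below builds a .zip archive in memory using only the standard library. The bundle layout is an assumption: the file names `manifest.json` and `dataset.jsonl`, the manifest keys, and the `build_bundle` helper are hypothetical, so follow the walkthrough for the structure BenchGen actually validates.

```python
import io
import json
import zipfile
from typing import Dict, List


def build_bundle(name: str, version: str, cases: List[Dict[str, str]],
                 scoring: str = "exact_match") -> bytes:
    """Pack a manifest and a dataset into an in-memory .zip archive.

    Hypothetical layout: file names and manifest keys are illustrative only.
    """
    manifest = {"name": name, "version": version, "scoring": scoring}
    # One JSON object per line, a common format for test-case datasets.
    dataset = "".join(json.dumps(case) + "\n" for case in cases)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        archive.writestr("dataset.jsonl", dataset)
    return buffer.getvalue()


bundle = build_bundle(
    name="toy-arithmetic",
    version="0.1.0",
    cases=[{"input": "2+2", "expected": "4"}, {"input": "3*3", "expected": "9"}],
)

# Re-open the archive to check what it contains before uploading.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    bundle_files = sorted(archive.namelist())
    bundle_manifest = json.loads(archive.read("manifest.json"))

print(bundle_files)
```

Writing `bundle` to a file opened in binary mode produces the .zip you would upload.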

What Eval Does

Select an environment and a model, then click Run. Eval executes the model against every test case in the environment, scores each response, and aggregates the results into a report. You get:
  • A per-test-case pass/fail breakdown
  • Aggregate accuracy, latency, and cost metrics
  • Exportable failure cases ready for fine-tuning
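
The aggregation step can be sketched in a few lines of Python. This is an assumption about the shape of the data, not Eval's actual report format: the `CaseResult` fields, the `summarize` function, and the metric names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical per-case record: outcome plus latency and cost."""
    case_id: str
    passed: bool
    latency_s: float
    cost_usd: float


def summarize(results: List[CaseResult]) -> Dict[str, object]:
    """Aggregate per-case results into accuracy, latency, and cost figures."""
    if not results:
        raise ValueError("cannot summarize an empty run")
    return {
        # Fraction of cases that passed.
        "accuracy": sum(r.passed for r in results) / len(results),
        "mean_latency_s": mean([r.latency_s for r in results]),
        "total_cost_usd": sum(r.cost_usd for r in results),
        # IDs of failing cases, for example to export for fine-tuning.
        "failures": [r.case_id for r in results if not r.passed],
    }


run = [
    CaseResult("q1", True, 0.5, 0.25),
    CaseResult("q2", False, 1.5, 0.25),
    CaseResult("q3", True, 1.0, 0.5),
    CaseResult("q4", True, 1.0, 0.0),
]

report = summarize(run)
print(report)
```

For this toy run, three of four cases pass, so `report["accuracy"]` is 0.75 and `report["failures"]` contains only `"q2"`.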

When to Use Eval

| Situation | What to do |
| --- | --- |
| You have a new base model and want a baseline | Pick an environment from the hub and run it before any fine-tuning |
| You’ve just finished a training run | Re-run the same environment and compare scores |
| Your agent is returning bad answers | Export failing cases as a dataset and kick off a fine-tune |
| You want to compare two models | Run both against the same environment and diff the results |
| Your task isn’t in the hub | Upload a custom environment with your own dataset and scoring rules |
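
Comparing two runs of the same environment, whether before and after training or across two models, comes down to diffing per-case outcomes. The Python sketch below assumes each run is available as a mapping from test-case ID to pass/fail; that representation and the `diff_runs` function are hypothetical, not part of Eval's export format.

```python
from typing import Dict, List


def diff_runs(baseline: Dict[str, bool],
              candidate: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify test cases present in both runs by how their outcome changed."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        # Failed in the baseline, passes in the candidate.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
        # Passed in the baseline, fails in the candidate.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        # Same outcome in both runs.
        "unchanged": [c for c in shared if baseline[c] == candidate[c]],
    }


before = {"q1": True, "q2": False, "q3": True}
after = {"q1": True, "q2": True, "q3": False}

changes = diff_runs(before, after)
print(changes)
```

Looking at `fixed` and `regressed` separately matters because two runs can have identical aggregate accuracy, as in this toy example, while passing different cases.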

Next Steps