What Is an Environment?
An environment bundles everything needed to run a reproducible evaluation:

| Component | What it provides |
|---|---|
| Dataset | The test cases — inputs and expected outputs |
| Harness | The execution logic — how to run the model on each case |
| Scoring rules | How a response is judged — exact match, LLM-as-judge, custom rubric |
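
To make the three components concrete, here is a minimal sketch of an environment as a plain Python data structure. Every name in it (`EvalCase`, `Environment`, `single_turn_harness`, `exact_match`) is a hypothetical illustration, not BenchGen's actual API or bundle schema. The model is assumed to be any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One row of the dataset: an input and its expected output."""
    input: str
    expected: str


# Hypothetical type aliases for illustration only.
Model = Callable[[str], str]                  # prompt -> response
Harness = Callable[[Model, EvalCase], str]    # how to run the model on one case
Scorer = Callable[[str, EvalCase], bool]      # how a response is judged


@dataclass
class Environment:
    """Dataset + harness + scoring rules, bundled under one name."""
    name: str
    dataset: List[EvalCase]
    harness: Harness
    scorer: Scorer


def single_turn_harness(model: Model, case: EvalCase) -> str:
    """Simplest possible harness: call the model once on the case input."""
    return model(case.input)


def exact_match(response: str, case: EvalCase) -> bool:
    """Exact-match scoring, ignoring surrounding whitespace."""
    return response.strip() == case.expected.strip()


arithmetic_env = Environment(
    name="toy-arithmetic",
    dataset=[EvalCase("2 + 2", "4"), EvalCase("3 * 5", "15")],
    harness=single_turn_harness,
    scorer=exact_match,
)
```

Swapping `exact_match` for an LLM-as-judge or rubric function would change the scoring rules without touching the dataset or the harness, which is the point of keeping the three parts separate.
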
Why Environments?
The current ecosystem for model evaluation has a few persistent problems we built BenchGen to address:

- No shared platform for eval environments. Popular suites like lm_eval, lighteval, and HELM cover single-turn Q&A well, but lack support for agentic tasks or evaluations that require real infrastructure (think tool-use benchmarks, multi-step reasoning chains, or code execution). The result is a proliferation of independent eval repos with no shared spec.
- Evals and RL environments are the same thing, but treated as separate. Both are just a dataset, a harness, and scoring rules. Treating them differently creates duplicated work and fragmented tooling.
- Environment implementations are hard to reuse. Most eval setups are tightly coupled to a specific framework or repo structure, making them difficult to adapt, version, or share.
The Environments Hub
The Environments Hub is BenchGen’s catalog of ready-to-use evaluation environments, covering common task categories including instruction following, code generation, reasoning, tool use, and domain-specific knowledge. Browse the hub, pick the environment that fits your task, and run it against your model in one click. You can also upload a custom environment if your use case isn’t covered.
Creating your own environment
If the hub doesn’t have what you need, you can package your own evaluation tasks as a .zip bundle and upload it directly. BenchGen validates the bundle and makes your environment available for runs immediately.
Ready to build one? See Create a custom environment for a step-by-step walkthrough.
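
As a rough illustration of the packaging step, the sketch below writes a dataset and a small manifest into a .zip archive using Python's standard `zipfile` module. The file names (`dataset.jsonl`, `manifest.json`) and the manifest fields are assumptions made up for this example. The layout BenchGen actually validates is described in the walkthrough, not here.

```python
import io
import json
import zipfile


def build_bundle(cases, scoring="exact_match"):
    """Return the bytes of a .zip archive with a HYPOTHETICAL bundle layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Assumed file name: one JSON object per line, one line per test case.
        lines = "\n".join(json.dumps(case) for case in cases) + "\n"
        zf.writestr("dataset.jsonl", lines)
        # Assumed file name and fields: a small description of the bundle.
        manifest = {"scoring": scoring, "num_cases": len(cases)}
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buf.getvalue()


bundle = build_bundle([{"input": "2 + 2", "expected": "4"}])

# To write the archive to disk before uploading:
# with open("my-environment.zip", "wb") as f:
#     f.write(bundle)
```
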
What Eval Does
Select an environment and a model, then click Run. Eval executes the model against every test case in the environment, scores each response, and aggregates the results into a report. You get:

- A per-question pass/fail breakdown
- Aggregate accuracy, latency, and cost metrics
- Exportable failure cases ready for fine-tuning
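
The run, score, and aggregate flow described above can be sketched in a few lines. This is a simplified stand-in under stated assumptions, not how Eval is implemented: `run_eval`, `summarize`, and `failures_as_records` are hypothetical names, the model is any callable from prompt to response, and cost is left out because it depends on provider pricing that this page does not describe.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CaseResult:
    case_id: str
    prompt: str
    expected: str
    response: str
    passed: bool
    latency_s: float


def run_eval(
    model: Callable[[str], str],
    cases: List[Tuple[str, str, str]],      # (case_id, prompt, expected)
    scorer: Callable[[str, str], bool],     # (response, expected) -> pass/fail
) -> List[CaseResult]:
    """Run the model on every case, timing and scoring each response."""
    results = []
    for case_id, prompt, expected in cases:
        start = time.perf_counter()
        response = model(prompt)
        latency = time.perf_counter() - start
        passed = scorer(response, expected)
        results.append(CaseResult(case_id, prompt, expected, response, passed, latency))
    return results


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Aggregate per-question results into report-level metrics."""
    n = len(results)
    passed = sum(r.passed for r in results)
    return {
        "total": n,
        "passed": passed,
        "accuracy": passed / n if n else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / n if n else 0.0,
    }


def failures_as_records(results: List[CaseResult]) -> List[Dict[str, str]]:
    """Collect failing cases as records, e.g. to seed a fine-tuning dataset."""
    return [
        {"prompt": r.prompt, "expected": r.expected, "response": r.response}
        for r in results
        if not r.passed
    ]


# Toy usage: a "model" that always answers "4".
cases = [("q1", "2 + 2", "4"), ("q2", "3 * 5", "15")]
results = run_eval(lambda prompt: "4", cases, lambda resp, exp: resp.strip() == exp)
report = summarize(results)
failures = failures_as_records(results)
```

In this toy run, `q1` passes and `q2` fails, so `report["accuracy"]` is 0.5 and `failures` holds the single `q2` record.
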
When to Use Eval
| Situation | What to do |
|---|---|
| You have a new base model and want a baseline | Pick an environment from the hub and run it before any fine-tuning |
| You’ve just finished a training run | Re-run the same environment and compare scores |
| Your agent is returning bad answers | Export failing cases as a dataset and kick off a fine-tune |
| You want to compare two models | Run both against the same environment and diff the results |
| Your task isn’t in the hub | Upload a custom environment with your own dataset and scoring rules |
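
For the "compare scores" and "diff the results" rows, the comparison comes down to lining up per-question pass/fail results from two runs of the same environment. The sketch below assumes each run has already been reduced to a mapping from question ID to pass/fail. That shape, and the names `diff_runs` and `accuracy`, are illustrative assumptions rather than BenchGen's export format.

```python
from typing import Dict, List


def diff_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify questions present in both runs by how their outcome changed."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [q for q in shared if not before[q] and after[q]],
        "regressed": [q for q in shared if before[q] and not after[q]],
        "unchanged": [q for q in shared if before[q] == after[q]],
    }


def accuracy(run: Dict[str, bool]) -> float:
    """Fraction of questions that passed in a run."""
    return sum(run.values()) / len(run) if run else 0.0


# Toy data: the same three questions before and after a training run.
baseline = {"q1": True, "q2": False, "q3": False}
finetuned = {"q1": True, "q2": True, "q3": False}

delta = diff_runs(baseline, finetuned)
# delta["fixed"] == ["q2"], delta["regressed"] == [], delta["unchanged"] == ["q1", "q3"]
```

Looking at `regressed` alongside the aggregate accuracy is useful because a run can improve overall while still breaking questions that used to pass.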