Overview - BenchGen

Eval is BenchGen’s evaluation module. It gives you a structured, reproducible way to measure how well a model performs on a task before you ship it inside an agent or invest in fine-tuning. The core primitive in Eval is an environment: a self-contained package of test cases, an evaluation harness, and scoring rules. You pick an environment and point it at a model; Eval runs every test case, scores the responses, and reports the results.

What Is an Environment?

An environment bundles everything needed to run a reproducible evaluation:

Component	What it provides
Dataset	The test cases — inputs and expected outputs
Harness	The execution logic — how to run the model on each case
Scoring rules	How a response is judged: for example, exact match, LLM-as-judge, or a custom rubric

Because all three components travel together, you can share, version, and re-run an environment exactly — on any model, at any point in time.
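To make the three components concrete, here is a minimal sketch of an environment as a plain data structure. The `Case` and `Environment` types, the field names, and the toy model are all assumptions made for illustration; they are not BenchGen's actual API or bundle format.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical types for illustration only; not BenchGen's real API.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Case:
    prompt: str    # input shown to the model
    expected: str  # reference output


@dataclass(frozen=True)
class Environment:
    name: str
    version: str
    dataset: Tuple[Case, ...]              # the test cases
    harness: Callable[[Model, Case], str]  # how to run the model on one case
    score: Callable[[str, Case], bool]     # how a response is judged


def single_turn_harness(model: Model, case: Case) -> str:
    """Send the prompt once and return the raw response."""
    return model(case.prompt)


def exact_match(response: str, case: Case) -> bool:
    """Pass only if the response equals the expected output, ignoring outer whitespace."""
    return response.strip() == case.expected.strip()


env = Environment(
    name="toy-arithmetic",
    version="0.1.0",
    dataset=(Case("2 + 2 =", "4"), Case("3 * 3 =", "9")),
    harness=single_turn_harness,
    score=exact_match,
)


def always_four(prompt: str) -> str:
    """Stand-in model that answers "4" to every prompt."""
    return "4"


verdicts = [env.score(env.harness(always_four, case), case) for case in env.dataset]
print(verdicts)  # [True, False]
```

Because the dataset, harness, and scoring function live in one value, swapping `always_four` for another callable re-runs the same evaluation unchanged.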

Why Environments?

The current ecosystem for model evaluation has a few persistent problems we built BenchGen to address:

No shared platform for eval environments. Popular suites like lm_eval, lighteval, and HELM cover single-turn Q&A well, but they lack support for agentic tasks and for evaluations that require real infrastructure, such as tool-use benchmarks, multi-step reasoning chains, or code execution. The result is a proliferation of independent eval repos with no shared spec.
Evals and RL environments are structurally the same thing but are treated as separate. Both consist of a dataset, a harness, and scoring rules. Treating them differently creates duplicated work and fragmented tooling.
Environment implementations are hard to reuse. Most eval setups are tightly coupled to a specific framework or repo structure, making them difficult to adapt, version, or share.

The BenchGen Environments Hub addresses these problems by treating environments as first-class, versioned packages. Each one ships with its own dataset, harness, and scoring rules, and can be run against any model without modification.

The Environments Hub

The Environments Hub is BenchGen’s catalog of ready-to-use evaluation environments, covering common task categories including instruction following, code generation, reasoning, tool use, and domain-specific knowledge. Browse the hub, pick the environment that fits your task, and run it against your model in one click. You can also upload a custom environment if your use case isn’t covered.

Creating Your Own Environment

If the hub doesn’t have what you need, you can package your own evaluation tasks as a .zip bundle and upload it directly. BenchGen validates the bundle and makes your environment available for runs immediately.

Ready to build one? See Create a custom environment for a step-by-step walkthrough.
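As a rough illustration of what packaging a .zip bundle can look like, the sketch below builds an archive in memory with Python's standard `zipfile` module. The file names `manifest.json` and `dataset.jsonl` and the manifest fields are hypothetical placeholders; the layout BenchGen actually validates is described in the walkthrough, not here.

```python
import io
import json
import zipfile


def build_bundle(manifest, cases):
    """Return the bytes of a .zip archive holding a manifest and a JSONL dataset.

    The two file names used here are assumptions for illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        archive.writestr("dataset.jsonl", "\n".join(json.dumps(c) for c in cases))
    return buffer.getvalue()


bundle = build_bundle(
    {"name": "my-task", "version": "0.1.0", "scoring": "exact_match"},
    [{"input": "2 + 2 =", "expected": "4"}],
)

# Re-open the archive to confirm what it contains.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    names = sorted(archive.namelist())
print(names)  # ['dataset.jsonl', 'manifest.json']
```

Writing `bundle` to a file opened in binary mode produces a .zip you can inspect locally before uploading.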

What Eval Does

Select an environment and a model, then click Run. Eval executes the model against every test case in the environment, scores each response, and aggregates the results into a report. You get:

A per-question pass/fail breakdown
Aggregate accuracy, latency, and cost metrics
Exportable failure cases ready for fine-tuning
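The run, score, and aggregate loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Eval's implementation: `run_eval`, `summarize`, `failures_as_jsonl`, and the row fields are hypothetical names, and cost tracking is left out because it depends on provider pricing.

```python
import json
import time


def run_eval(model, cases, is_correct):
    """Run the model on every case and record a per-question result row."""
    rows = []
    for case in cases:
        start = time.perf_counter()
        response = model(case["input"])
        latency_s = time.perf_counter() - start
        rows.append({
            "input": case["input"],
            "expected": case["expected"],
            "response": response,
            "passed": is_correct(response, case["expected"]),
            "latency_s": latency_s,
        })
    return rows


def summarize(rows):
    """Aggregate per-question rows into accuracy and mean latency."""
    total = len(rows)
    if total == 0:
        return {"total": 0, "accuracy": 0.0, "mean_latency_s": 0.0}
    return {
        "total": total,
        "accuracy": sum(r["passed"] for r in rows) / total,
        "mean_latency_s": sum(r["latency_s"] for r in rows) / total,
    }


def failures_as_jsonl(rows):
    """Serialize failing cases as JSON Lines, one input/expected pair per line."""
    return "\n".join(
        json.dumps({"input": r["input"], "expected": r["expected"]})
        for r in rows
        if not r["passed"]
    )


cases = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "bye", "expected": "goodbye"},
]
# str.upper stands in for a model; exact match stands in for the scoring rule.
rows = run_eval(str.upper, cases, lambda response, expected: response == expected)
report = summarize(rows)
print(report["accuracy"])       # 0.5
print(failures_as_jsonl(rows))  # {"input": "bye", "expected": "goodbye"}
```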

When to Use Eval

Situation	What to do
You have a new base model and want a baseline	Pick an environment from the hub and run it before any fine-tuning
You’ve just finished a training run	Re-run the same environment and compare scores
Your agent is returning bad answers	Export failing cases as a dataset and kick off a fine-tune
You want to compare two models	Run both against the same environment and diff the results
Your task isn’t in the hub	Upload a custom environment with your own dataset and scoring rules
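For the comparison scenarios in the table, diffing two runs comes down to lining up per-question results by case. The sketch below assumes each run is a mapping from a case ID to pass/fail; that shape and the `diff_runs` helper are hypothetical, not an export format Eval is documented to produce.

```python
def diff_runs(baseline, candidate):
    """Compare two runs of the same environment, keyed by case ID."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no cases in common")

    def accuracy(run):
        return sum(run[case_id] for case_id in shared) / len(shared)

    return {
        # Cases the baseline passed but the candidate now fails.
        "regressions": [c for c in shared if baseline[c] and not candidate[c]],
        # Cases the baseline failed but the candidate now passes.
        "improvements": [c for c in shared if candidate[c] and not baseline[c]],
        "accuracy_delta": accuracy(candidate) - accuracy(baseline),
    }


before = {"q1": True, "q2": False, "q3": True, "q4": False}
after = {"q1": True, "q2": True, "q3": False, "q4": True}

diff = diff_runs(before, after)
print(diff["regressions"])     # ['q3']
print(diff["improvements"])    # ['q2', 'q4']
print(diff["accuracy_delta"])  # 0.25
```

Looking at regressions separately from the accuracy delta matters because a model can improve on aggregate while breaking cases that previously passed.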
