
# Overview

> What Eval does, how environments work, and what it hands off to Train and Agents.

Eval is BenchGen's evaluation module. It gives you a structured, reproducible way to measure how well a model performs on a task — before you ship it inside an agent or invest in fine-tuning.

The core primitive in Eval is an **environment**: a self-contained package of test cases, an evaluation harness, and scoring rules. You pick an environment, point it at a model, and Eval does the rest.

***

## What Is an Environment?

An environment bundles everything needed to run a reproducible evaluation:

| Component         | What it provides                                                    |
| ----------------- | ------------------------------------------------------------------- |
| **Dataset**       | The test cases — inputs and expected outputs                        |
| **Harness**       | The execution logic — how to run the model on each case             |
| **Scoring rules** | How a response is judged — exact match, LLM-as-judge, custom rubric |

Because all three components travel together, you can share, version, and re-run an environment exactly — on any model, at any point in time.
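
One way to picture the three components is as fields on a single versioned object. The sketch below is illustrative only: the `Case` and `Environment` classes, their field names, and the `toy-arithmetic` example are assumptions made for this page, not BenchGen's actual bundle format or API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interface: anything that maps a prompt to a response.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Case:
    """One dataset entry: an input and its expected output."""
    input: str
    expected: str


@dataclass(frozen=True)
class Environment:
    """Illustrative container for the three components described above."""
    name: str
    version: str
    dataset: List[Case]                      # the test cases
    harness: Callable[[Model, Case], str]    # how to run the model on a case
    score: Callable[[str, Case], bool]       # how a response is judged


def run_case(model: Model, case: Case) -> str:
    # Simplest possible harness: a single-turn call with the raw input.
    return model(case.input)


def exact_match(response: str, case: Case) -> bool:
    # Exact-match scoring rule, ignoring surrounding whitespace.
    return response.strip() == case.expected.strip()


arithmetic_env = Environment(
    name="toy-arithmetic",
    version="0.1.0",
    dataset=[Case("2 + 2", "4"), Case("3 * 3", "9")],
    harness=run_case,
    score=exact_match,
)
```

Because the dataset, harness, and scoring rule live on one object with a version, re-running the same evaluation on a different model only means swapping the `model` argument.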

***

## Why Environments?

The current ecosystem for model evaluation has a few persistent problems we built BenchGen to address:

* **No shared platform for eval environments.** Popular suites like `lm_eval`, `lighteval`, and `HELM` cover single-turn Q\&A well, but lack support for agentic tasks or evaluations that require real infrastructure — think tool-use benchmarks, multi-step reasoning chains, or code execution. The result is a proliferation of independent eval repos with no shared spec.
* **Evals and RL environments are structurally the same thing, yet they are treated as separate.** Both consist of a dataset, a harness, and scoring rules. Treating them differently creates duplicated work and fragmented tooling.
* **Environment implementations are hard to reuse.** Most eval setups are tightly coupled to a specific framework or repo structure, making them difficult to adapt, version, or share.

The BenchGen Environments Hub addresses these problems by treating environments as first-class, versioned packages. Each one ships with its own dataset, harness, and scoring rules, and can be run against any model without modification.

***

## The Environments Hub

The **Environments Hub** is BenchGen's catalog of ready-to-use evaluation environments, covering common task categories including instruction following, code generation, reasoning, tool use, and domain-specific knowledge.

Browse the hub, pick the environment that fits your task, and run it against your model in one click. You can also upload a custom environment if your use case isn't covered.

***

## Creating Your Own Environment

If the hub doesn't have what you need, you can package your own evaluation tasks as a `.zip` bundle and upload it directly. BenchGen validates the bundle, and the environment is then immediately available for runs.

<Info>
  **Ready to build one?** See [Create a custom environment](/eval/create-environment) for a step-by-step walkthrough.
</Info>

***

## What Eval Does

Select an environment and a model, then click **Run**. Eval executes the model against every test case in the environment, scores each response, and aggregates the results into a report.

You get:

* A per-test-case pass/fail breakdown
* Aggregate accuracy, latency, and cost metrics
* Exportable failure cases ready for fine-tuning
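
The execute, score, and aggregate flow can be sketched in a few lines. This is a minimal illustration, not Eval's implementation: `run_eval`, `CaseResult`, and the report keys are hypothetical names, the scoring rule is assumed to be exact match, and cost tracking is left out because it depends on the model provider.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CaseResult:
    input: str
    expected: str
    response: str
    passed: bool
    latency_s: float


def run_eval(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> Dict:
    """Run `model` on every (input, expected) pair and build a report."""
    results: List[CaseResult] = []
    for prompt, expected in cases:
        start = time.perf_counter()
        response = model(prompt)
        latency = time.perf_counter() - start
        # Assumed scoring rule: exact match after trimming whitespace.
        passed = response.strip() == expected.strip()
        results.append(CaseResult(prompt, expected, response, passed, latency))

    n = len(results)
    return {
        "results": results,  # per-case pass/fail breakdown
        "accuracy": sum(r.passed for r in results) / n if n else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / n if n else 0.0,
        # Failing cases in a shape that could be exported for fine-tuning.
        "failures": [
            {"input": r.input, "expected": r.expected, "response": r.response}
            for r in results
            if not r.passed
        ],
    }
```

For example, calling `run_eval` with a stub model that answers one of two cases correctly yields an `accuracy` of `0.5` and a single entry in `failures`.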

***

## When to Use Eval

| Situation                                     | What to do                                                          |
| --------------------------------------------- | ------------------------------------------------------------------- |
| You have a new base model and want a baseline | Pick an environment from the hub and run it before any fine-tuning  |
| You've just finished a training run           | Re-run the same environment and compare scores                      |
| Your agent is returning bad answers           | Export failing cases as a dataset and kick off a fine-tune          |
| You want to compare two models                | Run both against the same environment and diff the results          |
| Your task isn't in the hub                    | Upload a custom environment with your own dataset and scoring rules |
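
Comparing two runs on the same environment comes down to diffing their per-case outcomes. The sketch below assumes each run has already been reduced to a mapping from a case identifier to pass/fail; `diff_runs` and that mapping format are hypothetical, not an export format Eval is known to produce.

```python
from typing import Dict, List


def diff_runs(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Classify cases present in both runs by how their outcome changed."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Failed in the baseline, passes in the candidate.
        "fixed": sorted(c for c in shared if not baseline[c] and candidate[c]),
        # Passed in the baseline, fails in the candidate.
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        # Same outcome in both runs.
        "unchanged": sorted(c for c in shared if baseline[c] == candidate[c]),
    }


before = {"case-1": True, "case-2": False, "case-3": True}
after = {"case-1": True, "case-2": True, "case-3": False}
changes = diff_runs(before, after)
```

Looking at `regressed` separately from the aggregate score is useful after a training run, since two models can reach the same accuracy while failing on different cases.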

***

## Next Steps

* [Run a benchmark](/eval/run-a-benchmark)
* [Add a model](/eval/add-a-model)
* [Read results](/eval/read-results)
* [Export datasets to Train](/eval/export-datasets)
