Run Inference - BenchGen

After you train and save a model, you run inference on it by deploying it from its model card. Running a model loads it onto a GPU and exposes an OpenAI-compatible endpoint you can send requests to. This is the same Run Inference flow used for any model on BenchGen, so the full walkthrough lives in Deploy an inference model.

Prerequisites

A saved model from a training run (see Merge & save a model), or any model in your workspace (see Add a model).
A GPU node with free capacity.

Steps

1. Open the model and click Run Inference

Open the model’s card and click Run Inference in the top right.

A model card with the Run Inference button in the top right

2. Configure and start

In the Run Inference dialog, pick a GPU and review the inference settings (max tokens, temperature, top P, context length, precision), then click Run model.

The Run Inference dialog with GPU selection and inference settings
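Three of the dialog settings (max tokens, temperature, top P) have same-named fields in the OpenAI chat-completions request schema: `max_tokens`, `temperature`, and `top_p`. The sketch below checks values before they go into a request body. The helper name `sampling_params`, its defaults, and the accepted ranges (taken from the public OpenAI API) are assumptions for illustration. This page does not say whether BenchGen treats the dialog values as defaults or as hard limits.

```python
def sampling_params(max_tokens=256, temperature=0.7, top_p=1.0):
    """Validate common sampling settings and return them as request fields.

    Hypothetical helper: the ranges follow the OpenAI chat-completions API,
    not a documented BenchGen contract.
    """
    if not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be greater than 0 and at most 1")
    return {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
```

The returned dictionary can be merged into a request body next to `model` and `messages`.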

3. Wait for the endpoint to be ready

The status reads Deploying while the model starts, then changes to running once it is live. The model card’s endpoint panel then shows the LiteLLM Name, Endpoint URL, and Token.

Logs showing the application is ready and the status set to running

For the full deployment walkthrough, including the deployment logs and how to verify the model is serving, see Deploy an inference model.
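If you want to wait for readiness from a script rather than watching the status, one option is to poll the endpoint until it answers. This is a minimal sketch, assuming the deployment serves the usual OpenAI-compatible `GET /v1/models` route and accepts the same bearer token. Neither is confirmed on this page. The function names, retry count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request


def models_listed(base_url, token, timeout=5.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200.

    Assumption: the endpoint exposes the OpenAI-style /v1/models route.
    """
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe, attempts=30, delay=10.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause between probes, but not after the last one
    return False
```

A call might look like `wait_until_ready(lambda: models_listed("https://<your-endpoint-url>", "<YOUR_TOKEN>"))`. Keeping the probe separate from the retry loop lets you swap in a different readiness check.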

Send requests

Once the status is running, you can send requests to the endpoint with any OpenAI-compatible client. Use the LiteLLM Name, Endpoint URL, and Token from the model card:

curl https://<your-endpoint-url>/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<litellm-name>",
    "messages": [
      {"role": "user", "content": "Your test prompt here"}
    ],
    "max_tokens": 256
  }'
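The same request can be sent from Python using only the standard library. This is a sketch, not an official client. The function names are illustrative, and the response parsing assumes the standard OpenAI chat-completions shape (`choices[0].message.content`), which an OpenAI-compatible endpoint is expected to return.

```python
import json
import urllib.request


def build_chat_request(base_url, token, model, prompt, max_tokens=256):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,  # the LiteLLM Name from the model card
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a chat-completions response dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, token, model, prompt, max_tokens=256, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, token, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return first_reply(json.load(resp))
```

For example, `chat("https://<your-endpoint-url>", "<YOUR_TOKEN>", "<litellm-name>", "Your test prompt here")` returns the text of the first choice. An HTTP error status raises `urllib.error.HTTPError`.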

Next Steps

Monitor model usage to track requests, tokens, and latency.
Evaluate an inference model to benchmark it against an environment.
Run a benchmark to measure improvement formally.
