# Run Inference

> Serve a model you trained as a live, OpenAI-compatible endpoint and send it requests.

After you train and save a model, you run inference on it by deploying it from its model card. Running a model loads it onto a GPU and exposes an OpenAI-compatible endpoint you can send requests to. This is the same **Run Inference** flow used for any model on BenchGen, so the full walkthrough lives in [Deploy an inference model](/eval/run-an-inference-model).

***

## Prerequisites

* A saved model from a training run (see [Merge & save a model](/train/merge-lora-adapter)), or any model in your workspace (see [Add a model](/eval/add-a-model)).
* A GPU node with free capacity.

***

## Steps

### 1. Open the model and click Run Inference

Open the model's card and click **Run Inference** in the top right.

<img src="https://mintcdn.com/benchgen-8fc81371/QVPnryBJnoTcYyay/images/eval/inference/02-model-card.jpg?fit=max&auto=format&n=QVPnryBJnoTcYyay&q=85&s=cca86164a9bee3e1424992a6009f455a" alt="A model card with the Run Inference button in the top right" width="1478" height="941" data-path="images/eval/inference/02-model-card.jpg" />

### 2. Configure and start

In the **Run Inference** dialog, pick a **GPU** and review the inference settings (max tokens, temperature, top P, context length, precision), then click **Run model**.

<img src="https://mintcdn.com/benchgen-8fc81371/QVPnryBJnoTcYyay/images/eval/inference/03-run-inference-config.jpg?fit=max&auto=format&n=QVPnryBJnoTcYyay&q=85&s=01851f6245aa2002638f0063beaf5072" alt="The Run Inference dialog with GPU selection and inference settings" width="1478" height="941" data-path="images/eval/inference/03-run-inference-config.jpg" />

### 3. Wait for the endpoint to be ready

The status reads **Deploying** while the model starts up, then changes to **running** once it's live. The model card's endpoint panel then fills in with the **LiteLLM Name**, **Endpoint URL**, and **Token**.

<img src="https://mintcdn.com/benchgen-8fc81371/QVPnryBJnoTcYyay/images/eval/inference/06-deployment-ready.jpg?fit=max&auto=format&n=QVPnryBJnoTcYyay&q=85&s=87da5879d07297cab887e9d50b89320c" alt="Logs showing the application is ready and the status set to running" width="1478" height="941" data-path="images/eval/inference/06-deployment-ready.jpg" />

<Tip>
  For the full deployment walkthrough, including the deployment logs and how to verify the model is serving, see [Deploy an inference model](/eval/run-an-inference-model).
</Tip>
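If you would rather script the wait than watch the status, a polling loop is one option. The sketch below assumes that a live endpoint answers `GET /v1/models` with HTTP 200 when given the token from the model card. That is common for OpenAI-compatible servers but is not confirmed by this page. The function name, timings, and the injectable `probe` argument are illustrative.

```python
import time
import urllib.request


def wait_until_ready(endpoint_url, token, timeout_s=600.0, interval_s=10.0,
                     probe=None, sleep=time.sleep):
    """Poll the endpoint until it responds or the timeout elapses.

    Assumption: a live endpoint returns HTTP 200 for GET /v1/models.
    `probe` is injectable so the loop can be exercised without a network.
    """
    def default_probe():
        req = urllib.request.Request(
            endpoint_url.rstrip("/") + "/v1/models",
            headers={"Authorization": f"Bearer {token}"},
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            # Covers connection errors, timeouts, and HTTP error statuses:
            # the endpoint is not reachable yet or rejected the request.
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The function returns `True` as soon as a probe succeeds and `False` if the deadline passes first, so a script can decide whether to continue or stop and check the deployment logs.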

***

## Send requests

Once the status is **running**, you can send OpenAI-compatible requests to the endpoint. Use the **LiteLLM Name** as the `model` value, the **Endpoint URL** as the base URL, and the **Token** as the bearer token, all taken from the model card:

```bash
curl https://<your-endpoint-url>/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<litellm-name>",
    "messages": [
      {"role": "user", "content": "Your test prompt here"}
    ],
    "max_tokens": 256
  }'
```
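The same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the response follows the OpenAI chat-completions shape (`choices[0].message.content`). The function names are illustrative and not part of BenchGen.

```python
import json
import urllib.request


def build_chat_request(endpoint_url, token, model, prompt, max_tokens=256):
    """Build the same POST request as the curl example above."""
    body = {
        "model": model,  # the LiteLLM Name from the model card
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        endpoint_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text, assuming an OpenAI-style response body."""
    return response_json["choices"][0]["message"]["content"]


def chat(endpoint_url, token, model, prompt, max_tokens=256):
    """Send one prompt and return the model's reply as a string."""
    req = build_chat_request(endpoint_url, token, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

To try it, call `chat("https://<your-endpoint-url>", "<YOUR_TOKEN>", "<litellm-name>", "Your test prompt here")` with the values from the model card.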

***

## Next steps

* [Monitor model usage](/eval/monitor-model-usage) to track requests, tokens, and latency.
* [Evaluate an inference model](/eval/evaluate-a-running-model) to benchmark it against an environment.
* [Run a benchmark](/eval/run-a-benchmark) to measure improvement formally.
