After you train and save a model, you run inference on it by deploying it from its model card. Running a model loads it onto a GPU and exposes an OpenAI-compatible endpoint you can send requests to. This is the same Run Inference flow used for any model on BenchGen, so the full walkthrough lives in Deploy an inference model.

Prerequisites


Steps

1. Open the model and click Run Inference

Open the model’s card and click Run Inference in the top right.
[Figure: A model card with the Run Inference button in the top right]

2. Configure and start

In the Run Inference dialog, pick a GPU and review the inference settings (max tokens, temperature, top P, context length, precision), then click Run model.
[Figure: The Run Inference dialog with GPU selection and inference settings]

3. Wait for the endpoint to be ready

The status reads Deploying while the model starts up, then changes to running once the model is live. At that point the model card’s endpoint panel shows the LiteLLM Name, Endpoint URL, and access token.
[Figure: Logs showing the application is ready and the status set to running]
For the full deployment walkthrough, including the deployment logs and how to verify the model is serving, see Deploy an inference model.

Send requests

Once the status is running, you can send requests to the OpenAI-compatible endpoint. Use the LiteLLM Name, Endpoint URL, and Token from the model card; the LiteLLM Name goes in the model field of the request body:
curl https://<your-endpoint-url>/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<litellm-name>",
    "messages": [
      {"role": "user", "content": "Your test prompt here"}
    ],
    "max_tokens": 256
  }'
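
The same request can also be sent from a script. The following is a minimal sketch using only the Python standard library, not an official BenchGen client. The helper names (build_chat_request, extract_reply, chat) are hypothetical, and the response parsing assumes the endpoint returns the standard OpenAI chat completion shape (choices[0].message.content), which the source describes only as "OpenAI-compatible".

```python
import json
import urllib.request


def build_chat_request(endpoint_url, token, model, prompt, max_tokens=256):
    """Build a POST request for the /v1/chat/completions route.

    endpoint_url, token, and model correspond to the Endpoint URL, Token,
    and LiteLLM Name shown on the model card.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=endpoint_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a parsed chat completion body.

    Assumption: the body follows the OpenAI chat completion format.
    """
    return response_body["choices"][0]["message"]["content"]


def chat(endpoint_url, token, model, prompt, max_tokens=256, timeout=60):
    """Send one prompt and return the reply (performs a network call)."""
    request = build_chat_request(endpoint_url, token, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

To try it, call chat("https://<your-endpoint-url>", "<YOUR_TOKEN>", "<litellm-name>", "Your test prompt here") with the values from the model card. If the model is not yet running or the token is wrong, urlopen raises urllib.error.HTTPError or urllib.error.URLError, so a real script would likely want to catch those.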

Next Steps