Skip to main content
Once a model is serving traffic, the Usage tab on its model card is where you monitor performance and API usage. It summarizes requests, token counts, latency, and spend over a time window, and lets you drill into any individual request to see its full prompt, response, and metrics.
The Usage tab is available on any deployed model. If you don’t have one yet, see Deploy an inference model.

Steps

1. Open the Usage tab

Open a deployed model’s card and select the Usage tab. Use the Last 7 Days dropdown to change the time window, Refresh to pull the latest data, or Live to stream new requests as they arrive. The Usage tab with summary metric cards and the request log

2. Read the summary metrics

Four cards summarize activity for the selected window:
CardWhat it shows
RequestsTotal requests, with the success rate below.
TokensTotal tokens, split into input (prompt) and output (completion).
SpendTotal cost across all requests.
Avg LatencyMean response time, with the p95 latency below it.

3. Browse the request log

The Requests table lists every call to the endpoint, newest first. It’s paginated, so use the arrows to page through history.
ColumnMeaning
TimeWhen the request was made.
Statussuccess or error.
PromptInput (prompt) tokens.
CompletionOutput (completion) tokens.
TotalTotal tokens for the request.
LatencyResponse time in milliseconds.
SpendCost of the request.
RequestThe request ID (for example chatcmpl-…).

4. Inspect a single request

Click any row to open its detail drawer. This is where you debug a specific call. A single request's details, tags, and metrics
SectionWhat’s inside
Request DetailsModel, provider, call type, model ID, API base, and (redacted) IP address.
MetricsToken split, cost, duration, time to first token, cache hit, LiteLLM overhead, retries, and start/end times.
Cost BreakdownThe cost components that add up to the request total.
Request & ResponseThe full exchange, with a Pretty / JSON toggle. Shows the system and user input and the assistant output.
MetadataAdditional metadata captured for the request.
The request and response payloads for a single call

5. Watch live traffic

Toggle Live to auto-refresh the tab every 15 seconds. A banner confirms live mode is on; click Stop to pause it. This is useful while load-testing or watching a benchmark run drive requests. Live mode auto-refreshing the Usage tab

Next Steps

Usage data comes from real traffic. Deploy a model to start serving requests, or run an evaluation against it to generate activity you can watch here in real time.

Deploy an inference model

Spin up a live, OpenAI-compatible endpoint that records usage.

Evaluate an inference model

Benchmark a running model and watch the requests stream in.