Create a Managed Inference Job (vLLM)

Create a vLLM Managed Inference Job interactively with the CLI, then confirm it.

Create a Managed Inference Job from the CLI. Answer the prompts, and CosmicAC deploys the language model behind an OpenAI-compatible chat endpoint.

You need the following before you start:

Start the interactive job setup:

cosmicac jobs create

Select Managed Inference (vLLM) as the job type, then set these fields:

Job name — a name to identify the job.
Tags — comma-separated labels for the job.
Location — the region where the job runs.
GPU type — the GPU to use. The CLI lists the GPUs available in your location.
GPU count — the number of GPUs.
Model — the model to serve. Select a listed model, or enter a Hugging Face model ID to bring your own.
Data type — the numeric precision the model runs at.
Quantisation — how to compress the model weights.
Tensor parallel — how many GPUs to split the model across.
GPU memory utilization — the fraction of GPU memory to use.
Max model length — the maximum context length.
Max concurrent sequences — the maximum requests handled at once.
Reasoning parser — the parser for reasoning output.
Video & image input — whether the model accepts multimodal input.
Endpoint name — a name for the endpoint, used in its URL path.
Replicas — how many copies of the model to run.
Require Authorization header — whether callers must send an API key. See Create an API key.
Root disk size — the VM root disk size in GB. Choose 250, 500, or 1000.
Environment variables — optional variables passed to the serving container.

You can serve any model that vLLM supports. Browse the Hugging Face model hub or the vLLM supported models list to find one.

To serve a speech-to-text model instead, see Create a Managed Inference Job (Parakeet).

The Job configuration reference describes each field, and Recommended model parameters lists recommended values for supported models.

CosmicAC creates the job and prints its ID.

Check that your endpoint is serving:

cosmicac models healthcheck

Your endpoint appears as Endpoint: <endpoint-name>.