Models Console → Deployments | Applies to: Current production version | Last updated: 2026-04-20

Table of Contents

  1. What are Deployments
  2. Quick Start (Go live in 5 minutes)
  3. Creating a Deployment
  4. Deployment Lifecycle & Status
  5. Autoscaling In Depth
  6. LoRA Adapter Support
  7. Billing
  8. FAQ & Troubleshooting

1. What are Deployments

A Deployment is Novita’s dedicated AI inference endpoint product. Unlike Serverless endpoints that share compute resources with other users, each Deployment gives you:
  • Exclusive GPU: All compute resources are yours alone — no noisy neighbors
  • Predictable Performance SLA: Dedicated compute means consistent, foreseeable inference latency
  • Flexible Model Sources: Deploy any model from Hugging Face or the Novita model catalog
  • OpenAI-Compatible Chat API: For pure text inference, simply swap base_url and model to migrate existing OpenAI integrations
  • Per-second Billing: You are only charged while the endpoint is active. Billing pauses automatically when Scale-to-Zero kicks in
When to use Deployments:
| Use Case | Why it fits |
| --- | --- |
| Production API services | Stable latency, fully isolated from other users |
| Private or fine-tuned model serving | Deploy any custom HuggingFace model |
| High-concurrency inference | Scale to multiple replicas automatically |
| Cost-sensitive workloads | Scale-to-Zero stops billing during idle periods |

2. Quick Start (Go live in 5 minutes)

Step 1 — Navigate to Deployments
Log in to Novita → left sidebar → Models Console → Models APIs → Deployments
Step 2 — Create a Deployment
Click + New Deployment and fill in:
  • A Deployment name (e.g. my-llama3-endpoint)
  • Model source (the Novita model catalog is recommended for the fastest setup)
  • GPU instance (the system auto-recommends a suitable spec for your model)
  • Autoscaling settings
Step 3 — Wait for the Deployment to start
Startup time varies with model size, typically 5–60 minutes, progressing through three phases:
  1. Requesting GPU
  2. Downloading Model
  3. Engine Initializing
Once the status shows RUNNING, the endpoint is ready to receive requests.
Step 4 — Call the API
Go to the Deployment detail page → Quick Start panel → copy the ready-to-run code snippet.
Manage your API Keys under Settings → API Keys.

3. Creating a Deployment

Click + New Deployment to open the creation form, which has four configuration sections.

3.1 Naming

Recommended naming format: {model}-{environment}-{purpose} — e.g. llama3-prod-chatbot.

3.2 Selecting a Model

Two model sources are supported.

Novita Model Catalog

Choose from Novita’s hosted model list — no token required, works out of the box. Covers all major open-source models (Llama 3, Qwen, DeepSeek, Mistral, and more).
Novita pre-validates model compatibility and applies engine optimizations, resulting in faster startup and higher stability.

Hugging Face Models

Enter a HuggingFace repository ID (e.g. meta-llama/Meta-Llama-3-8B-Instruct).
  • Public models: No token needed, deploy directly
  • Private or Gated models: A HuggingFace Access Token must be linked first
How to link your HF Token:
  1. Go to HuggingFace → Settings → Access Tokens and create a token
  2. In the Model field on the Create Deployment form, click Integrate HF Token
  3. Paste and save the token
If your token expires or is revoked, active Deployments that rely on it will fail to re-pull the model. Keep your token up to date.

LoRA Adapter (Optional)

After selecting a Base Model, you can attach one or more LoRA Adapters from HuggingFace. Multiple adapters can run on the same Deployment without requiring additional GPU resources. See Section 6 — LoRA Adapter Support for details.
Model file format requirements (for custom HuggingFace models): weights must be in .safetensors format; .bin checkpoints are not supported.

3.3 Selecting a GPU Instance

The system automatically recommends a GPU configuration based on your model size.
TIGHT MEMORY warning: If the selected GPU has limited VRAM for the chosen model, the system shows a TIGHT MEMORY warning. Increase the GPU count or contact Novita support.
GPU type cannot be changed after a Deployment is created. To switch GPU type, delete and recreate the Deployment.

3.4 Configuring Autoscaling

Autoscaling controls how many replicas run in response to traffic. Use the dual-handle slider to set the replica range:
| Parameter | Description | Default |
| --- | --- | --- |
| Min Replicas | Minimum active replicas at all times. Set to 0 to enable Scale-to-Zero | 1 |
| Max Replicas | Maximum replicas during peak traffic | 3 |
| Scale-down Delay | Seconds to wait after traffic drops before scaling down (prevents flapping) | 300s (minimum) |
Scale-to-Zero (Min Replicas = 0):
  • After idling for longer than the Scale-down Delay, the Deployment enters SLEEPING status and billing pauses
  • The first incoming request wakes it up automatically
  • Cold start time: typically around 5 minutes, depending on model size
  • ⚠️ Best suited for dev/test or low-frequency workloads. For production, keep Min Replicas ≥ 1
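If you do enable Scale-to-Zero, the client should tolerate the cold start: the first request after SLEEPING can take minutes. The helper below is a generic sketch; the 420-second deadline and 10-second poll interval are illustrative assumptions, not Novita-documented limits.

```python
import time

def call_with_cold_start_retry(send_request, deadline_s=420, interval_s=10):
    """Retry send_request() until it succeeds or deadline_s elapses.

    Intended for the first call after a Scale-to-Zero sleep, where the
    endpoint may still be waking up and requests can fail or time out.
    """
    start = time.monotonic()
    while True:
        try:
            return send_request()
        except Exception:
            # Endpoint is likely still waking from SLEEPING; wait and retry.
            if time.monotonic() - start >= deadline_s:
                raise
            time.sleep(interval_s)
```

Wrap your normal API call in `send_request` (e.g. a lambda around the SDK call) so ordinary failures and wake-up failures are handled the same way.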

Disable Autoscaling

Runs a fixed number of replicas. Best for workloads with strict latency SLAs that cannot tolerate any scaling delay.

3.5 Engine Settings (Advanced)

Novita supports two inference engines — vLLM and SGLang — matched automatically to your model. These settings are hidden by default during Deployment creation.

Max Concurrency per Replica

Controls how many requests a single replica handles simultaneously.
| Setting | Effect |
| --- | --- |
| Below recommended | Lower latency, but limited throughput |
| Equal to recommended | Optimal balance of throughput and latency (recommended) |
| Above recommended | Higher throughput, but increased per-request latency |
The system calculates a recommended value based on your GPU instance. Default is 16.

Suffix Decoding

N-gram based speculative decoding that pre-generates future tokens to speed up inference.
  • Most effective for highly predictable output formats (e.g. code generation, structured JSON)
  • Provides limited benefit for free-form conversation; excessively high values may actually increase latency

4. Deployment Lifecycle & Status

State Transition Diagram

Create
  │
  ▼
PENDING ──── Waiting for GPU resource allocation
  │
  ▼
DEPLOYING ── Three sub-phases:
  │            ├─ Requesting GPU
  │            ├─ Downloading Model
  │            └─ Engine Initializing
  │
  ├──────────────► FAILED (deployment failed)
  │
  ▼
RUNNING ──── Live and accepting requests
  │
  ├─ Zero traffic + Scale-to-Zero enabled ──► SLEEPING
  │                                               │
  │                First request ──► DEPLOYING ──► RUNNING
  │
  ├─ Config update ──► ROLLING (zero-downtime rolling update)
  │
  ├─ Traffic change ──► SCALING (autoscaling in progress)
  │
  └─ Manual terminate ──► TERMINATING ──► TERMINATED (can be redeployed or deleted)
When billing starts: Only running replicas are billed. Instances still deploying and replicas still scaling up do not count toward charges.

5. Autoscaling In Depth

How It Works

Novita autoscaling monitors live traffic and dynamically adjusts replica count within the Min–Max range:
  • Scale-Up: Request queue backlog detected → add replicas → more GPUs handle requests in parallel
  • Scale-Down: Traffic drops → wait for Scale-down Delay to expire → reduce replicas
  • Scale-to-Zero: When Min Replicas = 0 and the Deployment has been idle past the delay, it enters SLEEPING and billing stops
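The behavior above can be sketched as a per-tick scaling decision. This is an illustrative toy model, not Novita's actual autoscaling algorithm; the queue-length trigger and the timing arguments are assumptions for the sketch.

```python
def desired_replicas(queue_len, current, min_r, max_r,
                     idle_since, now, scale_down_delay_s=300):
    """Return the replica count for the next tick.

    queue_len:  pending requests (backlog)
    idle_since: timestamp when traffic last dropped to zero, or None
    """
    if queue_len > 0 and current < max_r:
        return current + 1  # request backlog: add a replica
    if queue_len == 0 and current > min_r:
        # Only scale down once traffic has been gone for the full delay.
        if idle_since is not None and now - idle_since >= scale_down_delay_s:
            return current - 1
    return current
```

With min_r = 0 this reproduces Scale-to-Zero: after the idle period exceeds the scale-down delay, the count eventually reaches zero and billing stops.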

Cost vs. Availability Trade-off

| Configuration | Cost | Availability | Best for |
| --- | --- | --- | --- |
| Min=0, Max=N | Lowest (no charge when idle) | Cold start delay (5 min) | Dev/test, low-frequency workloads |
| Min=1, Max=N | Medium | Always available, scales on demand | Most production workloads ✅ |
| Min=N, Max=N (fixed) | Highest | No scaling delay at all | Ultra-low latency SLA requirements |

Per-Replica Cost

Each additional replica adds cost at the same GPU rate as the base replica. Example: a 2× H100 Deployment that scales to 2 replicas doubles the GPU cost.

Best Practices

  • Set Min Replicas = 1 in production to avoid cold starts impacting end users
  • The default Scale-down Delay of 300s (5 minutes) works well for most cases; increase it if your traffic is highly variable
  • Set Max Replicas to no more than about 1.5× (expected peak QPS ÷ per-replica QPS) to avoid unexpected cost spikes
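As a worked example of the Max Replicas rule of thumb, with traffic numbers invented purely for illustration:

```python
import math

# Rule of thumb: Max Replicas <= 1.5 * (peak QPS / per-replica QPS).
peak_qps = 120        # assumed peak requests per second
per_replica_qps = 25  # assumed sustainable QPS of a single replica

max_replicas = math.ceil(1.5 * peak_qps / per_replica_qps)
print(max_replicas)  # 8: 1.5 * 120 / 25 = 7.2, rounded up
```

Measure per-replica QPS under realistic prompts before relying on it; throughput varies with sequence length and concurrency.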

6. LoRA Adapter Support

What is LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds lightweight adapter layers on top of a Base Model to customize it for specific tasks — without retraining the full model.

Using LoRA in Novita Deployments

Adding adapters at creation time: In Create Deployment → Model field → after selecting a Base Model, click + Add Adapter and enter the LoRA adapter’s HuggingFace repository ID.
Viewing adapters at runtime: In the Engine Configuration panel, a +N LoRA badge appears next to the Model ID. Hover over it to see the full list of attached adapters.

Multi-LoRA: Multiple Adapters on One Deployment

A single Deployment can run multiple LoRA adapters simultaneously. Specify which adapter to use per request via the model field.
Multi-LoRA requires no extra GPU resources. All adapters share a single copy of the Base Model weights in memory.
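A sketch of per-request adapter selection. The assumption here (passing the adapter's repository ID in the model field of an otherwise unchanged chat request) should be confirmed against your Deployment's Quick Start panel; the model identifiers are placeholders.

```python
import json

def chat_body(model, prompt):
    """Build an OpenAI-style chat request body targeting a specific model."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

# Same endpoint, different model value per request:
base_request = chat_body("<base-model-id>", "Summarize this ticket.")        # base model
lora_request = chat_body("<user>/<lora-adapter-repo>", "Summarize this ticket.")  # adapter
```

Only the model value changes between requests; the endpoint URL and API key stay the same for every adapter on the Deployment.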

7. Billing

Billing Unit

Charged by GPU-second: number of GPUs × seconds running × unit price.

When Billing Starts and Stops

| Event | Details |
| --- | --- |
| Billing starts | After GPU allocation completes during DEPLOYING (i.e. when Downloading Model begins) |
| Billing stops | When the Deployment enters SLEEPING or TERMINATED status |
| Continuous billing | A RUNNING Deployment is billed even when it receives zero API requests |

GPU Pricing

For the latest pricing, refer to the Novita pricing page.

Billing Example

Scenario: A customer deploys model instance X on a single RTX 4090 (priced at $0.61/GPU/hour), with autoscaling set to Min=0, Max=5. Usage and charges for 9:00–10:00:
  1. 9:00:00 – 9:15:40 — Instance is SLEEPING. Charge: $0.00
  2. 9:15:41 – 9:16:45 — 1 running replica serving traffic (65 seconds). Charge: ($0.61 ÷ 3600) × 1 replica × 65s = $0.011
  3. 9:16:46 – 10:00:00 — 2 running replicas serving traffic (1,994 seconds). Charge: ($0.61 ÷ 3600) × 2 replicas × 1,994s = $0.676
Total for 9:00–10:00: $0.00 + $0.011 + $0.676 = $0.687
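The same charges can be reproduced as plain arithmetic (GPU count × seconds × per-second rate, at the $0.61/GPU/hour RTX 4090 price from the scenario):

```python
# Per-second rate derived from the hourly GPU price.
RATE_PER_GPU_HOUR = 0.61
rate_per_gpu_second = RATE_PER_GPU_HOUR / 3600

charge_sleeping = 0.0                                  # SLEEPING: not billed
charge_one_replica = rate_per_gpu_second * 1 * 65      # 1 replica for 65 s
charge_two_replicas = rate_per_gpu_second * 2 * 1994   # 2 replicas for 1,994 s

total = charge_sleeping + charge_one_replica + charge_two_replicas
print(round(total, 3))  # 0.687
```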

Cost Control Tips

  1. Enable Scale-to-Zero (Min Replicas = 0) for low-frequency workloads — zero cost when idle
  2. Audit your Deployment list regularly and delete unused Deployments
  3. Cap Max Replicas conservatively to prevent unexpected cost spikes from runaway autoscaling
  4. TERMINATED status costs nothing — terminate and redeploy on demand

8. FAQ & Troubleshooting

Deployment Issues

Q: My Deployment has been stuck in DEPLOYING for a long time — what should I do?
  • Requesting GPU: GPU resources may be constrained. Wait 5–10 minutes, or try a different GPU type
  • Downloading Model: Large models (70B+) can take 10+ minutes to download
  • Engine Initializing: Should complete within 5 minutes under normal conditions
Q: My Deployment shows FAILED — what are the common causes?
  • Model is not in .safetensors format (.bin is not supported)
  • HuggingFace Token is invalid or lacks access to a gated model
  • Insufficient GPU VRAM for the model (TIGHT MEMORY configuration)
  • Model architecture is not yet supported
Debugging steps: check the change log in the Settings Tab → verify model file format → validate the HF Token → increase GPU count and recreate the Deployment.
Q: My Deployment is SLEEPING — how do I wake it up?
Send any API request to it. The Deployment wakes up automatically. The first request waits for the cold start to complete before receiving a response.

API Issues

Q: What do the common HTTP error codes mean?
| Code | Cause | Resolution |
| --- | --- | --- |
| 400 | Malformed request | Validate your request JSON; ensure all required fields are present |
| 401 | Missing or invalid API Key | Include a valid key in Authorization: Bearer <Key> |
| 403 | API Key lacks access to this endpoint | Confirm the key belongs to the same account that owns the Deployment |
| 404 | Wrong Endpoint URL or Model ID | Re-copy the URL and Model ID from the Quick Start panel |
| 422 | Invalid parameter value (e.g. max_tokens too large) | Adjust the parameter — try reducing max_tokens |
| 429 | Rate limit exceeded | Reduce request frequency, or contact Novita to raise your limit |
| 500 | Internal server error | Retry after a short wait; if it persists, contact Novita support |
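For the retryable codes (429 and 500), a client-side backoff loop avoids hammering the endpoint. This is a generic sketch, not a Novita SDK feature; the attempt count and delays are illustrative.

```python
import random
import time

RETRYABLE = {429, 500}  # worth retrying after a wait; other codes are not

def request_with_backoff(send, max_attempts=5, base_delay_s=1.0):
    """Call send() -> (status, body), retrying 429/500 with backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
        time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body
```

Non-retryable errors (400, 401, 403, 404, 422) are returned immediately, since repeating an invalid request cannot succeed.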
Q: Where do I find my API Key?
Go to Settings → API Keys to create or manage keys. A key is only shown once at creation — save it immediately.

Billing Issues

Q: Why am I being charged when there are no requests?
A RUNNING Deployment continuously occupies GPU resources regardless of request volume. Fix: Enable Autoscaling and set Min Replicas = 0. The Deployment will automatically sleep and stop billing when idle.
Q: How do I stop all charges completely?
Two options:
  • Scale-to-Zero: Let autoscaling trigger naturally (requires Autoscaling on with Min = 0)
  • Terminate: Click Terminate on the Deployment detail page to release the GPU immediately

Appendix: Glossary

| Term | Definition |
| --- | --- |
| Deployment | Novita’s dedicated inference endpoint product |
| Replica | A single running instance of the inference service; multiple replicas run in parallel |
| Scale-to-Zero | Setting Min Replicas to 0 so the endpoint sleeps when idle and billing stops |
| Scale-down Delay | Wait period before scaling down, preventing flapping on variable traffic |
| LoRA Adapter | Lightweight fine-tuning plugin layered on top of a Base Model |
| Endpoint URL | The API access address for this Deployment |
| Endpoint ID | Unique identifier for this Deployment |
| Base Model | The underlying foundation model being served |
| Max Concurrency | Maximum simultaneous requests a single replica handles |
| Suffix Decoding | N-gram speculative decoding to accelerate inference on predictable outputs |
| GPU-second | Billing unit: 1 GPU running for 1 second |

For support, contact the Novita team at: support@novita.ai