Navigation: Models Console → Deployments Applies to: Current production version Last updated: 2026-04-20
Table of Contents
- What are Deployments
- Quick Start (Go live in 5 minutes)
- Creating a Deployment
- 3.1 Naming
- 3.2 Selecting a Model
- 3.3 Selecting a GPU Instance
- 3.4 Configuring Autoscaling
- 3.5 Engine Settings (Advanced)
- Deployment Lifecycle & Status
- Autoscaling In Depth
- LoRA Adapter Support
- Billing
- FAQ & Troubleshooting
1. What are Deployments
A Deployment is Novita’s dedicated AI inference endpoint product. Unlike Serverless endpoints, which share compute resources with other users, each Deployment gives you:
- Exclusive GPU: All compute resources are yours alone — no noisy neighbors
- Predictable Performance SLA: Dedicated compute means consistent, foreseeable inference latency
- Flexible Model Sources: Deploy any model from Hugging Face or the Novita model catalog
- OpenAI-Compatible Chat API: For pure text inference, simply swap `base_url` and `model` to migrate existing OpenAI integrations
- Per-second Billing: You are only charged while the endpoint is active. Billing pauses automatically when Scale-to-Zero kicks in
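Migrating an existing OpenAI integration can be sketched with just the Python standard library. The endpoint URL, API key, and model ID below are placeholders; copy the real values from your Deployment's Quick Start panel.

```python
import json
import urllib.request

def chat_completion_request(base_url, api_key, model, messages):
    """Build an OpenAI-style chat completion request for a dedicated endpoint.

    base_url and model are the only values that change when migrating an
    existing OpenAI integration to a Deployment.
    """
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder endpoint URL, key, and model ID: replace with the values
# shown on your Deployment's Quick Start panel.
req = chat_completion_request(
    base_url="https://your-endpoint.example.com/v1",
    api_key="YOUR_API_KEY",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
# To send it: response = urllib.request.urlopen(req)
```

The same request works with the official OpenAI SDK by passing `base_url` and `api_key` to the client constructor; only the two values change, not your application code.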
| Use Case | Why it fits |
|---|---|
| Production API services | Stable latency, fully isolated from other users |
| Private or fine-tuned model serving | Deploy any custom HuggingFace model |
| High-concurrency inference | Scale to multiple replicas automatically |
| Cost-sensitive workloads | Scale-to-Zero stops billing during idle periods |
2. Quick Start (Go live in 5 minutes)
Step 1 — Navigate to Deployments
Log in to Novita → left sidebar → Models Console → Models APIs → Deployments

Step 2 — Create a Deployment
Click + New Deployment and fill in:
- A Deployment name (e.g. `my-llama3-endpoint`)
- Model source (the Novita model catalog is recommended for the fastest setup)
- GPU instance (the system auto-recommends a suitable spec for your model)
- Autoscaling settings

Step 3 — Wait for the Deployment to start
The Deployment progresses through the following phases:
- Requesting GPU
- Downloading Model
- Engine Initializing

Manage your API Keys under Settings → API Keys.
3. Creating a Deployment
Click + New Deployment to open the creation form, which has four configuration sections plus an advanced Engine Settings panel (Section 3.5).

3.1 Naming
Recommended naming format: `{model}-{environment}-{purpose}` — e.g. `llama3-prod-chatbot`.
3.2 Selecting a Model
Two model sources are supported:

Novita Model Catalog (Recommended)
Choose from Novita’s hosted model list — no token required, works out of the box. It covers all major open-source models (Llama 3, Qwen, DeepSeek, Mistral, and more). Novita pre-validates model compatibility and applies engine optimizations, resulting in faster startup and higher stability.
Hugging Face Models
Enter a HuggingFace repository ID (e.g. `meta-llama/Meta-Llama-3-8B-Instruct`).
- Public models: No token needed, deploy directly
- Private or Gated models: A HuggingFace Access Token must be linked first
- Go to HuggingFace → Settings → Access Tokens and create a token
- In the Model field on the Create Deployment form, click Integrate HF Token
- Paste and save the token
If your token expires or is revoked, active Deployments that rely on it will fail to re-pull the model. Keep your token up to date.
LoRA Adapter (Optional)
After selecting a Base Model, you can attach one or more LoRA Adapters from HuggingFace. Multiple adapters can run on the same Deployment without requiring additional GPU resources. See Section 6 — LoRA Adapter Support for details.

Model file format requirements (for custom HuggingFace models): weights must be in .safetensors format (.bin is not supported).

3.3 Selecting a GPU Instance
The system automatically recommends a GPU configuration based on your model size.
TIGHT MEMORY warning: If the selected GPU has limited VRAM for the chosen model, the system shows a TIGHT MEMORY warning. Increase the GPU count or contact Novita support.
GPU type cannot be changed after a Deployment is created. To switch GPU type, delete and recreate the Deployment.
3.4 Configuring Autoscaling
Autoscaling controls how many replicas run in response to traffic.

Enable Autoscaling (Recommended)
Use the dual-handle slider to set the replica range:

| Parameter | Description | Default |
|---|---|---|
| Min Replicas | Minimum active replicas at all times. Set to 0 to enable Scale-to-Zero | 1 |
| Max Replicas | Maximum replicas during peak traffic | 3 |
| Scale-down Delay | Seconds to wait after traffic drops before scaling down (prevents flapping) | 300s (minimum) |
- After idling for longer than the Scale-down Delay, the Deployment enters SLEEPING status and billing pauses
- The first incoming request wakes it up automatically
- Cold start time: typically up to 5 minutes, depending on model size
- ⚠️ Best suited for dev/test or low-frequency workloads. For production, keep Min Replicas ≥ 1
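Because the first request after Scale-to-Zero can hit an endpoint that is still waking up, clients should retry until the cold start completes. This is an illustrative client-side pattern, not a Novita SDK feature; the `send` callable and its failure mode are assumptions you adapt to your HTTP client.

```python
import time

def request_with_wakeup_retry(send, max_wait_s=360, interval_s=10):
    """Retry a request until a sleeping Deployment finishes its cold start.

    `send` is any zero-argument callable that performs the request and raises
    on failure (e.g. connection refused or 5xx while the replica wakes up).
    Cold starts can take several minutes, so the budget defaults to 6 minutes.
    """
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            return send()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)

# Stand-in for a real request: fails twice while "waking", then succeeds.
attempts = {"n": 0}

def fake_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("replica not ready")
    return "ok"

result = request_with_wakeup_retry(fake_send, max_wait_s=5, interval_s=0.01)
```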
Disable Autoscaling
Runs a fixed number of replicas. Best for workloads with strict latency SLAs that cannot tolerate any scaling delay.

3.5 Engine Settings (Advanced)
Novita supports two inference engines — vLLM and SGLang — matched automatically to your model. These settings are hidden by default during Deployment creation.

Max Concurrency per Replica
Controls how many requests a single replica handles simultaneously.

| Setting | Effect |
|---|---|
| Below recommended | Lower latency, but limited throughput |
| Equal to recommended | Optimal balance of throughput and latency (recommended) |
| Above recommended | Higher throughput, but increased per-request latency |
The system calculates a recommended value based on your GPU instance. Default is 16.
Suffix Decoding
N-gram based speculative decoding that pre-generates future tokens to speed up inference.
- Most effective for highly predictable output formats (e.g. code generation, structured JSON)
- Provides limited benefit for free-form conversation; excessively high values may actually increase latency
4. Deployment Lifecycle & Status
State Transition Diagram
DEPLOYING (Requesting GPU → Downloading Model → Engine Initializing) → RUNNING ⇄ SLEEPING, with TERMINATED reachable from any state via Terminate.
When billing starts: Only running replicas are billed. Instances still deploying and replicas still scaling up do not count toward charges.
5. Autoscaling In Depth
How It Works
Novita autoscaling monitors live traffic and dynamically adjusts replica count within the Min–Max range:
- Scale-Up: Request queue backlog detected → add replicas → more GPUs handle requests in parallel
- Scale-Down: Traffic drops → wait for Scale-down Delay to expire → reduce replicas
- Scale-to-Zero: When Min Replicas = 0 and the Deployment has been idle past the delay, it enters SLEEPING and billing stops
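The three rules above can be sketched as a tiny controller. This is an illustrative model of the documented behavior, not Novita's actual scaling algorithm; function and parameter names are invented for the sketch.

```python
def desired_replicas(queue_backlog, current, idle_s,
                     min_replicas, max_replicas, scale_down_delay_s=300):
    """Illustrative replica controller following the documented rules.

    Backlog -> scale up toward Max; idle past the Scale-down Delay ->
    scale down toward Min (reaching 0 means SLEEPING when Min is 0).
    """
    if queue_backlog > 0:                      # Scale-Up
        return min(current + 1, max_replicas)
    if idle_s >= scale_down_delay_s:           # Scale-Down / Scale-to-Zero
        return max(current - 1, min_replicas)
    return current                             # within the delay window: hold
```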
Cost vs. Availability Trade-off
| Configuration | Cost | Availability | Best for |
|---|---|---|---|
| Min=0, Max=N | Lowest (no charge when idle) | Cold start delay (5 min) | Dev/test, low-frequency workloads |
| Min=1, Max=N | Medium | Always available, scales on demand | Most production workloads ✅ |
| Min=N, Max=N (fixed) | Highest | No scaling delay at all | Ultra-low latency SLA requirements |
Per-Replica Cost
Each additional replica adds cost at the same GPU rate as the base replica. Example: a 2× H100 Deployment that scales to 2 replicas doubles the GPU cost.

Best Practices
- Set Min Replicas = 1 in production to avoid cold starts impacting end users
- The default Scale-down Delay of 300s (5 minutes) works well for most cases; increase it if your traffic is highly variable
- Set Max Replicas to no more than 1.5× your expected replica need (peak QPS ÷ per-replica QPS) to avoid unexpected cost spikes
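The Max Replicas guideline above is a simple calculation; the example traffic numbers here are hypothetical:

```python
import math

def max_replicas_cap(peak_qps, per_replica_qps, headroom=1.5):
    """Cap Max Replicas at ~1.5x the replicas needed for peak traffic."""
    needed = math.ceil(peak_qps / per_replica_qps)  # replicas to serve the peak
    return math.ceil(needed * headroom)             # plus headroom for bursts

# e.g. 40 QPS peak, ~10 QPS per replica -> 4 replicas needed, cap at 6
cap = max_replicas_cap(40, 10)
```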
6. LoRA Adapter Support
What is LoRA
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds lightweight adapter layers on top of a Base Model to customize it for specific tasks — without retraining the full model.

Using LoRA in Novita Deployments
Adding adapters at creation time: In Create Deployment → Model field → after selecting a Base Model, click + Add Adapter and enter the LoRA adapter’s HuggingFace repository ID.

Viewing adapters at runtime: In the Engine Configuration panel, a +N LoRA badge appears next to the Model ID. Hover over it to see the full list of attached adapters.
Multi-LoRA: Multiple Adapters on One Deployment
A single Deployment can run multiple LoRA adapters simultaneously. Specify which adapter to use per request via the `model` field.
Multi-LoRA requires no extra GPU resources. All adapters share a single copy of the Base Model weights in memory.
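A per-request adapter selection might look like the following. The adapter repo ID is hypothetical, and the exact `model`-field naming convention is an assumption here; check your Deployment's Quick Start panel for the real IDs to use.

```python
import json

# Request routed to the base model.
base_payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Same endpoint, but the model field names an attached LoRA adapter.
# "your-org/llama3-sql-lora" is a hypothetical adapter repo ID.
adapter_payload = {
    "model": "your-org/llama3-sql-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
}

body = json.dumps(adapter_payload)
```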
7. Billing
Billing Unit
Charged by GPU-second: number of GPUs × seconds running × unit price.

When Billing Starts and Stops
| Event | Details |
|---|---|
| Billing starts | After GPU allocation completes during DEPLOYING (i.e. when Downloading Model begins) |
| Billing stops | When the Deployment enters SLEEPING or TERMINATED status |
| Continuous billing | A RUNNING Deployment is billed even when it receives zero API requests |
GPU Pricing
For the latest pricing, refer to the Novita pricing page.
Billing Example
Scenario: A customer deploys model instance X on a single RTX 4090 (priced at $0.61/GPU/hour), with autoscaling set to Min=0, Max=5. Usage and charges for 9:00–10:00:
- 9:00:00 – 9:15:40 — Instance is SLEEPING. Charge: $0.00
- 9:15:41 – 9:16:45 — 1 running replica serving traffic (65 seconds). Charge: 1 × 65 × $0.61/3600 ≈ $0.011
- 9:16:46 – 9:50:00 — 2 running replicas serving traffic (1,994 seconds). Charge: 2 × 1,994 × $0.61/3600 ≈ $0.676
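The arithmetic in the example can be reproduced directly from the GPU-second formula:

```python
def gpu_second_cost(gpus, seconds, hourly_rate_per_gpu):
    """GPU-second billing: GPUs x seconds x (hourly rate / 3600)."""
    return gpus * seconds * hourly_rate_per_gpu / 3600

rate = 0.61  # RTX 4090, $/GPU/hour

sleeping     = gpu_second_cost(0, 940, rate)   # SLEEPING: no running replicas
one_replica  = gpu_second_cost(1, 65, rate)    # 65 s with 1 replica
two_replicas = gpu_second_cost(2, 1994, rate)  # 1,994 s with 2 replicas
total = sleeping + one_replica + two_replicas
```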
Cost Control Tips
- Enable Scale-to-Zero (Min Replicas = 0) for low-frequency workloads — zero cost when idle
- Audit your Deployment list regularly and delete unused Deployments
- Cap Max Replicas conservatively to prevent unexpected cost spikes from runaway autoscaling
- TERMINATED status costs nothing — terminate and redeploy on demand
8. FAQ & Troubleshooting
Deployment Issues
Q: My Deployment has been stuck in DEPLOYING for a long time — what should I do?
- Requesting GPU: GPU resources may be constrained. Wait 5–10 minutes, or try a different GPU type
- Downloading Model: Large models (70B+) can take 10+ minutes to download
- Engine Initializing: Should complete within 5 minutes under normal conditions

Q: Why did my Deployment fail to start? Common causes:
- Model is not in .safetensors format (.bin is not supported)
- HuggingFace Token is invalid or lacks access to a gated model
- Insufficient GPU VRAM for the model (TIGHT MEMORY configuration)
- Model architecture is not yet supported
API Issues
Q: What do the common HTTP error codes mean?

| Code | Cause | Resolution |
|---|---|---|
| 400 | Malformed request | Validate your request JSON; ensure all required fields are present |
| 401 | Missing or invalid API Key | Include a valid key in Authorization: Bearer <Key> |
| 403 | API Key lacks access to this endpoint | Confirm the key belongs to the same account that owns the Deployment |
| 404 | Wrong Endpoint URL or Model ID | Re-copy the URL and Model ID from the Quick Start panel |
| 422 | Invalid parameter value (e.g. max_tokens too large) | Adjust the parameter — try reducing max_tokens |
| 429 | Rate limit exceeded | Reduce request frequency, or contact Novita to raise your limit |
| 500 | Internal server error | Retry after a short wait; if it persists, contact Novita support |
Billing Issues
Q: Why am I being charged when there are no requests?
A RUNNING Deployment continuously occupies GPU resources regardless of request volume. Fix: Enable Autoscaling and set Min Replicas = 0. The Deployment will automatically sleep and stop billing when idle.

Q: How do I stop all charges completely?
Two options:
- Scale-to-Zero: Let autoscaling trigger naturally (requires Autoscaling on with Min = 0)
- Terminate: Click Terminate on the Deployment detail page to release the GPU immediately
Appendix: Glossary
| Term | Definition |
|---|---|
| Deployment | Novita’s dedicated inference endpoint product |
| Replica | A single running instance of the inference service; multiple replicas run in parallel |
| Scale-to-Zero | Setting Min Replicas to 0 so the endpoint sleeps when idle and billing stops |
| Scale-down Delay | Wait period before scaling down, preventing flapping on variable traffic |
| LoRA Adapter | Lightweight fine-tuning plugin layered on top of a Base Model |
| Endpoint URL | The API access address for this Deployment |
| Endpoint ID | Unique identifier for this Deployment |
| Base Model | The underlying foundation model being served |
| Max Concurrency | Maximum simultaneous requests a single replica handles |
| Suffix Decoding | N-gram speculative decoding to accelerate inference on predictable outputs |
| GPU-second | Billing unit: 1 GPU running for 1 second |
For support, contact the Novita team at: support@novita.ai