Models Console → Deployments | Applies to: Current production version | Last updated: 2026-04-20

Table of Contents

  1. What are Deployments
  2. Quick Start (Go live in 5 minutes)
  3. Creating a Deployment
  4. Deployment Lifecycle & Status
  5. Autoscaling In Depth
  6. LoRA Adapter Support
  7. Billing
  8. FAQ & Troubleshooting

1. What are Deployments

A Deployment is Novita’s dedicated AI inference endpoint product. Unlike Serverless endpoints that share compute resources with other users, each Deployment gives you:
  • Exclusive GPU: All compute resources are yours alone — no noisy neighbors
  • Predictable Performance SLA: Dedicated compute means consistent, foreseeable inference latency
  • Flexible Model Sources: Deploy any model from Hugging Face or the Novita model catalog
  • OpenAI-Compatible Chat API: For pure text inference, simply swap base_url and model to migrate existing OpenAI integrations
  • Per-second Billing: You are only charged while the endpoint is active. Billing pauses automatically when Scale-to-Zero kicks in
When to use Deployments:
| Use Case | Why it fits |
| --- | --- |
| Production API services | Stable latency, fully isolated from other users |
| Private or fine-tuned model serving | Deploy any custom HuggingFace model |
| High-concurrency inference | Scale to multiple replicas automatically |
| Cost-sensitive workloads | Scale-to-Zero stops billing during idle periods |

2. Quick Start (Go live in 5 minutes)

Step 1 — Navigate to Deployments
Log in to Novita → left sidebar → Models Console → Models APIs → Deployments
Step 2 — Create a Deployment
Click + New Deployment and fill in:
  • A Deployment name (e.g. my-llama3-endpoint)
  • Model source (the Novita model catalog is recommended for the fastest setup)
  • GPU instance (the system auto-recommends a suitable spec for your model)
  • Autoscaling settings
Step 3 — Wait for the Deployment to start
Startup time varies with model size, typically 5–60 minutes, progressing through three phases:
  1. Requesting GPU
  2. Downloading Model
  3. Engine Initializing
Once the status shows RUNNING, the endpoint is ready to receive requests.
Step 4 — Call the API
Go to the Deployment detail page → Quick Start panel → copy the ready-to-run code snippet.
Manage your API Keys under Settings → API Keys.

3. Creating a Deployment

Click + New Deployment to open the creation form, which has four configuration sections.

3.1 Naming

Recommended naming format: {model}-{environment}-{purpose} — e.g. llama3-prod-chatbot.

3.2 Selecting a Model

Two model sources are supported.

Novita Model Catalog

Choose from Novita’s hosted model list — no token required, works out of the box. Covers all major open-source models (Llama 3, Qwen, DeepSeek, Mistral, and more).
Novita pre-validates model compatibility and applies engine optimizations, resulting in faster startup and higher stability.

Hugging Face Models

Enter a HuggingFace repository ID (e.g. meta-llama/Meta-Llama-3-8B-Instruct).
  • Public models: No token needed, deploy directly
  • Private or Gated models: A HuggingFace Access Token must be linked first
How to link your HF Token:
  1. Go to HuggingFace → Settings → Access Tokens and create a token
  2. In the Model field on the Create Deployment form, click Integrate HF Token
  3. Paste and save the token
If your token expires or is revoked, active Deployments that rely on it will fail to re-pull the model. Keep your token up to date.

LoRA Adapter (Optional)

After selecting a Base Model, you can attach one or more LoRA Adapters from HuggingFace. Multiple adapters can run on the same Deployment without requiring additional GPU resources. See Section 6 — LoRA Adapter Support for details.
Model file format requirements (for custom HuggingFace models): weights must be in .safetensors format; .bin checkpoints are not supported.

3.3 Selecting a GPU Instance

The system automatically recommends a GPU configuration based on your model size.
TIGHT MEMORY warning: If the selected GPU has limited VRAM for the chosen model, the system shows a TIGHT MEMORY warning. Increase the GPU count or contact Novita support.
GPU type cannot be changed after a Deployment is created. To switch GPU type, delete and recreate the Deployment.

3.4 Configuring Autoscaling

Autoscaling controls how many replicas run in response to traffic. Use the dual-handle slider to set the replica range:
| Parameter | Description | Default |
| --- | --- | --- |
| Min Replicas | Minimum active replicas at all times. Set to 0 to enable Scale-to-Zero | 1 |
| Max Replicas | Maximum replicas during peak traffic | 3 |
| Scale-down Delay | Seconds to wait after traffic drops before scaling down (prevents flapping) | 300s (minimum) |
Scale-to-Zero (Min Replicas = 0):
  • After idling for longer than the Scale-down Delay, the Deployment enters SLEEPING status and billing pauses
  • The first incoming request wakes it up automatically
  • Cold start time: typically around 5 minutes, depending on model size
  • ⚠️ Best suited for dev/test or low-frequency workloads. For production, keep Min Replicas ≥ 1
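If you do enable Scale-to-Zero, the client should tolerate the cold start: the first request after SLEEPING can take minutes. The helper below is a generic sketch; the 420-second deadline and 10-second poll interval are illustrative assumptions, not Novita-documented limits.

```python
import time

def call_with_cold_start_retry(send_request, deadline_s=420, interval_s=10):
    """Retry send_request() until it succeeds or deadline_s elapses.

    Intended for the first call after a Scale-to-Zero sleep, where the
    endpoint may still be waking up and requests can fail or time out.
    """
    start = time.monotonic()
    while True:
        try:
            return send_request()
        except Exception:
            # Endpoint is likely still waking from SLEEPING; wait and retry.
            if time.monotonic() - start >= deadline_s:
                raise
            time.sleep(interval_s)
```

Wrap your normal API call in `send_request` (e.g. a lambda around the SDK call) so ordinary failures and wake-up failures are handled the same way.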

Disable Autoscaling

Runs a fixed number of replicas. Best for workloads with strict latency SLAs that cannot tolerate any scaling delay.

3.5 Engine Settings (Advanced)

Novita supports two inference engines — vLLM and SGLang — matched automatically to your model. These settings are hidden by default during Deployment creation.

Max Concurrency per Replica

Controls how many requests a single replica handles simultaneously.
| Setting | Effect |
| --- | --- |
| Below recommended | Lower latency, but limited throughput |
| Equal to recommended | Optimal balance of throughput and latency (recommended) |
| Above recommended | Higher throughput, but increased per-request latency |
The system calculates a recommended value based on your GPU instance. Default is 16.

Suffix Decoding

N-gram based speculative decoding that pre-generates future tokens to speed up inference.
  • Most effective for highly predictable output formats (e.g. code generation, structured JSON)
  • Provides limited benefit for free-form conversation; excessively high values may actually increase latency

4. Deployment Lifecycle & Status

State Transition Diagram

Create
  │
  ▼
PENDING ──── Waiting for GPU resource allocation
  │
  ▼
DEPLOYING ── Three sub-phases:
  │            ├─ Requesting GPU
  │            ├─ Downloading Model
  │            └─ Engine Initializing
  │
  ├──────────────► FAILED (deployment failed)
  │
  ▼
RUNNING ──── Live and accepting requests
  │
  ├─ Zero traffic + Scale-to-Zero enabled ──► SLEEPING
  │                                               │
  │                First request ──► DEPLOYING ──► RUNNING
  │
  ├─ Config update ──► ROLLING (zero-downtime rolling update)
  │
  ├─ Traffic change ──► SCALING (autoscaling in progress)
  │
  └─ Manual terminate ──► TERMINATING ──► TERMINATED (can be redeployed or deleted)
When billing starts: Only running replicas are billed. Instances still deploying and replicas still scaling up do not count toward charges.

5. Autoscaling In Depth

How It Works

Novita autoscaling monitors live traffic and dynamically adjusts replica count within the Min–Max range:
  • Scale-Up: Request queue backlog detected → add replicas → more GPUs handle requests in parallel
  • Scale-Down: Traffic drops → wait for Scale-down Delay to expire → reduce replicas
  • Scale-to-Zero: When Min Replicas = 0 and the Deployment has been idle past the delay, it enters SLEEPING and billing stops
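The behavior above can be sketched as a per-tick scaling decision. This is an illustrative toy model, not Novita's actual autoscaling algorithm; the queue-length trigger and the timing arguments are assumptions for the sketch.

```python
def desired_replicas(queue_len, current, min_r, max_r,
                     idle_since, now, scale_down_delay_s=300):
    """Return the replica count for the next tick.

    queue_len:  pending requests (backlog)
    idle_since: timestamp when traffic last dropped to zero, or None
    """
    if queue_len > 0 and current < max_r:
        return current + 1  # request backlog: add a replica
    if queue_len == 0 and current > min_r:
        # Only scale down once traffic has been gone for the full delay.
        if idle_since is not None and now - idle_since >= scale_down_delay_s:
            return current - 1
    return current
```

With min_r = 0 this reproduces Scale-to-Zero: after the idle period exceeds the scale-down delay, the count eventually reaches zero and billing stops.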

Cost vs. Availability Trade-off

| Configuration | Cost | Availability | Best for |
| --- | --- | --- | --- |
| Min=0, Max=N | Lowest (no charge when idle) | Cold start delay (5 min) | Dev/test, low-frequency workloads |
| Min=1, Max=N | Medium | Always available, scales on demand | Most production workloads ✅ |
| Min=N, Max=N (fixed) | Highest | No scaling delay at all | Ultra-low latency SLA requirements |

Per-Replica Cost

Each additional replica adds cost at the same GPU rate as the base replica. Example: a 2× H100 Deployment that scales to 2 replicas doubles the GPU cost.

Best Practices

  • Set Min Replicas = 1 in production to avoid cold starts impacting end users
  • The default Scale-down Delay of 300s (5 minutes) works well for most cases; increase it if your traffic is highly variable
  • Set Max Replicas to no more than about 1.5× (expected peak QPS ÷ per-replica QPS) to avoid unexpected cost spikes
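As a worked example of the Max Replicas rule of thumb, with traffic numbers invented purely for illustration:

```python
import math

# Rule of thumb: Max Replicas <= 1.5 * (peak QPS / per-replica QPS).
peak_qps = 120        # assumed peak requests per second
per_replica_qps = 25  # assumed sustainable QPS of a single replica

max_replicas = math.ceil(1.5 * peak_qps / per_replica_qps)
print(max_replicas)  # 8: 1.5 * 120 / 25 = 7.2, rounded up
```

Measure per-replica QPS under realistic prompts before relying on it; throughput varies with sequence length and concurrency.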

6. LoRA Adapter Support

What is LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds lightweight adapter layers on top of a Base Model to customize it for specific tasks — without retraining the full model.

Using LoRA in Novita Deployments

Adding adapters at creation time: In Create Deployment → Model field → after selecting a Base Model, click + Add Adapter and enter the LoRA adapter’s HuggingFace repository ID.
Viewing adapters at runtime: In the Engine Configuration panel, a +N LoRA badge appears next to the Model ID. Hover over it to see the full list of attached adapters.

Multi-LoRA: Multiple Adapters on One Deployment

A single Deployment can run multiple LoRA adapters simultaneously. Specify which adapter to use per request via the model field.
Multi-LoRA requires no extra GPU resources. All adapters share a single copy of the Base Model weights in memory.
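A sketch of per-request adapter selection. The assumption here (passing the adapter's repository ID in the model field of an otherwise unchanged chat request) should be confirmed against your Deployment's Quick Start panel; the model identifiers are placeholders.

```python
import json

def chat_body(model, prompt):
    """Build an OpenAI-style chat request body targeting a specific model."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

# Same endpoint, different model value per request:
base_request = chat_body("<base-model-id>", "Summarize this ticket.")        # base model
lora_request = chat_body("<user>/<lora-adapter-repo>", "Summarize this ticket.")  # adapter
```

Only the model value changes between requests; the endpoint URL and API key stay the same for every adapter on the Deployment.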

7. Billing

Billing Unit

Charged by GPU-second: number of GPUs × seconds running × unit price.

When Billing Starts and Stops

| Event | Details |
| --- | --- |
| Billing starts | After GPU allocation completes during DEPLOYING (i.e. when Downloading Model begins) |
| Billing stops | When the Deployment enters SLEEPING or TERMINATED status |
| Continuous billing | A RUNNING Deployment is billed even when it receives zero API requests |

GPU Pricing

For the latest pricing, refer to the Novita pricing page.

Billing Example

Scenario: A customer deploys model instance X on a single RTX 4090 (priced at $0.61/GPU/hour), with autoscaling set to Min=0, Max=5. Usage and charges for 9:00–10:00:
  1. 9:00:00 – 9:15:40 — Instance is SLEEPING. Charge: $0.00
  2. 9:15:41 – 9:16:45 — 1 running replica serving traffic (65 seconds). Charge: ($0.61 ÷ 3600) × 1 replica × 65s = $0.011
  3. 9:16:46 – 10:00:00 — 2 running replicas serving traffic (1,994 seconds). Charge: ($0.61 ÷ 3600) × 2 replicas × 1,994s = $0.676
Total for 9:00–10:00: $0.00 + $0.011 + $0.676 = $0.687
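The same charges can be reproduced as plain arithmetic (GPU count × seconds × per-second rate, at the $0.61/GPU/hour RTX 4090 price from the scenario):

```python
# Per-second rate derived from the hourly GPU price.
RATE_PER_GPU_HOUR = 0.61
rate_per_gpu_second = RATE_PER_GPU_HOUR / 3600

charge_sleeping = 0.0                                  # SLEEPING: not billed
charge_one_replica = rate_per_gpu_second * 1 * 65      # 1 replica for 65 s
charge_two_replicas = rate_per_gpu_second * 2 * 1994   # 2 replicas for 1,994 s

total = charge_sleeping + charge_one_replica + charge_two_replicas
print(round(total, 3))  # 0.687
```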

Cost Control Tips

  1. Enable Scale-to-Zero (Min Replicas = 0) for low-frequency workloads — zero cost when idle
  2. Audit your Deployment list regularly and delete unused Deployments
  3. Cap Max Replicas conservatively to prevent unexpected cost spikes from runaway autoscaling
  4. TERMINATED status costs nothing — terminate and redeploy on demand

8. FAQ & Troubleshooting

Deployment Issues

Q: My Deployment has been stuck in DEPLOYING for a long time — what should I do?
  • Requesting GPU: GPU resources may be constrained. Wait 5–10 minutes, or try a different GPU type
  • Downloading Model: Large models (70B+) can take 10+ minutes to download
  • Engine Initializing: Should complete within 5 minutes under normal conditions
Q: My Deployment shows FAILED — what are the common causes?
  • Model is not in .safetensors format (.bin is not supported)
  • HuggingFace Token is invalid or lacks access to a gated model
  • Insufficient GPU VRAM for the model (TIGHT MEMORY configuration)
  • Model architecture is not yet supported
Debugging steps: check the change log in the Settings Tab → verify model file format → validate the HF Token → increase GPU count and recreate the Deployment.
Q: My Deployment is SLEEPING — how do I wake it up?
Send any API request to it. The Deployment wakes up automatically. The first request waits for the cold start to complete before receiving a response.

API Issues

Q: What do the common HTTP error codes mean?
| Code | Cause | Resolution |
| --- | --- | --- |
| 400 | Malformed request | Validate your request JSON; ensure all required fields are present |
| 401 | Missing or invalid API Key | Include a valid key in Authorization: Bearer <Key> |
| 403 | API Key lacks access to this endpoint | Confirm the key belongs to the same account that owns the Deployment |
| 404 | Wrong Endpoint URL or Model ID | Re-copy the URL and Model ID from the Quick Start panel |
| 422 | Invalid parameter value (e.g. max_tokens too large) | Adjust the parameter — try reducing max_tokens |
| 429 | Rate limit exceeded | Reduce request frequency, or contact Novita to raise your limit |
| 500 | Internal server error | Retry after a short wait; if it persists, contact Novita support |
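For the retryable codes (429 and 500), a client-side backoff loop avoids hammering the endpoint. This is a generic sketch, not a Novita SDK feature; the attempt count and delays are illustrative.

```python
import random
import time

RETRYABLE = {429, 500}  # worth retrying after a wait; other codes are not

def request_with_backoff(send, max_attempts=5, base_delay_s=1.0):
    """Call send() -> (status, body), retrying 429/500 with backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
        time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body
```

Non-retryable errors (400, 401, 403, 404, 422) are returned immediately, since repeating an invalid request cannot succeed.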
Q: Where do I find my API Key?
Go to Settings → API Keys to create or manage keys. A key is only shown once at creation — save it immediately.

Billing Issues

Q: Why am I being charged when there are no requests?
A RUNNING Deployment continuously occupies GPU resources regardless of request volume. Fix: Enable Autoscaling and set Min Replicas = 0. The Deployment will automatically sleep and stop billing when idle.
Q: How do I stop all charges completely?
Two options:
  • Scale-to-Zero: Let autoscaling trigger naturally (requires Autoscaling on with Min = 0)
  • Terminate: Click Terminate on the Deployment detail page to release the GPU immediately

Appendix: Glossary

| Term | Definition |
| --- | --- |
| Deployment | Novita’s dedicated inference endpoint product |
| Replica | A single running instance of the inference service; multiple replicas run in parallel |
| Scale-to-Zero | Setting Min Replicas to 0 so the endpoint sleeps when idle and billing stops |
| Scale-down Delay | Wait period before scaling down, preventing flapping on variable traffic |
| LoRA Adapter | Lightweight fine-tuning plugin layered on top of a Base Model |
| Endpoint URL | The API access address for this Deployment |
| Endpoint ID | Unique identifier for this Deployment |
| Base Model | The underlying foundation model being served |
| Max Concurrency | Maximum simultaneous requests a single replica handles |
| Suffix Decoding | N-gram speculative decoding to accelerate inference on predictable outputs |
| GPU-second | Billing unit: 1 GPU running for 1 second |

For support, contact the Novita team at: support@novita.ai