Pay only for what's running
Per-second billing on active replicas. Scale to zero, pay zero. No charges for idle endpoints, no minimum commitments.
Billing preview
3 replicas 4090 × 2h 15m 42s running
= 3 × 8,142s × $0.000608/s
= $14.85
Deploy models on dedicated GPUs. Per-second billing on running replicas only, affordable GPU options starting at $0.61/hr, and inference issues are on us.
Per-second billing on active replicas. Scale to zero, pay zero. No charges for idle endpoints, no minimum commitments.
Billing preview
3 replicas 4090 × 2h 15m 42s running
= 3 × 8,142s × $0.000608/s
= $14.85
Not every workload needs an H200. Choose from RTX 4090, RTX 5090, H100 — match the GPU to your model size and your budget.
RTX 4090
24 GB
from$0.61/hr
H100
80 GB
from$1.99/hr
H200
141 GB
from$2.99/hr
OOM, CUDA errors, model loading failures — our team diagnoses and resolves. You get a clear explanation, not a cryptic stack trace.
Report via console or email
Our team investigates root cause
You get diagnosis + resolution
Deployment options
Choose the right deployment model for your workload
Platform
Everything you need to deploy, scale, and manage AI inference in production.
Powered by optimized serving engines on NVIDIA H200, H100, and RTX 4090 GPUs. Sub-second latency for real-time applications.
Scale from 0 to N replicas automatically based on traffic. Set min/max replicas and scale-down delay to match your traffic patterns.
Hot-swap LoRA adapters on running endpoints without restarts. Deploy multiple fine-tuned variants on a single base model.
Choose from NVIDIA H200, H100, and RTX 4090. Select tensor parallelism and GPU count to match your model requirements.
Billed by GPU-hour with per-second granularity. Scale to zero when idle — no minimum commitment, no idle GPU costs.
Dedicated technical support, custom SLAs, and priority access to new GPU types. Volume discounts for large deployments.
Catalog
One-click deploy the most popular LLMs, or bring your own Hugging Face model.
Workflow
Everything you need to deploy, scale, and manage AI inference in production.
001
Pick your model
Search 50K+ models from Hugging Face, or paste your private repo URL.
002
Choose your GPU
See the recommended GPU for your model. Pick the one that fits your budget.
003
Deploy
Your endpoint is live in minutes. OpenAI-compatible URL ready to use.
POST
api.example.com/v1/chat/completions
Pricing
| GPU | VRAM | Price / GPU-hour |
|---|---|---|
NVIDIA H200 SXMPopular | 141 GB | $2.99 |
NVIDIA H100 SXM | 80 GB | $1.99 |
NVIDIA RTX 4090 | 24 GB | $0.61 |
Need reserved capacity or custom pricing?
Talk to our teamFAQ
Per-second on running replicas only. When your endpoint is scaled to zero or stopped, you pay nothing. No minimum commitments, no idle charges.
Still have questions? Contact support
200+ models, on-demand GPUs, and secure agent runtimes — unified under one API. Free to start, scales as you grow.