DEDICATED ENDPOINT

Run Your Models,
We Handle the Rest

Deploy models on dedicated GPUs. Per-second billing on running replicas only, affordable GPU options starting at $0.61/hr, and inference issues are on us.

Pay only for what's running

Per-second billing on active replicas. Scale to zero, pay zero. No charges for idle endpoints, no minimum commitments.

Billing preview

3 replicas 4090 × 2h 15m 42s running

= 3 × 8,142s × $0.000608/s

= $14.85

GPUs for every budget

Not every workload needs an H200. Choose from RTX 4090, RTX 5090, H100 — match the GPU to your model size and your budget.

RTX 4090

24 GB

from$0.61/hr

H100

80 GB

from$1.99/hr

H200

141 GB

from$2.99/hr

Inference breaks? That's on us.

OOM, CUDA errors, model loading failures — our team diagnoses and resolves. You get a clear explanation, not a cryptic stack trace.

Report via console or email

Our team investigates root cause

You get diagnosis + resolution

Deployment options

Serverless vs Dedicated

Choose the right deployment model for your workload

Serverless Endpoints

  • Pay per token
  • 600+ models available
  • Zero infrastructure management
  • Auto-scaling included
  • Best for variable workloads
Recommended for production

Dedicated Endpoints

  • Isolated GPU resources
  • Guaranteed latency SLA
  • Custom models & LoRA adapters
  • Scale-to-zero support
  • Best for predictable, high-throughput workloads

Platform

Built for Production Workloads

Everything you need to deploy, scale, and manage AI inference in production.

Blazing-Fast Inference

Powered by optimized serving engines on NVIDIA H200, H100, and RTX 4090 GPUs. Sub-second latency for real-time applications.

Dynamic Auto-Scaling

Scale from 0 to N replicas automatically based on traffic. Set min/max replicas and scale-down delay to match your traffic patterns.

LoRA Adapter Support

Hot-swap LoRA adapters on running endpoints without restarts. Deploy multiple fine-tuned variants on a single base model.

Flexible GPU Options

Choose from NVIDIA H200, H100, and RTX 4090. Select tensor parallelism and GPU count to match your model requirements.

Pay-Per-Hour Billing

Billed by GPU-hour with per-second granularity. Scale to zero when idle — no minimum commitment, no idle GPU costs.

Enterprise Support

Dedicated technical support, custom SLAs, and priority access to new GPU types. Volume discounts for large deployments.

Workflow

Deploy in 3 Steps

Everything you need to deploy, scale, and manage AI inference in production.

001

Pick your model

Search 50K+ models from Hugging Face, or paste your private repo URL.

Search models... e.g. Qwen2.5-7B
Qwen2.5-7B-Instruct
Llama-3.1-8B
Mistral-7B-v0.3

002

Choose your GPU

See the recommended GPU for your model. Pick the one that fits your budget.

RTX 4090 · $0.61/hr
H100 · $1.99/hr
H200 · $2.99/hr
Replicas
1

003

Deploy

Your endpoint is live in minutes. OpenAI-compatible URL ready to use.

Pricing

Transparent GPU Pricing

GPUVRAMPrice / GPU-hour
NVIDIA H200 SXMPopular
141 GB$2.99
NVIDIA H100 SXM
80 GB$1.99
NVIDIA RTX 4090
24 GB$0.61

Need reserved capacity or custom pricing?

Talk to our team

FAQ

Frequently Asked Questions

Per-second on running replicas only. When your endpoint is scaled to zero or stopped, you pay nothing. No minimum commitments, no idle charges.

Still have questions? Contact support

Everything you need to build production AI.

200+ models, on-demand GPUs, and secure agent runtimes — unified under one API. Free to start, scales as you grow.