Vision Language Models
Overview
Vision-Language Models (VLMs) are a type of multimodal foundation model capable of processing both image and text inputs. These models understand visual content in conjunction with language instructions, and generate high-quality responses based on the combined context. They are widely used in scenarios involving image recognition, content interpretation, and intelligent visual Q&A.
Typical Use Cases
- Image Recognition and Description: Automatically identifies objects, colors, scenes, and spatial relationships in images, and generates natural language descriptions.
- Multimodal Understanding: Combines image input and contextual text for multi-turn dialogue and task completion.
- Visual Question Answering: Answers natural-language questions about image content, including recognizing and interpreting text embedded in images (advanced OCR).
- Emerging Applications: Ideal for use in intelligent vision assistants, robot perception, AR interfaces, and more.
API Usage Guide
To invoke a Vision-Language Model, use the `/chat/completions` endpoint with both image and text inputs.
Image Detail Parameter
Use the `detail` field to control image resolution. The following modes are available:
- `high`: High resolution, preserves more detail, ideal for precision tasks.
- `low`: Low resolution, faster response, suitable for real-time usage.
- `auto`: Automatically selects the appropriate mode.
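For illustration, here is where the field sits inside an image content part, assuming the OpenAI-compatible content-part format used throughout this guide (the URL is a placeholder):

```python
# Image content part with the optional "detail" field
# (assumed OpenAI-compatible format; URL is a placeholder).
image_part = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/photo.jpg",
        "detail": "high",  # one of "high", "low", "auto"
    },
}
```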
Example Message Format
Image via URL
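A minimal sketch of a user message that passes an image by URL alongside a text prompt, assuming the OpenAI-compatible content-part format (the prompt and URL are placeholders):

```python
# User message combining a text prompt and an image referenced by URL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/cat.jpg"},
            },
        ],
    }
]
```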
Image via Base64
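For local files, the same field can carry a base64 data URI instead of a URL; a sketch, assuming JPEG content and the `data:image/...;base64,` URI scheme:

```python
# Same structure, but the image is embedded as a base64 data URI
# (assumed scheme: "data:image/<format>;base64,<encoded bytes>").
base64_image = "/9j/4AAQSkZJRg..."  # placeholder: base64-encoded JPEG bytes

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
            },
        ],
    }
]
```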
Python Code: Encode Image to Base64
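A small standard-library helper that reads a local image file and returns its base64-encoded contents (the file path is a placeholder):

```python
import base64

def encode_image(image_path: str) -> str:
    """Read a local image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Example usage (path is a placeholder):
base64_image = encode_image("my_photo.jpg")
```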
Multi-Image Input
The API supports sending multiple images alongside text input. For best results, we recommend sending no more than two images per request.
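A sketch of a single user message carrying two images plus a text prompt, again assuming the OpenAI-compatible content-part format (both URLs are placeholders):

```python
# One user message with a text prompt and two images; keeping to two images
# per request follows the recommendation above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two images."},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.jpg"}},
        ],
    }
]
```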
Supported Models
The following Vision-Language Models are currently supported on the Novita platform:
meta-llama/llama-4-maverick-17b-128e-instruct-fp8
meta-llama/llama-4-scout-17b-16e-instruct
google/gemma-3-27b-it
qwen/qwen2.5-vl-72b-instruct
Visit the Model Hub for a complete and up-to-date list of available models.
Billing
Image inputs are converted into tokens and billed together with text input.
- Each model uses a different image-to-token conversion method.
- Refer to each model’s pricing page for detailed billing and token policy.
API Call Examples
Single Image Description
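A complete end-to-end sketch using the OpenAI Python SDK against an OpenAI-compatible endpoint; the base URL and environment-variable name are assumptions (confirm them in the API reference), the model is one of those listed above, and the image URL is a placeholder:

```python
import os
from openai import OpenAI

# Client pointed at an OpenAI-compatible endpoint
# (base URL is an assumption; confirm it in the API reference).
client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key=os.environ["NOVITA_API_KEY"],
)

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",  # any supported VLM from the list above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```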
Multi-Image Comparison
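And the corresponding sketch for a two-image comparison request, under the same assumptions (placeholder URLs, assumed endpoint):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",  # assumed endpoint; see the API reference
    api_key=os.environ["NOVITA_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",  # any supported VLM
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are the differences between these two images?",
                },
                {"type": "image_url", "image_url": {"url": "https://example.com/image_a.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image_b.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```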
Notes & Troubleshooting
- Image resolution and clarity significantly affect model performance. Use high-quality sources where possible.
- Base64-encoded images should ideally be under 1MB to avoid timeouts or errors; a quick size check is sketched after this list.
- For detailed usage, book a call with our sales team or contact support if needed.
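As a rough guard for the size note above, here is a sketch that checks the encoded size of a local image before embedding it (the 1MB figure comes from the note; the helper name and file path are illustrative):

```python
import base64

def encoded_size_ok(image_path: str, limit_bytes: int = 1_000_000) -> bool:
    """Return True if the base64-encoded image stays under roughly 1MB."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read())
    return len(encoded) <= limit_bytes

if not encoded_size_ok("my_photo.jpg"):  # path is a placeholder
    print("Consider resizing or compressing the image, or passing it by URL instead.")
```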