# Vision Language Models - Documentation

Source: /docs/guides/llm-vision

## Overview

Vision-Language Models (VLMs) are multimodal foundation models that process both image and text inputs. They interpret visual content together with language instructions and generate responses based on the combined context. They are widely used for image recognition, content interpretation, and visual Q&A.

### Typical Use Cases

- Image Recognition and Description: Automatically identifies objects, colors, scenes, and spatial relationships in images, and generates natural language descriptions.

- Multimodal Understanding: Combines image input and contextual text for multi-turn dialogue and task completion.

- Visual Question Answering: Answers questions about image content, including recognizing and interpreting text embedded in images, so it can act as an advanced OCR tool.

- Emerging Applications: Ideal for use in intelligent vision assistants, robot perception, AR interfaces, and more.

## API Usage Guide

To invoke a Vision-Language Model, use the `/chat/completions` endpoint with both image and text inputs.
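As a minimal sketch of what such a request body might look like, the helper below assembles a `/chat/completions` payload with one image part and one text part. The function name `build_vision_request` is hypothetical and exists only for illustration; the model ID and the message structure are the ones used elsewhere in this guide. The helper only builds the payload and does not send it.

```python
import json


def build_vision_request(model, image_url, prompt):
    """Assemble a chat-completions request body with one image and one text part.

    Hypothetical helper for illustration only; it builds the payload
    and leaves sending it to the caller.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


payload = build_vision_request(
    "qwen/qwen2.5-vl-72b-instruct",
    "https://example.com/image.png",
    "Please describe the scene in the image.",
)
print(json.dumps(payload, indent=2))
```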

### Image Detail Parameter

Set the `detail` field inside the `image_url` object to control the resolution at which the image is processed. The following modes are available:

- `high`: High resolution; preserves more detail and suits precision tasks.

- `low`: Low resolution with faster responses; suits real-time usage.

- `auto`: Automatically selects the appropriate mode.
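One way to avoid typos in the mode name is to validate it before building the message part. The sketch below assumes the three modes listed above are the only accepted values; the helper `image_part` and the constant `VALID_DETAIL_MODES` are hypothetical names, and defaulting to `auto` is a choice made for this sketch, not documented API behavior.

```python
VALID_DETAIL_MODES = {"high", "low", "auto"}  # the modes listed in this guide


def image_part(url, detail="auto"):
    """Build an image content part, rejecting unknown detail modes.

    Hypothetical helper; the "auto" default is this sketch's choice.
    """
    if detail not in VALID_DETAIL_MODES:
        raise ValueError(
            f"detail must be one of {sorted(VALID_DETAIL_MODES)}, got {detail!r}"
        )
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


part = image_part("https://example.com/image.png", detail="high")
print(part)
```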

### Example Message Format

#### Image via URL

```
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "Please describe the scene in the image."
    }
  ]
}
```

#### Image via Base64

```
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "What text is present in the image?"
    }
  ]
}
```

### Python Code: Encode Image to Base64

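```
import base64
import io

from PIL import Image


def image_to_base64(image_path):
    """Re-encode an image as JPEG and return it as a Base64 string."""
    with Image.open(image_path) as img:
        # JPEG has no alpha channel, so convert RGBA or palette images to RGB first.
        if img.mode != "RGB":
            img = img.convert("RGB")
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


base64_image = image_to_base64("path/to/your/image.jpg")
```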
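The function above re-encodes every image as JPEG. If you would rather send the original file bytes unchanged, a possible alternative is to build the data URL directly from the file and guess the MIME type from its extension. This is a sketch under assumptions: `file_to_data_url` is a hypothetical helper, and this guide only shows `image/jpeg` data URLs, so whether other MIME types are accepted should be confirmed for the model you use.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_data_url(image_path):
    """Read an image file as-is and wrap its bytes in a data URL.

    Hypothetical helper; the MIME type is guessed from the file extension.
    """
    path = Path(image_path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path.name!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be placed directly in the `url` field of an `image_url` part.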

## Multi-Image Input

The API supports sending multiple images alongside text input. For best results, we recommend sending no more than two images per request.

```
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "Compare the common features of these two images."
    }
  ]
}
```
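When the image list is assembled programmatically, a small helper can build the message and flag requests that exceed the two-image recommendation. The sketch below treats that recommendation as a soft guideline and only emits a warning; `build_multi_image_message` and `RECOMMENDED_MAX_IMAGES` are hypothetical names, not part of the API.

```python
import warnings

RECOMMENDED_MAX_IMAGES = 2  # recommendation from this guide, not a hard API limit


def build_multi_image_message(image_urls, prompt):
    """Build a user message with several images followed by one text part.

    Hypothetical helper; accepts HTTP(S) URLs or Base64 data URLs.
    """
    if not image_urls:
        raise ValueError("At least one image URL is required")
    if len(image_urls) > RECOMMENDED_MAX_IMAGES:
        warnings.warn(
            f"{len(image_urls)} images supplied; "
            f"{RECOMMENDED_MAX_IMAGES} or fewer are recommended per request"
        )
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


message = build_multi_image_message(
    ["https://example.com/image1.png", "https://example.com/image2.png"],
    "Compare the common features of these two images.",
)
```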

## Supported Models

The Vision-Language Models currently supported on the Novita platform are listed in the [Model Hub](https://novita.ai/models-console/library), which provides a complete and up-to-date list of available models.

## Billing

Image input is tokenized and counted toward billing together with text.

- Each model uses a different image-to-token conversion method.

- Refer to each model’s pricing page for detailed billing and token policy.
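To get a rough feel for how token counts translate into cost, the sketch below multiplies token counts by per-million-token prices. It rests on several assumptions: that the response carries an OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens` (this guide does not show the response body), that image tokens are counted inside `prompt_tokens`, and that pricing is quoted per million tokens. The function name `estimate_cost` and the prices used are hypothetical; take real figures from the model's pricing page.

```python
def estimate_cost(usage, input_price_per_million, output_price_per_million):
    """Estimate request cost from token counts and per-million-token prices.

    Hypothetical helper; assumes image tokens are part of prompt_tokens.
    """
    input_cost = usage["prompt_tokens"] * input_price_per_million / 1_000_000
    output_cost = usage["completion_tokens"] * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Made-up token counts and prices, for illustration only.
example_usage = {"prompt_tokens": 1200, "completion_tokens": 300}
cost = estimate_cost(example_usage, 0.80, 0.80)
print(f"Estimated cost: ${cost:.6f}")
```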

## API Call Examples

### Single Image Description

```
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.novita.ai/openai")

# Request a streamed description of a single image referenced by URL.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/cityscape.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the main buildings in the image."
                }
            ]
        }
    ],
    stream=True
)

# Print each text fragment as soon as it arrives.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

### Multi-Image Comparison

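```
from openai import OpenAI

# Same client setup as in the single-image example.
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.novita.ai/openai")

# Send two images followed by a text instruction in one user message.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product2.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Please compare the key differences between these two products."
                }
            ]
        }
    ],
    stream=True
)

# Print each text fragment as soon as it arrives.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```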
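Both examples print fragments as they arrive. If you also need the complete answer as a single string, one option is to accumulate the fragments while iterating. The sketch below assumes each chunk has the `choices[0].delta.content` shape used in the loops above, and it skips chunks with an empty `choices` list, which some OpenAI-compatible services emit. `collect_stream` and `fake_chunk` are hypothetical helpers; the fake chunks stand in for a real response so the sketch runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text fragments of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("A tall "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("tower."),
]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a real request, pass the `response` object returned by `client.chat.completions.create(..., stream=True)` to `collect_stream` instead of `fake_stream`.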

## Notes & Troubleshooting

- Image resolution and clarity significantly affect model performance. Use high-quality sources where possible.

- Base64-encoded images should ideally be under 1 MB to avoid timeouts or errors.

- For more detailed usage guidance, [book a call with our sales team](https://meet.brevo.com/novita-ai/contact-sales) or contact support.
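Base64 encoding expands data by about one third, so a raw file of roughly 750 KB already reaches about 1 MB once encoded. The sketch below computes the encoded length from the raw size so you can check an image against the size guideline before sending it. It assumes the guideline applies to the encoded string and that "1 MB" means 1,000,000 bytes; both are interpretations, and `base64_length`, `fits_size_guideline`, and `MAX_BASE64_BYTES` are hypothetical names.

```python
import base64

MAX_BASE64_BYTES = 1_000_000  # assumes "1 MB" means 10**6 bytes of encoded text


def base64_length(raw_size):
    """Return the length of standard padded Base64 for raw_size input bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_size_guideline(raw_size, limit=MAX_BASE64_BYTES):
    """Check whether a file of raw_size bytes stays within the limit once encoded."""
    return base64_length(raw_size) <= limit


# Sanity check of the formula against the standard library encoder.
sample = b"x" * 10
print(base64_length(len(sample)) == len(base64.b64encode(sample)))
```

To apply the check to a file, pass its size in bytes, for example `os.path.getsize(path)`.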

Last modified on August 12, 2025
