Overview

Vision-Language Models (VLMs) are a type of multimodal foundation model capable of processing both image and text inputs. These models understand visual content in conjunction with language instructions, and generate high-quality responses based on the combined context. They are widely used in scenarios involving image recognition, content interpretation, and intelligent visual Q&A.

Typical Use Cases

  • Image Recognition and Description: Automatically identifies objects, colors, scenes, and spatial relationships in images, and generates natural language descriptions.
  • Multimodal Understanding: Combines image input and contextual text for multi-turn dialogue and task completion.
  • Visual Question Answering: Answers natural-language questions about image content, including recognizing and interpreting text embedded in images (advanced OCR).
  • Emerging Applications: Ideal for use in intelligent vision assistants, robot perception, AR interfaces, and more.

API Usage Guide

To invoke a Vision-Language Model, use the /chat/completions endpoint with both image and text inputs.

Image Detail Parameter

Use the detail field to control image resolution. The following modes are available:

  • high: High resolution, preserves more detail—ideal for precision tasks.
  • low: Low resolution, faster response—suitable for real-time usage.
  • auto: Automatically selects the appropriate mode.

Example Message Format

Image via URL

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "Please describe the scene in the image."
    }
  ]
}

Image via Base64

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "What text is present in the image?"
    }
  ]
}

Python Code: Encode Image to Base64

import base64
import io

from PIL import Image

def image_to_base64(image_path):
    """Return the image at image_path as a Base64-encoded JPEG string."""
    with Image.open(image_path) as img:
        # JPEG does not support transparency, so convert RGBA/palette images to RGB first.
        img = img.convert("RGB")
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

base64_image = image_to_base64("path/to/your/image.jpg")
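
The encoded string can then be embedded in a data URI and used as the url of an image_url content part. The following is a minimal sketch that mirrors the Base64 message format shown above; the prompt text is only a placeholder.

message = {
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                # Match the JPEG format produced by image_to_base64 above.
                "url": f"data:image/jpeg;base64,{base64_image}",
                "detail": "low"
            }
        },
        {
            "type": "text",
            "text": "What text is present in the image?"
        }
    ]
}

This message dict can be passed directly in the messages list of client.chat.completions.create, as shown in the API Call Examples below.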

Multi-Image Input

The API supports sending multiple images alongside text input. For best results, we recommend sending no more than two images per request.

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "Compare the common features of these two images."
    }
  ]
}

Supported Models

The following Vision-Language Models are currently supported on the Novita platform:

  • meta-llama/llama-4-maverick-17b-128e-instruct-fp8
  • meta-llama/llama-4-scout-17b-16e-instruct
  • google/gemma-3-27b-it
  • qwen/qwen2.5-vl-72b-instruct

Visit the Model Hub for a complete and up-to-date list of available models.


Billing

Image inputs are converted into tokens and billed together with text tokens.

  • Each model uses a different image-to-token conversion method.
  • Refer to each model’s pricing page for detailed billing and token policy.
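
To see how many tokens an image actually consumed, you can inspect the usage field of a non-streaming response. The sketch below assumes the OpenAI-compatible endpoint returns standard usage statistics; the image URL and prompt are placeholders.

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.novita.ai/v3/openai")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
                {"type": "text", "text": "Please describe the scene in the image."}
            ]
        }
    ]
)

# Prompt tokens include the tokenized image; completion tokens cover the generated text.
print(response.usage.prompt_tokens, response.usage.completion_tokens)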

API Call Examples

Single Image Description

from openai import OpenAI

# Point the OpenAI-compatible client at the Novita API endpoint.
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.novita.ai/v3/openai")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/cityscape.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the main buildings in the image."
                }
            ]
        }
    ],
    stream=True
)

# Print the streamed response as tokens arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Multi-Image Comparison

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product2.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Please compare the key differences between these two products."
                }
            ]
        }
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Notes & Troubleshooting

  • Image resolution and clarity significantly affect model performance. Use high-quality sources where possible.
  • Base64-encoded images should ideally be under 1MB to avoid timeouts or errors; see the sketch after these notes for one way to downscale and re-encode larger images.
  • For detailed usage questions, book a call with our sales team or contact support.
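
One way to stay under the suggested 1MB ceiling is to downscale and re-compress the image before Base64-encoding it. The sketch below reuses Pillow, as in the earlier encoding example; the 1024-pixel cap and JPEG quality of 85 are assumptions you may want to tune for your images.

import base64
import io

from PIL import Image

MAX_BASE64_BYTES = 1_000_000  # roughly 1MB of Base64 text (assumed target)

def image_to_small_base64(image_path, max_side=1024, quality=85):
    """Downscale and JPEG-compress an image, then return it as a Base64 string."""
    with Image.open(image_path) as img:
        img = img.convert("RGB")              # JPEG has no alpha channel
        img.thumbnail((max_side, max_side))   # cap the longest side, keep aspect ratio
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG", quality=quality)
    encoded = base64.b64encode(buffered.getvalue()).decode("utf-8")
    if len(encoded) > MAX_BASE64_BYTES:
        # Still large: consider lowering max_side or quality before sending.
        print(f"Warning: encoded image is {len(encoded)} bytes.")
    return encoded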