Qwen2.5-VL-72B-Instruct

Introduction

In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements

  • Understand things visually:
    Qwen2.5-VL is proficient in recognizing objects like flowers, birds, fish, and insects, and can analyze texts, charts, icons, graphics, and layouts within images.

  • Being agentic:
    Qwen2.5-VL acts as a visual agent that can reason and dynamically direct tools, capable of computer and phone use.

  • Understanding long videos and capturing events:
    Qwen2.5-VL can comprehend videos over 1 hour, and capture specific events by pinpointing relevant segments.

  • Capable of visual localization:
    Qwen2.5-VL can localize objects by generating bounding boxes or points, and provides stable JSON outputs for coordinates (see the parsing sketch after this list).

  • Generating structured outputs:
    Supports structured outputs for scanned invoices, forms, tables, etc., useful in finance and commerce.
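
For instance, a grounding query such as "Detect all birds and output their locations in JSON" typically returns a JSON list that can be parsed directly. Below is a minimal parsing sketch; the bbox_2d/label schema and the pixel-coordinate convention follow Qwen2.5-VL's published grounding examples and should be treated as an assumption here:

```python
import json

# Illustrative model reply to a grounding prompt; the bbox_2d/label schema is an
# assumption based on Qwen2.5-VL's documented grounding examples.
reply = '[{"bbox_2d": [135, 114, 1016, 672], "label": "bird"}]'

for obj in json.loads(reply):
    x1, y1, x2, y2 = obj["bbox_2d"]  # top-left and bottom-right corners in pixel coordinates
    print(obj["label"], (x1, y1), (x2, y2))
```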


Model Architecture Updates

Dynamic Resolution and Frame Rate Training for Video Understanding

  • Dynamic FPS sampling enables the model to understand videos at various frame rates.
  • Updated mRoPE in the time dimension with IDs aligned to absolute time, enabling the model to learn temporal sequences and speeds (illustrated by the sketch below).
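
To illustrate the absolute-time alignment (a simplified sketch, not the actual mRoPE implementation): temporal position IDs are tied to timestamps rather than frame indices, so the same moment maps to the same temporal ID regardless of the sampling rate. The 0.5 s-per-ID interval below is an assumed constant for illustration.

```python
# Simplified illustration of time-aligned temporal position IDs (not the real mRoPE code).
# Assumption: one temporal-ID step corresponds to a fixed real-time interval, here 0.5 s.
SECONDS_PER_ID = 0.5

def temporal_ids(num_frames: int, fps: float) -> list[int]:
    """Map sampled frames to temporal position IDs via their absolute timestamps."""
    return [round((i / fps) / SECONDS_PER_ID) for i in range(num_frames)]

# The same 4 seconds of video sampled at 1 FPS and at 2 FPS:
print(temporal_ids(4, fps=1.0))  # [0, 2, 4, 6]       -> one ID step per 0.5 s of video
print(temporal_ids(8, fps=2.0))  # [0, 1, 2, ..., 7]  -> denser frames, same time spacing
```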

Streamlined and Efficient Vision Encoder

  • Introduced window attention into ViT.
  • Enhanced ViT with SwiGLU and RMSNorm (see the sketch after this list).
  • Architecture aligns with the Qwen2.5 LLM structure.
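
For reference, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block (standard formulations, not the model's actual implementation; dimensions are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: scale by 1/RMS(x), no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(x W_gate) * (x W_up), then project back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 64)            # (batch, patch tokens, embed dim) -- arbitrary sizes
y = SwiGLU(64, 172)(RMSNorm(64)(x))   # norm followed by gated MLP (illustrative ordering)
print(y.shape)                        # torch.Size([2, 16, 64])
```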

Evaluation

Image Benchmark

| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
| Hallusion Bench | 55.0 | 55.16 | - | 57.4 | 58.1 | 55.16 |
| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
| AI2D_TEST | 84.6 | 81.2 | - | 89.1 | 88.1 | 88.4 |
| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
| MMVet_turbo | 69.1 | 70.1 | - | 72.3 | 74.0 | 76.19 |
| OCRBench | 736 | 788 | - | 854 | 877 | 885 |
| OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |

Video Benchmark

| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
| --- | --- | --- | --- | --- | --- |
| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
| VideoMME w/ sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
| PerceptionTest_test | - | - | - | 68.0 | 73.2 |
| MLVU_M-Avg_dev | 64.6 | - | 75.7 | - | 74.6 |
| TempCompass_overall | 73.8 | - | - | - | 74.8 |

Agent Benchmark

| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot | 18.1 | 84.0 | 83.0 | - | - | 87.1 |
| ScreenSpot Pro | - | - | 17.1 | - | 1.6 | 43.6 |
| AITZ_EM | 35.3 | - | - | - | 72.8 | 83.2 |
| Android Control High_EM | - | - | - | 66.4 | 59.1 | 67.36 |
| Android Control Low_EM | - | - | - | 84.4 | 59.2 | 93.7 |
| AndroidWorld_SR | 34.5% | - | 27.9% | 26.1% | - | 35% |
| MobileMiniWob++_SR | - | - | - | 66% | - | 68% |
| OSWorld | - | - | 14.90 | 10.26 | - | 8.83 |

Requirements

Install the latest transformers (the Qwen2.5-VL code requires building from source; otherwise you may encounter `KeyError: 'qwen2_5_vl'`):

```bash
pip install git+https://github.com/huggingface/transformers accelerate
```

Install qwen-vl-utils; the [decord] extra is recommended for faster video loading:

```bash
pip install qwen-vl-utils[decord]==0.0.8
```

If decord fails to install (it may not be available on non-Linux platforms), fall back to the base package, which uses torchvision for video decoding:

```bash
pip install qwen-vl-utils
```

Quickstart

The snippet below shows how to use the chat model with transformers and qwen_vl_utils:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# The processor handles image/video preprocessing and chat templating.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, strip the prompt tokens, then decode.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
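
Video inputs follow the same pipeline; a content item of type "video" can reference a local video file. A minimal sketch reusing the model and processor from above (the video path is a placeholder, and the fps/max_pixels values are illustrative):

```python
# Minimal video-inference sketch; reuses `model` and `processor` from the quickstart.
# The file path is a placeholder; fps and max_pixels below are illustrative settings.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```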

Additional Usage Tips

  • Image inputs can be local file paths, URLs, or base64-encoded images.
  • Flexible Image Resolutions:
    You can adjust min_pixels and max_pixels, or specify an exact resized_height and resized_width per image (both approaches are shown below).

Example:

```python
# Bound the per-image visual token budget by constraining the pixel range.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=256*28*28, max_pixels=1280*28*28
)
```
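
Exact dimensions can also be requested per image inside the message itself. A short sketch (the height/width values are illustrative; dimensions are rounded to multiples of 28 internally):

```python
# Per-image resize override (illustrative values; rounded to multiples of 28 internally).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```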

Processing Long Texts

The model's default context length is 32,768 tokens. To handle inputs that exceed this, enable YaRN, a technique for length extrapolation, by adding the following to config.json:

```json
{
  "type": "yarn",
  "mrope_section": [16, 24, 24],
  "factor": 4,
  "original_max_position_embeddings": 32768
}
```

Note: Using YaRN may impact spatial and temporal localization performance.
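
If you prefer to script the change, a minimal sketch that patches a locally downloaded checkpoint's config.json (the path is a placeholder) might look like this:

```python
import json
from pathlib import Path

# Placeholder path to a local copy of the checkpoint.
config_path = Path("/path/to/Qwen2.5-VL-72B-Instruct/config.json")

config = json.loads(config_path.read_text())
# Merge in the YaRN settings shown above.
config.update({
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
})
config_path.write_text(json.dumps(config, indent=2))
```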


Citation

```bibtex
@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month = {January},
  year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and ...},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and ...},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```