In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
- **Understand things visually:** Qwen2.5-VL is proficient in recognizing objects such as flowers, birds, fish, and insects, and can analyze texts, charts, icons, graphics, and layouts within images.
- **Being agentic:** Qwen2.5-VL acts as a visual agent that can reason and dynamically direct tools, making it capable of computer and phone use.
- **Understanding long videos and capturing events:** Qwen2.5-VL can comprehend videos of over 1 hour and capture specific events by pinpointing the relevant segments.
- **Capable of visual localization:** Qwen2.5-VL can localize objects by generating bounding boxes or points, and can provide stable JSON outputs.
- **Generating structured outputs:** Qwen2.5-VL supports structured outputs for scanned invoices, forms, tables, etc., which is useful in finance and commerce.

| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B | 
|---|---|---|---|---|---|---|
| MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 | 
| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 | 
| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 | 
| Hallusion Bench | 55.0 | 55.16 | - | 57.4 | 58.1 | 55.16 |
| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 | 
| AI2D_TEST | 84.6 | 81.2 | - | 89.1 | 88.1 | 88.4 |
| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 | 
| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 | 
| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 | 
| MMVet_turbo | 69.1 | 70.1 | - | 72.3 | 74.0 | 76.19 |
| OCRBench | 736 | 788 | - | 854 | 877 | 885 |
| OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 | 
| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 | 

| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|
| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 | 
| VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 | 
| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 | 
| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 | 
| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 | 
| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 | 
| PerceptionTest_test | - | - | - | 68.0 | 73.2 | 
| MLVU_M-Avg_dev | 64.6 | - | 75.7 | - | 74.6 | 
| TempCompass_overall | 73.8 | - | - | - | 74.8 | 

| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2-VL-72B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|
| ScreenSpot | 18.1 | 84.0 | 83.0 | 87.1 | ||
| ScreenSpot Pro | 17.1 | 1.6 | 43.6 | |||
| AITZ_EM | 35.3 | 72.8 | 83.2 | |||
| Android Control High_EM | 66.4 | 59.1 | 67.36 | |||
| Android Control Low_EM | 84.4 | 59.2 | 93.7 | |||
| AndroidWorld_SR | 34.5% | 27.9% | 26.1% | 35% | ||
| MobileMiniWob++_SR | 66% | 68% | ||||
| OSWorld | 14.90 | 10.26 | 8.83 | 
Install the latest transformers:

```bash
pip install git+https://github.com/huggingface/transformers accelerate
```
Install qwen-vl-utils:

```bash
pip install qwen-vl-utils[decord]==0.0.8
```
If installing decord fails on a non-Linux platform, fall back to the plain package (video decoding then uses torchvision):

```bash
pip install qwen-vl-utils
```
Here is a minimal example of using the chat model with transformers and qwen_vl_utils:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor (weights are downloaded on first use).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# A single-turn conversation with one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and extract image/video inputs from the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, strip the prompt tokens from the output, then decode.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
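The same chat interface covers the visual-localization capability mentioned above. The snippet below is a minimal sketch that reuses `model`, `processor`, and `process_vision_info` from the quickstart; the prompt wording and the `bbox_2d`/`label` keys noted in the comments are assumptions based on typical Qwen2.5-VL grounding usage, not a guaranteed schema, so the raw output is parsed defensively.

```python
import json

# Hedged sketch of a visual-localization (grounding) query, reusing `model`
# and `processor` from the quickstart above.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            # Assumed prompt wording; adjust the target category as needed.
            {"type": "text", "text": "Locate every person in the image and output the bounding boxes in JSON format."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
raw = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# The reply may wrap the JSON in a markdown fence; grab the outermost array
# and parse it defensively. Items are expected to look roughly like
# {"bbox_2d": [x1, y1, x2, y2], "label": "person"} (assumed keys).
start, end = raw.find("["), raw.rfind("]")
boxes = []
if start != -1 and end != -1:
    try:
        boxes = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        pass
print(boxes)
```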
To trade off speed and memory against quality, you can set min_pixels and max_pixels to bound the pixel budget (and hence the number of visual tokens) per image, or specify an exact resized_height and resized_width. Example:
```python
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=256*28*28, max_pixels=1280*28*28
)
```
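Resolution can also be controlled per image inside the message itself. The sketch below assumes the resized_height/resized_width keys handled by qwen-vl-utils (sizes are rounded to multiples of 28); the numbers are illustrative only.

```python
# Hedged sketch: per-image resolution control through the message dict.
# resized_height/resized_width force an exact size (rounded to multiples
# of 28 by qwen-vl-utils); per-image min_pixels/max_pixels instead bound
# the pixel budget while keeping the aspect ratio. Values are examples.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```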
To handle inputs exceeding 32,768 tokens, enable YaRN by adding the following to config.json:
```json
{
  "type": "yarn",
  "mrope_section": [16, 24, 24],
  "factor": 4,
  "original_max_position_embeddings": 32768
}
```
Note: Using YaRN may impact spatial and temporal localization performance.
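Long videos are one of the main workloads that push context length. As a rough illustration of the video path, here is a minimal sketch that reuses the quickstart `model` and `processor` and assumes qwen-vl-utils' video message format (a "type": "video" entry pointing at a local file); the path is a placeholder and frame-sampling options are left at library defaults, so consult the model card for the exact knobs.

```python
# Hedged sketch of video input. The file path is a placeholder; the
# "type": "video" message format is assumed from qwen-vl-utils, and
# frame sampling is left at the library defaults.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "Describe this video and list the key events."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```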
```bibtex
@misc{qwen2.5-VL,
  title  = {Qwen2.5-VL},
  url    = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month  = {January},
  year   = {2025}
}

@article{Qwen2VL,
  title   = {Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author  = {Wang, Peng and Bai, Shuai and ...},
  journal = {arXiv preprint arXiv:2409.12191},
  year    = {2024}
}

@article{Qwen-VL,
  title   = {Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author  = {Bai, Jinze and Bai, Shuai and ...},
  journal = {arXiv preprint arXiv:2308.12966},
  year    = {2023}
}
```