Qwen2.5-VL-72B-Instruct

Introduction

In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements

  • Understand things visually:
    Qwen2.5-VL is proficient in recognizing objects like flowers, birds, fish, and insects, and can analyze texts, charts, icons, graphics, and layouts within images.

  • Being agentic:
    Qwen2.5-VL acts as a visual agent that can reason and dynamically direct tools, capable of computer and phone use.

  • Understanding long videos and capturing events:
    Qwen2.5-VL can comprehend videos over 1 hour, and capture specific events by pinpointing relevant segments.

  • Capable of visual localization:
    Qwen2.5-VL can localize objects by generating bounding boxes or points, and provides stable JSON outputs for coordinates (see the parsing sketch after this list).

  • Generating structured outputs:
    Supports structured outputs for scanned invoices, forms, tables, etc., useful in finance and commerce.
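
For instance, a grounding query such as "Detect all birds and output their locations in JSON" typically returns a JSON list that can be parsed directly. Below is a minimal parsing sketch; the bbox_2d/label schema and the pixel-coordinate convention follow Qwen2.5-VL's published grounding examples and should be treated as an assumption here:

```python
import json

# Illustrative model reply to a grounding prompt; the bbox_2d/label schema is an
# assumption based on Qwen2.5-VL's documented grounding examples.
reply = '[{"bbox_2d": [135, 114, 1016, 672], "label": "bird"}]'

for obj in json.loads(reply):
    x1, y1, x2, y2 = obj["bbox_2d"]  # top-left and bottom-right corners in pixel coordinates
    print(obj["label"], (x1, y1), (x2, y2))
```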


Model Architecture Updates

Dynamic Resolution and Frame Rate Training for Video Understanding

  • Dynamic FPS sampling enables the model to understand videos at various frame rates.
  • Updated mRoPE in the time dimension with IDs aligned to absolute time, enabling the model to learn temporal sequences and speeds (illustrated by the sketch below).
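
To illustrate the absolute-time alignment (a simplified sketch, not the actual mRoPE implementation): temporal position IDs are tied to timestamps rather than frame indices, so the same moment maps to the same temporal ID regardless of the sampling rate. The 0.5 s-per-ID interval below is an assumed constant for illustration.

```python
# Simplified illustration of time-aligned temporal position IDs (not the real mRoPE code).
# Assumption: one temporal-ID step corresponds to a fixed real-time interval, here 0.5 s.
SECONDS_PER_ID = 0.5

def temporal_ids(num_frames: int, fps: float) -> list[int]:
    """Map sampled frames to temporal position IDs via their absolute timestamps."""
    return [round((i / fps) / SECONDS_PER_ID) for i in range(num_frames)]

# The same 4 seconds of video sampled at 1 FPS and at 2 FPS:
print(temporal_ids(4, fps=1.0))  # [0, 2, 4, 6]       -> one ID step per 0.5 s of video
print(temporal_ids(8, fps=2.0))  # [0, 1, 2, ..., 7]  -> denser frames, same time spacing
```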

Streamlined and Efficient Vision Encoder

  • Introduced window attention into ViT.
  • Enhanced ViT with SwiGLU and RMSNorm (see the sketch after this list).
  • Architecture aligns with the Qwen2.5 LLM structure.
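
For reference, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block (standard formulations, not the model's actual implementation; dimensions are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: scale by 1/RMS(x), no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(x W_gate) * (x W_up), then project back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 64)            # (batch, patch tokens, embed dim) -- arbitrary sizes
y = SwiGLU(64, 172)(RMSNorm(64)(x))   # norm followed by gated MLP (illustrative ordering)
print(y.shape)                        # torch.Size([2, 16, 64])
```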

Evaluation

Image Benchmark

| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU_val | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
| Hallusion Bench | 55.0 | 55.16 | - | 57.4 | 58.1 | 55.16 |
| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
| AI2D_TEST | 84.6 | 81.2 | - | 89.1 | 88.1 | 88.4 |
| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
| MMVet_turbo | 69.1 | 70.1 | - | 72.3 | 74.0 | 76.19 |
| OCRBench | 736 | 788 | - | 854 | 877 | 885 |
| OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |

Video Benchmark

| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
| --- | --- | --- | --- | --- | --- |
| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
| VideoMME w/ sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
| PerceptionTest_test | - | - | - | 68.0 | 73.2 |
| MLVU_M-Avg_dev | 64.6 | - | 75.7 | - | 74.6 |
| TempCompass_overall | 73.8 | - | - | - | 74.8 |

Agent Benchmark

| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot | 18.1 | 84.0 | 83.0 | - | - | 87.1 |
| ScreenSpot Pro | - | - | 17.1 | - | 1.6 | 43.6 |
| AITZ_EM | 35.3 | - | - | - | 72.8 | 83.2 |
| Android Control High_EM | - | - | - | 66.4 | 59.1 | 67.36 |
| Android Control Low_EM | - | - | - | 84.4 | 59.2 | 93.7 |
| AndroidWorld_SR | 34.5% | - | 27.9% | 26.1% | - | 35% |
| MobileMiniWob++_SR | - | - | - | 66% | - | 68% |
| OSWorld | - | - | 14.90 | 10.26 | - | 8.83 |

Requirements

Install the latest transformers (the Qwen2.5-VL code requires building from source; otherwise you may encounter `KeyError: 'qwen2_5_vl'`):

```bash
pip install git+https://github.com/huggingface/transformers accelerate
```

Install qwen-vl-utils; the [decord] extra is recommended for faster video loading:

```bash
pip install qwen-vl-utils[decord]==0.0.8
```

If decord fails to install (it may not be available on non-Linux platforms), fall back to the base package, which uses torchvision for video decoding:

```bash
pip install qwen-vl-utils
```

Quickstart

The snippet below shows how to use the chat model with transformers and qwen_vl_utils:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# The processor handles image/video preprocessing and chat templating.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, strip the prompt tokens, then decode.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
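
Video inputs follow the same pipeline; a content item of type "video" can reference a local video file. A minimal sketch reusing the model and processor from above (the video path is a placeholder, and the fps/max_pixels values are illustrative):

```python
# Minimal video-inference sketch; reuses `model` and `processor` from the quickstart.
# The file path is a placeholder; fps and max_pixels below are illustrative settings.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```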

Additional Usage Tips

  • Image inputs can be local file paths, URLs, or base64-encoded images.
  • Flexible Image Resolutions:
    You can adjust min_pixels and max_pixels, or specify an exact resized_height and resized_width per image (both approaches are shown below).

Example:

```python
# Bound the per-image visual token budget by constraining the pixel range.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=256*28*28, max_pixels=1280*28*28
)
```
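
Exact dimensions can also be requested per image inside the message itself. A short sketch (the height/width values are illustrative; dimensions are rounded to multiples of 28 internally):

```python
# Per-image resize override (illustrative values; rounded to multiples of 28 internally).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```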

Processing Long Texts

The model's default context length is 32,768 tokens. To handle inputs that exceed this, enable YaRN, a technique for length extrapolation, by adding the following to config.json:

```json
{
  "type": "yarn",
  "mrope_section": [16, 24, 24],
  "factor": 4,
  "original_max_position_embeddings": 32768
}
```

Note: Using YaRN may impact spatial and temporal localization performance.
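
If you prefer to script the change, a minimal sketch that patches a locally downloaded checkpoint's config.json (the path is a placeholder) might look like this:

```python
import json
from pathlib import Path

# Placeholder path to a local copy of the checkpoint.
config_path = Path("/path/to/Qwen2.5-VL-72B-Instruct/config.json")

config = json.loads(config_path.read_text())
# Merge in the YaRN settings shown above.
config.update({
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
})
config_path.write_text(json.dumps(config, indent=2))
```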


Citation

```bibtex
@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month = {January},
  year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and ...},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and ...},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```