Wan 2.7 Reference-to-Video model with multimodal input support (text, image, video). Generates single-character performance videos or multi-character interaction videos from reference characters. Supports intelligent multi-shot generation. Resolutions: 720P and 1080P. Duration: 2 to 10 seconds, billed per second. Output includes audio by default.
This is an asynchronous API: the response contains only a task_id. Pass that task_id to the Task Result API to retrieve the video generation results.
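The submit-then-poll flow described above could be sketched as follows. This is a minimal illustration, not the service's client: `wait_for_task`, the `fetch_result` callable, the `status` field, and the `succeeded`/`failed` values are all hypothetical placeholders, since the Task Result API's actual response shape is not given here.

```python
import time
from typing import Any, Callable, Dict


def wait_for_task(
    task_id: str,
    fetch_result: Callable[[str], Dict[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_result(task_id) until the task reaches a terminal status.

    fetch_result is a caller-supplied function that queries the Task Result
    API and returns the parsed response as a dict (hypothetical shape).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_result(task_id)
        # ASSUMPTION: the response has a "status" field whose terminal
        # values are "succeeded" and "failed". Adjust to the real schema.
        if result.get("status") in ("succeeded", "failed"):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout_s}s")
        sleep(interval_s)
```

Passing the fetcher in as a callable keeps the polling logic independent of any particular HTTP library or endpoint URL, and makes it easy to exercise with a stub.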
Reference media array for character appearance, motion, and voice extraction. Items map to character1, character2, etc. in order. Images: 0-5; videos: 0-3; total <= 5. Image formats: JPEG, JPG, PNG, BMP, WEBP; resolution [240, 8000] px; max 10 MB. Video formats: MP4, MOV; duration 1-30 s; max 100 MB. Audio formats: MP3, WAV, FLAC; duration 3-30 s. Array length: 1 - 5
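The count limits above can be checked client-side before submitting a request. The sketch below is illustrative only: `validate_media` is a hypothetical helper, and the item shape (a dict with a `type` key) is an assumption. How `first_frame` items count against the per-type limits is not stated, so here they count only toward the array length.

```python
from typing import Dict, List

MAX_IMAGES = 5  # Images: 0-5
MAX_VIDEOS = 3  # Videos: 0-3
MAX_TOTAL = 5   # Array length: 1-5


def validate_media(media: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the counts look valid.

    ASSUMPTION: each item is a dict whose "type" key holds the media type.
    """
    problems: List[str] = []
    images = sum(1 for m in media if m.get("type") == "reference_image")
    videos = sum(1 for m in media if m.get("type") == "reference_video")

    if not 1 <= len(media) <= MAX_TOTAL:
        problems.append(f"array length {len(media)} is outside 1-{MAX_TOTAL}")
    if images > MAX_IMAGES:
        problems.append(f"{images} reference images exceeds {MAX_IMAGES}")
    if videos > MAX_VIDEOS:
        problems.append(f"{videos} reference videos exceeds {MAX_VIDEOS}")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every violated limit at once. File format, size, and duration checks are left out because they require inspecting the media itself.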
Media type. reference_image: reference image for character appearance; reference_video: reference video for character motion and appearance; first_frame: first-frame image that controls the video's starting frame. Allowed values: reference_image, reference_video, first_frame
Text prompt describing the desired video content. Use character1, character2, etc. to reference characters from the media array in order. Each reference (video or image) contains a single character. Supports Chinese and English, max 1500 characters. Length limit: 0 - 1500
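A prompt can be sanity-checked against these rules before submission. The following is a rough sketch under stated assumptions: `check_prompt` is a hypothetical helper, `reference_count` is the number of reference items the caller supplied, and the length check uses Python's `len`, which counts Unicode code points and may not match how the service counts characters.

```python
import re
from typing import List

MAX_PROMPT_CHARS = 1500  # Length limit: 0-1500


def check_prompt(prompt: str, reference_count: int) -> List[str]:
    """Return a list of problems found in the prompt; empty means none."""
    problems: List[str] = []
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append(
            f"prompt has {len(prompt)} characters, max is {MAX_PROMPT_CHARS}"
        )
    # Collect every characterN placeholder and verify that N points at an
    # existing reference item (placeholders are 1-based, in array order).
    indices = {int(n) for n in re.findall(r"character(\d+)", prompt)}
    for n in sorted(indices):
        if not 1 <= n <= reference_count:
            problems.append(f"character{n} has no matching media item")
    return problems
```

For example, `check_prompt("character1 waves at character2", 2)` returns an empty list, while a prompt mentioning character3 with only two references is flagged.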