MiniMax asynchronous text-to-speech API. It supports voice, emotion, speed, and other parameter settings. Inline text is limited to 50,000 characters; file input supports up to 100,000 characters.
This is an asynchronous API: the request returns only a task_id. Use the task_id to call the Task Result API and retrieve the speech synthesis results.
Text file ID for audio synthesis. A single file must contain fewer than 100,000 characters. Supported file formats: txt, zip. Either text or text_file_id is required; the format is validated automatically.
• txt file: Length limit is fewer than 100,000 characters. Supports custom pauses using the <#x#> tag, where x is the pause duration in seconds, range [0.01, 99.99], with up to 2 decimal places. A pause must be placed between two pronounceable text segments, and multiple pause tags cannot be used consecutively.
• zip file:
  • The compressed package must contain txt or json files of the same format.
  • json file format: Supports three fields, [title, content, extra], representing the title, body, and additional information. If all three fields exist, 3 groups of results are produced, 9 files in total, stored in one folder. If a field does not exist or is empty, no corresponding result is generated.
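The pause-tag rules above can be checked before uploading a txt file. The following is a minimal sketch, not part of the API: the function name check_pause_tags is hypothetical, and "pronounceable text" is approximated here as any non-whitespace text, which is an assumption.

```python
import re

# Any <#...#> tag, so malformed durations are reported instead of silently ignored.
TAG = re.compile(r"<#([^#<>]*)#>")
# A duration is a plain decimal number with at most 2 decimal places.
DURATION = re.compile(r"\d+(\.\d{1,2})?")


def check_pause_tags(text):
    """Return a list of problems found with <#x#> pause tags in text."""
    problems = []
    matches = list(TAG.finditer(text))
    prev_end = 0
    for index, match in enumerate(matches):
        raw = match.group(1)
        if not DURATION.fullmatch(raw):
            problems.append(f"{match.group(0)}: not a number with at most 2 decimal places")
        elif not 0.01 <= float(raw) <= 99.99:
            problems.append(f"{match.group(0)}: duration outside [0.01, 99.99]")
        # Assumption: any non-whitespace text counts as a pronounceable segment.
        if not text[prev_end:match.start()].strip():
            reason = "no text before first pause tag" if index == 0 else "consecutive pause tags"
            problems.append(f"{match.group(0)}: {reason}")
        prev_end = match.end()
    if matches and not text[prev_end:].strip():
        problems.append(f"{matches[-1].group(0)}: no text after last pause tag")
    return problems
```

An empty list means no rule violation was detected by this local check; the service may still apply stricter validation.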
Pitch adjustment (deep/bright). Value range: [-100, 100]. Values closer to -100 produce a deeper voice; values closer to 100 produce a brighter voice.
Timbre adjustment (rich/crisp). Value range: [-100, 100]. Values closer to -100 produce a richer voice; values closer to 100 produce a crisper voice.
Intensity adjustment (powerful/soft). Value range: [-100, 100]. Values closer to -100 produce a more powerful voice; values closer to 100 produce a softer voice.
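All three adjustments share the same [-100, 100] range, so a single local check covers them. This is a rough sketch: the keyword names pitch, timbre, and intensity and the default of 0 are assumptions made for illustration and are not taken from the API schema.

```python
def out_of_range(pitch=0, timbre=0, intensity=0):
    """Return the names of adjustments that fall outside [-100, 100]."""
    # Assumption: the names and the default of 0 are illustrative only.
    values = {"pitch": pitch, "timbre": timbre, "intensity": intensity}
    return [name for name, value in values.items() if not -100 <= value <= 100]
```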
Controls the emotion of the synthesized speech. Nine options are available: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm", "fluent", "whisper"]
• The model automatically matches an appropriate emotion to the input text, so there is usually no need to specify one manually
• This parameter only works for the speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, and speech-01-turbo models
• The options fluent and whisper only work for the speech-2.6-turbo and speech-2.6-hd models
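The model restrictions above can be expressed as a small lookup. This is a sketch only: the function name emotion_supported is hypothetical, and the sets simply restate the lists given in this section.

```python
# Models that accept the emotion parameter, as listed in this section.
EMOTION_MODELS = {
    "speech-2.6-hd", "speech-2.6-turbo",
    "speech-02-hd", "speech-02-turbo",
    "speech-01-hd", "speech-01-turbo",
}
# Models that additionally accept "fluent" and "whisper".
EXTENDED_MODELS = {"speech-2.6-hd", "speech-2.6-turbo"}

BASE_EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm"}
EXTENDED_EMOTIONS = {"fluent", "whisper"}


def emotion_supported(model, emotion):
    """Return True if this section says the model accepts the given emotion."""
    if emotion in EXTENDED_EMOTIONS:
        return model in EXTENDED_MODELS
    if emotion in BASE_EMOTIONS:
        return model in EMOTION_MODELS
    return False
```

For example, emotion_supported("speech-02-hd", "whisper") is False, because whisper is limited to the speech-2.6 models.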
Voice ID for audio synthesis. To use a mixed voice, set the timber_weights parameter and leave this field empty. System voices, cloned voices, and text-generated voices are supported. Some of the latest system voice IDs are listed below:
• Chinese:
  • moss_audio_ce44fc67-7ce3-11f0-8de5-96e35d26fb85
  • moss_audio_aaa1346a-7ce7-11f0-8e61-2e6e3c7ee85d
  • Chinese (Mandarin)_Lyrical_Voice
  • Chinese (Mandarin)_HK_Flight_Attendant
• English:
  • English_Graceful_Lady
  • English_Insightful_Speaker
  • English_radiant_girl
  • English_Persuasive_Man
  • moss_audio_6dc281eb-713c-11f0-a447-9613c873494c
  • moss_audio_570551b1-735c-11f0-b236-0adeeecad052
  • moss_audio_ad5baf92-735f-11f0-8263-fe5a2fe98ec8
  • English_Lucky_Robot
• Japanese:
  • Japanese_Whisper_Belle
  • moss_audio_24875c4a-7be4-11f0-9359-4e72c55db738
  • moss_audio_7f4ee608-78ea-11f0-bb73-1e2a4cfcd245
  • moss_audio_c1a6a3ac-7be6-11f0-8e8e-36b92fbb4f95
Controls whether an audio rhythm identifier is added at the end of the synthesized audio. Default: False. This parameter is only valid for non-streaming synthesis.
Whether to enhance recognition of the specified minor languages and dialects. Default is null; set it to auto to let the model decide automatically.
Optional values: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto
Defines pronunciation or replacement rules for special characters or symbols. For Chinese text, tones are represented by numbers:
1st tone = 1, 2nd tone = 2, 3rd tone = 3, 4th tone = 4, neutral tone = 5
Example:
["omg/oh my god"]
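The example above suggests that each rule is a single string of the form "original/replacement". A minimal sketch of building such a list follows; the helper name make_rule is hypothetical, and rejecting "/" in the original text is an assumption made to keep the rule unambiguous, not a documented constraint.

```python
def make_rule(source, replacement):
    """Build one "source/replacement" rule string."""
    source, replacement = source.strip(), replacement.strip()
    if not source or not replacement:
        raise ValueError("both sides of a rule must be non-empty")
    # Assumption: a "/" in the source text would make the rule ambiguous.
    if "/" in source:
        raise ValueError("source text must not contain '/'")
    return f"{source}/{replacement}"


rules = [make_rule("omg", "oh my god")]
```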
Corresponding audio file ID returned after task creation. This field is not returned when the request fails.
Note: The download URL is valid for 9 hours (32400 seconds) from generation. After it expires, the file becomes invalid and the generated information is lost, so download the file in time.
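Because the download URL expires 9 hours after generation, a client may want to track the deadline locally. The sketch below assumes the caller has recorded the generation time itself; the function names are hypothetical and nothing here is returned by the API.

```python
from datetime import datetime, timedelta, timezone

# 9 hours, as stated in the note above.
URL_TTL = timedelta(seconds=32400)


def url_expires_at(generated_at):
    """Return the moment the download URL stops being valid."""
    return generated_at + URL_TTL


def is_url_expired(generated_at, now=None):
    """Return True once the 9-hour validity window has passed."""
    # Assumption: generated_at is a timezone-aware timestamp recorded by the caller.
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= url_expires_at(generated_at)
```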