Text to synthesize into speech. Must be fewer than 10,000 characters; for text longer than 3,000 characters, streaming output is recommended. Supports paragraph breaks (newline), pause control (<#x#> tag), and interjection tags such as (laughs) and (coughs); interjection tags are supported only by speech-2.8-hd/turbo.
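The length rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the function name and constants are hypothetical, and it assumes the service counts characters the same way Python's `len` does, which the description does not confirm.

```python
MAX_TEXT_CHARS = 10_000      # stated limit: text must be under 10,000 characters
STREAMING_THRESHOLD = 3_000  # above this length, streaming output is recommended


def streaming_recommended(text: str) -> bool:
    """Return True when streaming output is recommended for this text.

    Raises ValueError when the text is not under the stated length limit.
    Hypothetical helper; character counting on the server may differ.
    """
    if len(text) >= MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; it must be under {MAX_TEXT_CHARS}"
        )
    return len(text) > STREAMING_THRESHOLD
```

A caller could use the return value to decide whether to enable streaming for a given request.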
Pitch adjustment (deep/bright). Value range: [-100, 100]. Values closer to -100 produce a deeper voice; values closer to 100 produce a brighter voice.
Timbre adjustment (rich/crisp). Value range: [-100, 100]. Values closer to -100 produce a richer voice; values closer to 100 produce a crisper voice.
Intensity adjustment (powerful/soft). Value range: [-100, 100]. Values closer to -100 produce a more powerful voice; values closer to 100 produce a softer voice.
Sound effect setting; only one can be selected at a time. Optional values: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic).
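The pitch, timbre, intensity, and sound effect constraints above could be validated together before building a request. This is a sketch under assumptions: the dictionary keys (`pitch`, `timbre`, `intensity`, `sound_effect`) and the default of 0 are hypothetical, since the descriptions do not name the fields or their defaults.

```python
ALLOWED_SOUND_EFFECTS = {
    "spacious_echo",
    "auditorium_echo",
    "lofi_telephone",
    "robotic",
}


def build_voice_adjustments(pitch=0, timbre=0, intensity=0, sound_effect=None):
    """Validate the adjustment values and return them as a dict.

    Key names and the default of 0 are assumptions for illustration only.
    """
    settings = {"pitch": pitch, "timbre": timbre, "intensity": intensity}
    for name, value in settings.items():
        # Each adjustment must lie in the documented range [-100, 100].
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be in [-100, 100], got {value}")
    if sound_effect is not None:
        # Only one effect can be selected, so a single string is accepted.
        if sound_effect not in ALLOWED_SOUND_EFFECTS:
            raise ValueError(f"unknown sound effect: {sound_effect!r}")
        settings["sound_effect"] = sound_effect
    return settings
```

For example, `build_voice_adjustments(pitch=-40, sound_effect="robotic")` would describe a deeper voice with the electronic effect.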
Controls constant bitrate (CBR) encoding. When set to true, audio is encoded at a constant bitrate. Note: this parameter only takes effect when streaming output is enabled and the audio format is mp3.
Controls the output format. Optional values: url, hex; default is hex. This parameter is only valid in non-streaming scenarios. A returned URL is valid for 24 hours.
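When the output format is hex, the audio arrives as a hexadecimal string that has to be decoded into raw bytes before it can be saved or played. A minimal sketch follows; the function name is hypothetical, and how the hex string is extracted from the response is not covered here because the response layout is not described above.

```python
def hex_audio_to_bytes(hex_audio: str) -> bytes:
    """Decode a hex-encoded audio payload into raw bytes.

    bytes.fromhex raises ValueError if the string is not valid hexadecimal.
    """
    return bytes.fromhex(hex_audio)
```

The decoded bytes can then be written to a file opened in binary mode, for example `open("speech.mp3", "wb").write(hex_audio_to_bytes(payload))`, assuming the requested audio format was mp3.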
Controls the emotion of synthesized speech. Optional values (nine emotions): happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper. The model automatically matches an appropriate emotion to the input text.
Voice ID for audio synthesis. To use a mixed voice, set the timber_weights parameter and leave this empty. Supports system voices, cloned voices, and text-generated voices.
Whether to enable Chinese/English text normalization, which can improve results when reading numbers but slightly increases latency. Default is false.
Controls whether to add audio rhythm identifier at the end of synthesized audio, default is false. This parameter is only valid for non-streaming synthesis
Whether to enhance recognition of specified minor languages and dialects. Default is null; set to auto to let the model decide automatically. Optional values: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto
Controls whether the complete concatenated audio hex data is left out of the last chunk. Default is false, meaning the last chunk contains the complete concatenated audio hex data.
Weight of each voice in the mix; must be set together with voice_id. Value range: [1, 100]. Up to 4 voices can be mixed. A higher weight makes the result more similar to that voice.
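The mixing constraints above (at most 4 voices, each weight in [1, 100], each weight paired with a voice_id) can be checked before sending a request. This sketch assumes each mix entry is a dict with `voice_id` and `weight` keys; that shape and the voice IDs in the usage note are hypothetical, as the description does not show the exact structure.

```python
MAX_MIXED_VOICES = 4


def validate_voice_mix(mix):
    """Check a list of {"voice_id": str, "weight": int} entries.

    The entry shape is an assumption made for illustration.
    Returns the list unchanged when every constraint holds.
    """
    if not 1 <= len(mix) <= MAX_MIXED_VOICES:
        raise ValueError(f"a mix needs 1 to {MAX_MIXED_VOICES} voices, got {len(mix)}")
    for entry in mix:
        # Every weight must be set together with a voice_id.
        if not entry.get("voice_id"):
            raise ValueError("each mix entry needs a voice_id")
        weight = entry.get("weight")
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight must be an integer in [1, 100], got {weight!r}")
    return mix
```

For instance, `validate_voice_mix([{"voice_id": "voice_a", "weight": 70}, {"voice_id": "voice_b", "weight": 30}])` passes, while a five-entry list or a weight of 0 raises `ValueError`.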
Controls whether to enable the subtitle service; default is false. This parameter is only valid in non-streaming output scenarios, and only for the speech-2.6-hd, speech-2.6-turbo, speech-02-turbo, speech-02-hd, speech-01-turbo, and speech-01-hd models.
Defines pronunciation or replacement rules for special characters or symbols. For Chinese text, tones are represented by numbers: 1st tone = 1, 2nd tone = 2, 3rd tone = 3, 4th tone = 4, neutral tone = 5. Example: ["omg/oh my god"]
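Judging only from the single example above, each rule appears to be a string of the form "source/replacement". The sketch below builds such a list from a mapping. The helper name is hypothetical, and the "/" separator convention is inferred from that one example rather than confirmed by the description.

```python
def make_pronunciation_rules(replacements):
    """Turn {"source": "replacement"} pairs into "source/replacement" strings.

    Assumes "/" separates the two halves, as in the example "omg/oh my god".
    """
    rules = []
    for source, replacement in replacements.items():
        # A "/" inside either half would make the rule ambiguous.
        if "/" in source or "/" in replacement:
            raise ValueError(f"rule parts must not contain '/': {source!r}")
        rules.append(f"{source}/{replacement}")
    return rules
```

Calling `make_pronunciation_rules({"omg": "oh my god"})` reproduces the example list from the description.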
Ratio of invalid characters in the input. If invalid characters make up no more than 10% of the text (inclusive), audio is generated normally and this ratio is returned; if they exceed 10%, an error is returned.