Novita AI MiniMax Speech 2.5 HD Preview API | High-Definition Text-to-Speech

MiniMax’s high-definition text-to-speech model, Compared to Speech 02 released in May, Speech 2.5 has three major breakthroughs: stronger multilingual expressiveness, more accurate voice replication, and broader coverage with 40 languages.

Best suited for short sentence generation, voice chat, and online social scenarios. Processing is fast, but the text length is limited to less than 10,000 characters. For long-form text, we recommend using asynchronous TTS synthesis.

Request Headers

Content-Type

string

required

Supports: application/json

Authorization

string

required

Bearer authentication format, for example: Bearer {{API Key}}.

Request Body

text

string

required

The text to be synthesized. The length must be less than 10,000 characters. Use line breaks to separate paragraphs.
To control the pause duration between speech segments, insert <#x#> between words or sentences, where x is the pause duration in seconds (supports 0.01–99.99, up to two decimal places).
Custom speech pauses between text segments are supported, allowing you to control the timing of pauses in the generated audio.
Note: Pause markers must be placed between two segments that can be pronounced, and multiple consecutive pause markers are not allowed.

voice_setting

object

Show properties

speed

float

default:"1.0"

Range: [0.5, 2], default is 1.0Controls the speech rate of the generated audio. Optional. Higher values result in faster speech.

vol

float

default:"1.0"

Range: (0, 10], default is 1.0Controls the volume of the generated audio. Optional. Higher values result in louder audio.

pitch

int

default:"0"

Range: [-12, 12], default is 0Controls the pitch of the generated audio. Optional. 0 means original timbre. The value must be an integer.

voice_id

string

The requested timbre ID. Required (choose either this or timbre_weights).Supports both system voices (ID) and cloned voices (ID). The available system voice IDs are as follows:

Wise_Woman
Friendly_Person
Inspirational_girl
Deep_Voice_Man
Calm_Woman
Casual_Guy
Lively_Girl
Patient_Man
Young_Knight
Determined_Man
Lovely_Girl
Decent_Boy
Imposing_Manner
Elegant_Man
Abbess
Sweet_Girl_2
Exuberant_Girl

emotion

string

Controls the emotion of the synthesized speech.Currently supports 7 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral.Allowed values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]

text_normalization

bool

default:"false"

This parameter supports English text normalization, which improves performance in number-reading scenarios, but this comes at the cost of a slight increase in latency. If not provided, the default is false.

audio_setting

object

Show properties

sample_rate

number

default:32000

Range: [8000, 16000, 22050, 24000, 32000, 44100]The sample rate of the generated audio. Optional, default is 32000.

bitrate

number

default:128000

Range: [32000, 64000, 128000, 256000]The bitrate of the generated audio. Optional, default is 128000. This parameter only applies to mp3 audio format.

format

string

default:"mp3"

The audio format of the output. Default is mp3. Options: mp3, pcm, flac, wav. wav is only supported for non-streaming output.

channel

number

default:1

Number of audio channels. Default is 1 (mono). Options:1: mono2: stereo

pronunciation_dict

object

Show properties

tone

list

Replacement of text, symbols and corresponding pronunciations that require manual handling. Replace the pronunciation (adjust the tone/replace the pronunciation of other characters) using the following format:[“omg/oh my god”]For Chinese texts, tones are replaced by numbers, with 1 for the first tone (high), 2 for the second tone (rising), 3 for the third tone (low/dipping), 4 for the fourth tone (falling), and 5 for the fifth tone (neutral).

timbre_weights

object[]

Required if voice_id is not provided (choose one of the two).

Show properties

voice_id

string

The requested timbre ID. Must be provided together with the weight parameter.

weight

number

Range: [1, 100]The weight, must be provided together with voice_id. Up to 4 timbres can be mixed. The value must be an integer. The higher the proportion for a single timbre, the more the synthesized voice will resemble it.

stream

boolean

default:"false"

Whether to enable streaming. Default is false, i.e., streaming is disabled.

language_boost

string

default:"null"

Enhances recognition of specified minor languages and dialects. Setting this parameter can improve speech performance in the specified language/dialect scenarios. If the minor language type is not clear, you can set it to “auto” and the model will automatically determine the language type. Supported values:

'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'

output_format

string

default:"hex"

Controls the output format of the result. Optional values are url and hex. The default is hex. This parameter only takes effect in non-streaming scenarios; in streaming mode, only hex format is supported. The returned URL is valid for 24 hours.

Response

audio

string

The synthesized audio segment, encoded in hex and generated in the format specified by audio_setting.format (mp3/pcm/flac). The return format is determined by the output_format parameter. When stream is true, only hex format is supported.

status

number

The current status of the audio stream. Returned only when stream is true. 1 indicates synthesis in progress, 2 indicates synthesis completed.

Basic

Model APIs

GPUs

MiniMax Speech-2.5-hd-preview Text to Speech

Request Headers

Request Body

Response

Basic

Model APIs

GPUs

​Request Headers

​Request Body

​Response

Request Headers

Request Body

Response