Novita AI MiniMax Speech 2.5 Turbo Preview ASYNC API

MiniMax Speech-2.5-turbo-preview Async Long TTS

curl --request POST \
  --url https://api.novita.ai/v3/async/minimax-speech-2.5-turbo-preview \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "voice_setting": {
    "speed": 123,
    "vol": 123,
    "pitch": 123,
    "voice_id": "<string>",
    "emotion": "<string>",
    "text_normalization": true
  },
  "audio_setting": {
    "sample_rate": 123,
    "bitrate": 123,
    "format": "<string>",
    "channel": 123
  },
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  },
  "language_boost": "<string>",
  "voice_modify": {
    "pitch": 123,
    "intensity": 123,
    "timbre": 123,
    "sound_effects": "<string>"
  }
}
'

{
  "task_id": "<string>"
}

MiniMax's high-definition text-to-speech model. Compared with Speech 02, released in May, Speech 2.5 offers three major improvements: stronger multilingual expressiveness, more accurate voice replication, and broader coverage with 40 languages.

Best suited for long-form text-to-speech generation, such as entire books. Task queue times may be longer. For short sentence generation, voice chat, or online social scenarios, we recommend using synchronous TTS.

Request Headers

Content-Type

string

required

Supports: application/json

Authorization

string

required

Bearer authentication format, for example: Bearer {{API Key}}.
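As an illustration of the two required headers, here is a minimal Python sketch that builds the POST request without sending it. The helper name `build_tts_request` and the placeholder key are assumptions made for this example; only the URL, the header names, and the Bearer format come from this page.

```python
import json
import urllib.request

API_URL = "https://api.novita.ai/v3/async/minimax-speech-2.5-turbo-preview"


def build_tts_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the async TTS endpoint."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Placeholder key and a minimal body (hypothetical values for illustration).
request = build_tts_request(
    "YOUR_API_KEY",
    {
        "text": "Hello from a long-form narration.",
        "voice_setting": {"voice_id": "Calm_Woman"},
    },
)
```

Passing `request` to `urllib.request.urlopen` would submit the task; that step is left out here so the sketch runs without network access.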

Request Body

text

string

required

The text to be synthesized. Maximum length: 50,000 characters.
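For inputs longer than the limit, one option is to split the text on the client and submit one task per chunk. The sketch below is an assumption about how a caller might do this, not something the API provides: `split_text` is a hypothetical helper, and it counts Python characters, which may differ from how the service counts.

```python
MAX_TEXT_CHARS = 50_000  # limit stated for the `text` field


def split_text(text: str, limit: int = MAX_TEXT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring paragraph breaks."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit is cut at hard boundaries.
        while len(paragraph) > limit:
            chunks.append(paragraph[:limit])
            paragraph = paragraph[limit:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines keeps paragraphs intact where possible, so chunk boundaries tend to fall at natural pauses in the audio.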

voice_setting

object

required

speed

number

Range: [0.5, 2], default is 1.0. Controls the speech rate of the generated audio. Optional. Higher values result in faster speech.

vol

number

Range: (0, 10], default is 1.0. Controls the volume of the generated audio. Optional. Higher values result in louder audio.

pitch

number

default:0

Range: [-12, 12], default is 0. Controls the pitch of the generated audio. Optional. 0 means original timbre. The value must be an integer.

voice_id

string

The requested voice ID. Both system voice IDs and cloned voice IDs are supported. The available system voice IDs are as follows:

Wise_Woman
Friendly_Person
Inspirational_girl
Deep_Voice_Man
Calm_Woman
Casual_Guy
Lively_Girl
Patient_Man
Young_Knight
Determined_Man
Lovely_Girl
Decent_Boy
Imposing_Manner
Elegant_Man
Abbess
Sweet_Girl_2
Exuberant_Girl

emotion

string

Controls the emotion of the synthesized speech. Seven emotions are currently supported. Allowed values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]

text_normalization

bool

default:"false"

Enables English text normalization, which improves how numbers are read aloud at the cost of a slight increase in latency. If not provided, the default is false.
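The ranges above can be checked on the client before a task is submitted. The following is a minimal sketch under that assumption; `make_voice_setting` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
ALLOWED_EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def make_voice_setting(voice_id, speed=1.0, vol=1.0, pitch=0,
                       emotion=None, text_normalization=False):
    """Return a voice_setting dict after checking the documented ranges."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(pitch, bool) or not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer in [-12, 12]")
    if emotion is not None and emotion not in ALLOWED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    setting = {
        "voice_id": voice_id,
        "speed": speed,
        "vol": vol,
        "pitch": pitch,
        "text_normalization": text_normalization,
    }
    # emotion is only included when the caller asks for one.
    if emotion is not None:
        setting["emotion"] = emotion
    return setting
```

The voice ID is not checked against the system voice list because cloned voice IDs are also valid.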

audio_setting

object

sample_rate

number

default:32000

Allowed values: [8000, 16000, 22050, 24000, 32000, 44100]. The sample rate of the generated audio. Optional, default is 32000.

bitrate

number

default:128000

Allowed values: [32000, 64000, 128000, 256000]. The bitrate of the generated audio. Optional, default is 128000. This parameter only applies to the mp3 audio format.

format

string

default:"mp3"

The audio format of the output. Default is mp3. Options: mp3, pcm, flac, wav. wav is only supported for non-streaming output.

channel

number

default:1

Number of audio channels. Default is 1 (mono). Options: 1 (mono), 2 (stereo).
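Because these fields take enumerated values, a small client-side check can catch typos early. This sketch assumes a hypothetical helper `make_audio_setting`; leaving out `bitrate` for non-mp3 formats is a design choice of the example, based on the note that the parameter only applies to mp3.

```python
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac", "wav"}


def make_audio_setting(fmt="mp3", sample_rate=32000, bitrate=128000, channel=1):
    """Return an audio_setting dict restricted to the documented values."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    setting = {"format": fmt, "sample_rate": sample_rate, "channel": channel}
    # bitrate is documented as applying to mp3 only, so it is sent only then.
    if fmt == "mp3":
        if bitrate not in BITRATES:
            raise ValueError(f"unsupported bitrate: {bitrate}")
        setting["bitrate"] = bitrate
    return setting
```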

pronunciation_dict

object

tone

list

Replacement rules for text and symbols whose pronunciation needs manual handling. Each entry adjusts a tone or substitutes another pronunciation, using the following format: ["omg/oh my god"]. For Chinese text, tones are written as numbers: 1 for the first tone (high), 2 for the second tone (rising), 3 for the third tone (low/dipping), 4 for the fourth tone (falling), and 5 for the fifth tone (neutral).
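To show the shape of the `tone` list, the sketch below turns a mapping of written forms to spoken forms into "written/spoken" entries. `make_pronunciation_dict` is a hypothetical helper, and rejecting "/" inside an entry is an assumption of this example, since the page does not say how the separator could be escaped.

```python
def make_pronunciation_dict(replacements: dict[str, str]) -> dict:
    """Build a pronunciation_dict whose tone entries look like 'omg/oh my god'."""
    tone = []
    for written, spoken in replacements.items():
        # Assumption: '/' separates the two halves, so neither half may contain it.
        if "/" in written or "/" in spoken:
            raise ValueError("'/' is the separator and cannot appear in an entry")
        tone.append(f"{written}/{spoken}")
    return {"tone": tone}


pronunciation_dict = make_pronunciation_dict({"omg": "oh my god"})
```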

language_boost

string

default:"null"

Enhances recognition of specified minor languages and dialects. Setting this parameter can improve speech quality for the specified language or dialect. If the language is not known in advance, set it to "auto" and the model will determine the language automatically. Supported values:

'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'Bulgarian', 'Danish', 'Hebrew', 'Malay', 'Persian', 'Slovak', 'Swedish', 'Croatian', 'Filipino', 'Hungarian', 'Norwegian', 'Slovenian', 'Catalan', 'Nynorsk', 'Tamil', 'Afrikaans', 'auto'

voice_modify

object

Voice FX settings. Supported audio formats for this parameter: mp3, wav, flac

pitch

integer

Pitch adjustment (darker/brighter). Range: [-100, 100]. Values closer to -100 generate a deeper (darker) voice; values closer to 100 produce a brighter voice.

intensity

integer

Intensity adjustment (powerful/soft). Range: [-100, 100]. Values closer to -100 generate a more powerful sound; values closer to 100 result in a softer sound.

timbre

integer

Timbre adjustment (magnetic/crisp). Range: [-100, 100]. Values closer to -100 make the voice more full/magnetic; values closer to 100 make it crisper.

sound_effects

string

Sound effect setting (only one may be selected per request). Valid values:

spacious_echo (large space echo)
auditorium_echo (auditorium broadcast)
lofi_telephone (telephone distortion)
robotic (robotic effect)
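The `voice_modify` constraints (integer ranges, a single sound effect, and the restricted set of output formats) can be combined into one client-side check. This is a sketch under stated assumptions: `make_voice_modify` is a hypothetical helper, and it takes the output format as an argument only to enforce the format restriction noted above.

```python
VOICE_MODIFY_FORMATS = {"mp3", "wav", "flac"}
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def make_voice_modify(audio_format, pitch=None, intensity=None, timbre=None,
                      sound_effects=None):
    """Return a voice_modify dict containing only the fields that were set."""
    if audio_format not in VOICE_MODIFY_FORMATS:
        raise ValueError("voice_modify is documented for mp3, wav, and flac output only")
    modify = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if value is None:
            continue
        # Each adjustment must be an integer in [-100, 100]; bool is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        modify[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unsupported sound effect: {sound_effects!r}")
        modify["sound_effects"] = sound_effects
    return modify
```

For example, `make_voice_modify("mp3", pitch=-30, sound_effects="robotic")` yields a deeper voice with the robotic effect applied.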

Response

task_id

string

Use the task_id to request the Task Result API to retrieve the generated outputs.
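A caller will typically pull `task_id` out of the JSON response before querying the Task Result API. The sketch below covers only that parsing step; `extract_task_id` is a hypothetical helper, and the example response string is made up for illustration.

```python
import json


def extract_task_id(response_body: str) -> str:
    """Parse the submit response and return its task_id, or raise ValueError."""
    data = json.loads(response_body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object in the response")
    task_id = data.get("task_id")
    if not isinstance(task_id, str) or not task_id:
        raise ValueError("response does not contain a task_id string")
    return task_id


# Made-up response body for illustration.
task_id = extract_task_id('{"task_id": "example-task-id"}')
```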
