POST /v3/minimax-speech-02-turbo
MiniMax Speech-02-turbo Text to Speech
curl --request POST \
  --url https://api.novita.ai/v3/minimax-speech-02-turbo \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: <content-type>' \
  --data '{
  "text": "<string>",
  "voice_setting": {
    "speed": 123,
    "vol": 123,
    "pitch": 123,
    "voice_id": "<string>",
    "emotion": "<string>",
    "english_normalization": true
  },
  "audio_setting": {
    "sample_rate": 123,
    "bitrate": 123,
    "format": "<string>",
    "channel": 123
  },
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  },
  "timber_weights": [
    {
      "voice_id": "<string>",
      "weight": 123
    }
  ],
  "stream": true,
  "language_boost": "<string>",
  "output_format": "<string>"
}'
{
  "audio": "<string>",
  "status": 123
}

This API supports synchronous text-to-speech (TTS) generation, with a maximum input length of 10,000 characters per request. It offers over 100 system and cloned voices, with customizable parameters including volume, pitch, speed, and output format. The API supports proportional voice mixing, fixed interval control, and multiple audio formats such as mp3, pcm, flac, and wav. Streaming output is also supported.

When submitting a long-text TTS request, please note that the returned audio URL is valid for 24 hours from the time it is generated. Be sure to download the audio within this period.

This endpoint is best suited to short sentence generation, voice chat, and online social scenarios. Processing is fast, but the input text must be shorter than 10,000 characters. For long-form text, we recommend asynchronous TTS synthesis instead.

Request Headers

Content-Type
string
required

Enum: application/json

Authorization
string
required

Bearer authentication format, for example: Bearer {{API Key}}.

Request Body

text
string
required

The text to be synthesized. The length must be less than 10,000 characters. Use line breaks to separate paragraphs.
To control the pause between speech segments, insert <#x#> between words or sentences, where x is the pause duration in seconds (0.01–99.99, with at most two decimal places).
Note: A pause marker must be placed between two segments that can be pronounced, and consecutive pause markers are not allowed.
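The pause-marker rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper names pause and check_pauses are hypothetical, and treating markers separated only by whitespace as "consecutive" is an assumption about how the service interprets that rule.

```python
import re

# Any <#...#> marker, plus the documented shape of its value:
# one or two integer digits and at most two decimal places.
MARKER_RE = re.compile(r"<#(.*?)#>")
VALUE_RE = re.compile(r"\d{1,2}(\.\d{1,2})?")


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.50#>, enforcing the 0.01-99.99 range."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{seconds:.2f}#>"


def check_pauses(text: str) -> None:
    """Raise ValueError if the text breaks the documented pause-marker rules."""
    for match in MARKER_RE.finditer(text):
        raw = match.group(1)
        if not VALUE_RE.fullmatch(raw) or not 0.01 <= float(raw) <= 99.99:
            raise ValueError(f"invalid pause marker: {match.group(0)}")
    # Assumption: markers separated only by whitespace count as consecutive.
    if re.search(r"#>\s*<#", text):
        raise ValueError("consecutive pause markers are not allowed")
    # A marker at the very start or end has no pronounceable segment on one side.
    stripped = text.strip()
    if stripped.startswith("<#") or stripped.endswith("#>"):
        raise ValueError("a pause marker must sit between two pronounceable segments")
```

For example, "Hello," + pause(1.25) + "world." passes check_pauses, while a text that begins with a marker is rejected.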

voice_setting
object

audio_setting
object

pronunciation_dict
object

timber_weights
object[]

Required if voice_setting.voice_id is not provided. Specify one or the other.

stream
boolean
default:"false"

Whether to enable streaming. Default is false, i.e., streaming is disabled.

language_boost
string
default:"null"

Enhances recognition of the specified minority language or dialect, which can improve speech quality for text in that language. If the language is not known in advance, set this parameter to “auto” and the model will determine the language automatically. Supported values:

'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'

output_format
string
default:"hex"

Controls the output format of the result. Optional values are url and hex. The default is hex. This parameter only takes effect in non-streaming scenarios; in streaming mode, only hex format is supported. The returned URL is valid for 24 hours.
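The constraints described for the request body can be gathered into a small payload builder. This is a sketch under stated assumptions: build_payload is a hypothetical helper, it reads "choose one of the two" as exactly one of voice_id or timber_weights, and it omits output_format in streaming mode because the reference says the parameter has no effect there.

```python
from typing import Optional

MAX_CHARS = 10_000  # the text must be shorter than this, per the reference


def build_payload(
    text: str,
    voice_id: Optional[str] = None,
    timber_weights: Optional[list] = None,
    stream: bool = False,
    output_format: str = "hex",
) -> dict:
    """Build a request body dict that respects the documented constraints."""
    if len(text) >= MAX_CHARS:
        raise ValueError("text must be shorter than 10,000 characters")
    # Assumption: exactly one of voice_id / timber_weights must be given.
    if (voice_id is None) == (timber_weights is None):
        raise ValueError("provide exactly one of voice_id or timber_weights")
    if output_format not in ("hex", "url"):
        raise ValueError("output_format must be 'hex' or 'url'")

    payload = {"text": text, "stream": stream}
    if voice_id is not None:
        payload["voice_setting"] = {"voice_id": voice_id}
    else:
        payload["timber_weights"] = timber_weights
    # output_format only takes effect when streaming is disabled.
    if not stream:
        payload["output_format"] = output_format
    return payload
```

Calling build_payload("Hello", voice_id="Wise_Woman", output_format="url") yields a body that can be serialized with json.dumps and sent with any HTTP client.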

Response

audio
string

The synthesized audio. By default this is a hex-encoded string containing audio in the format specified by audio_setting.format (mp3/pcm/flac). When output_format is url, it is a download URL instead. When stream is true, only hex format is supported.
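When the audio field is returned as hex, it has to be decoded to bytes before it can be played or saved. A minimal sketch using only the Python standard library follows; the function names and the output path are illustrative, not part of the API.

```python
from pathlib import Path


def decode_audio(hex_audio: str) -> bytes:
    """Convert the hex string from the `audio` field into raw audio bytes."""
    return bytes.fromhex(hex_audio)


def save_audio(hex_audio: str, path: str) -> int:
    """Decode the hex audio, write it to `path`, and return the byte count."""
    data = decode_audio(hex_audio)
    Path(path).write_bytes(data)
    return len(data)
```

The file extension should match audio_setting.format, for example save_audio(response["audio"], "speech.mp3") for mp3 output.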

status
number

The current status of the audio stream. Returned only when stream is true. 1 indicates synthesis in progress, 2 indicates synthesis completed.

Example

Below are examples of how to use the MiniMax Speech-02-turbo synchronous API.

  1. Non-streaming (stream is false)

The request below sets output_format to url, so the response contains a download URL. If output_format is omitted, the audio is returned as a hex string.

Request:

curl \
-X POST https://api.novita.ai/v3/minimax-speech-02-turbo \
-H "Authorization: Bearer $your_api_key" \
-H "Content-Type: application/json" \
-d '{
  "text": "Audio generation technology is evolving rapidly, enabling the creation of speech, music, and sound effects from text or data inputs. It supports applications in media, accessibility, customer service, and content creation. With improved quality and customization, these tools are increasingly integrated into digital platforms across various industries.",
  "stream": false,
  "output_format": "url",
  "voice_setting": {
    "speed": 1.1,
    "voice_id": "Wise_Woman",
    "emotion": "happy"
  }
}'

Response:

{
  "audio": "https://faas-minimax-audio-v2.s3.ap-southeast-1.amazonaws.com/test/60af5b60-5159-421e-9d60-018e6bec4112-9086e96a-dbd1-4588-b025-608312e07244.mp3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIASVPYCN6LRCW3SOUV%2F20250710%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20250710T113309Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=d6e233425b6ab26c772820135394bbd474beaaeb03e1982f659f7e5583bfcab7"
}
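Because the returned URL is only valid for 24 hours, it can help to record when it stops working. The sketch below assumes the URL is an S3 presigned URL carrying X-Amz-Date and X-Amz-Expires query parameters, as in the sample response above; the API reference does not guarantee that shape, and url_expiry is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def url_expiry(audio_url: str) -> datetime:
    """Return the UTC time at which a presigned audio URL expires.

    Assumes the URL has X-Amz-Date and X-Amz-Expires query parameters;
    a KeyError is raised if either is missing.
    """
    query = parse_qs(urlsplit(audio_url).query)
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))
```

For the sample response, X-Amz-Date is 20250710T113309Z and X-Amz-Expires is 86400, so the URL expires at 11:33:09 UTC on 2025-07-11.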

  2. Streaming (stream is true)

Request:

curl \
-X POST https://api.novita.ai/v3/minimax-speech-02-turbo \
-H "Authorization: Bearer $your_api_key" \
-H "Content-Type: application/json" \
-d '{
  "text": "Audio generation technology is evolving rapidly, enabling the creation of speech, music, and sound effects from text or data inputs. It supports applications in media, accessibility, customer service, and content creation. With improved quality and customization, these tools are increasingly integrated into digital platforms across various industries.",
  "stream": true,
  "voice_setting": {
    "speed": 1.1,
    "voice_id": "Wise_Woman",
    "emotion": "happy"
  }
}'

Response:

data: {"audio": "fffb98c4d8 ... e12fc5be", "status": 1}
...
data: {"audio": "4944330453 ... 34505005", "status": 1}
...
data: {"audio": "fffb98c45f ... 04f61e30", "status": 1}
...
data: {"status": 2}
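A streaming response like the one above can be consumed line by line, decoding each hex chunk until the status 2 event arrives. The following is a sketch only: iter_audio_chunks is a hypothetical helper, it expects an iterable of already-decoded text lines (for example from an HTTP client's line iterator), and it assumes the final status 2 event carries no audio that needs to be kept, as in the sample.

```python
import json
from typing import Iterable, Iterator


def iter_audio_chunks(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded audio bytes from `data: {...}` lines until status 2."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not a data event
        event = json.loads(line[len("data:"):])
        if event.get("status") == 2:
            break  # synthesis completed; assumed to carry no new audio
        audio = event.get("audio")
        if audio:
            yield bytes.fromhex(audio)
```

Joining the chunks with b"".join(iter_audio_chunks(lines)) gives the complete audio, which can then be written to a file or fed to a player as it arrives.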