The Fish Audio S2 Pro text-to-speech model converts text into natural-sounding speech, with support for reference voices, sampling controls, chunking, multiple audio formats, and prosody controls.
Request content type. Supports: application/json
Authorization in Bearer format, for example: Bearer {{API Key}}.
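As a minimal sketch of the two header requirements above, the helper below builds a header dictionary in Python. It assumes the standard HTTP header names `Content-Type` and `Authorization`; the function name `build_headers` is hypothetical and not part of the API.

```python
def build_headers(api_key: str) -> dict:
    """Return request headers for a JSON body with Bearer authentication.

    Assumption: the standard HTTP header names are used. This page only
    states the content type and the Bearer format.
    """
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

The API key is passed in as an argument so that it can be read from an environment variable or a secret store rather than hard-coded.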
Request Body
Text to convert to speech. S2-Pro multi-speaker text can use tags such as <|speaker:0|>Hello<|speaker:1|>Hi.
Nucleus sampling diversity control. Value range: [0, 1]
Output audio format. Optional values: wav, pcm, mp3, opus
Latency profile. Optional values: low, normal, balanced
Prosody controls.
Normalize output loudness.
Normalize English and Chinese text before synthesis.
Reference audio samples for zero-shot voice cloning.
Transcript for the reference audio.
Reference audio as base64 or URL, depending on provider support.
MP3 bitrate in kbps. Optional values: 64, 128, 192
Output sample rate in Hz. Null uses the format default: 44100 Hz, or 48000 Hz for opus.
Expressiveness control. Value range: [0, 1]
Text segment size. Value range: [100, 300]
Opus bitrate in bps. -1000 means automatic. Optional values: -1000, 24000, 32000, 48000, 64000
Voice model ID. For multi-speaker use, pass an array matching speaker indices.
Maximum audio tokens per chunk.
Minimum characters before splitting text. Value range: [0, 100]
Penalty to reduce repeated audio patterns.
Early stopping threshold. Value range: [0, 1]
condition_on_previous_chunks: Use previous audio chunks as context.
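The ranges and allowed values listed above can be checked on the client before a request is sent. The sketch below is illustrative only: apart from `condition_on_previous_chunks`, this page does not show the field names, so every key used here (`text`, `top_p`, `temperature`, `format`, `latency`, `mp3_bitrate`, `opus_bitrate`, `chunk_length`, `min_chunk_length`, `early_stop_threshold`) is an assumption, as are the helper names `speaker_text` and `build_body`.

```python
import json

# Numeric ranges taken from the descriptions above.
# The keys are hypothetical field names, not confirmed by this page.
RANGES = {
    "top_p": (0.0, 1.0),                 # nucleus sampling
    "temperature": (0.0, 1.0),           # expressiveness
    "chunk_length": (100, 300),          # text segment size
    "min_chunk_length": (0, 100),        # minimum characters before splitting
    "early_stop_threshold": (0.0, 1.0),  # early stopping threshold
}

# Enumerated values taken from the descriptions above (keys are hypothetical).
CHOICES = {
    "format": {"wav", "pcm", "mp3", "opus"},
    "latency": {"low", "normal", "balanced"},
    "mp3_bitrate": {64, 128, 192},
    "opus_bitrate": {-1000, 24000, 32000, 48000, 64000},
}


def speaker_text(turns):
    """Join (speaker_index, text) pairs using <|speaker:N|> tags."""
    return "".join(f"<|speaker:{idx}|>{text}" for idx, text in turns)


def build_body(text, **options):
    """Validate options against the documented limits and return a JSON string."""
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        elif name in CHOICES and value not in CHOICES[name]:
            allowed = sorted(CHOICES[name])
            raise ValueError(f"{name}={value!r} is not one of {allowed}")
    # Options without a documented limit are passed through unchanged.
    return json.dumps({"text": text, **options})
```

For example, `build_body(speaker_text([(0, "Hello"), (1, "Hi")]), format="mp3")` produces a body whose text uses the speaker tags described above, while `build_body("Hi", chunk_length=50)` raises `ValueError` because 50 is below the documented minimum of 100.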
Response
Generated audio.
Format: binary
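Because the response is raw binary audio rather than JSON, it should be written to disk in binary mode. A minimal sketch, assuming the response bytes have already been read by an HTTP client of your choice; the helper name `save_audio` and the choice to use the requested format as the file extension are illustrative assumptions.

```python
from pathlib import Path


def save_audio(audio_bytes: bytes, stem: str, audio_format: str) -> Path:
    """Write raw response bytes to '<stem>.<audio_format>' and return the path.

    Assumption: audio_format is the output format that was requested
    (wav, pcm, mp3, or opus) and is reused as the file extension.
    """
    path = Path(f"{stem}.{audio_format}")
    path.write_bytes(audio_bytes)  # binary write, no text decoding
    return path
```

Note that pcm output is headerless sample data, so a player needs the sample rate supplied separately to play it back.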