Fish Audio S2 Pro Text to Speech
POST
The Fish Audio S2 Pro text-to-speech model converts text into natural speech, with support for reference voices, sampling controls, chunking, audio formats, and prosody controls.
Request Headers
Supports: application/json
Bearer authentication format, for example: Bearer {{API Key}}.
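As a minimal sketch of the two header values above, the helper below builds a header dictionary in Python. The header names `Content-Type` and `Authorization` are assumptions based on standard HTTP conventions, since this page lists only the values; `build_headers` is a hypothetical helper name.

```python
def build_headers(api_key: str) -> dict[str, str]:
    """Build request headers for a JSON body with Bearer authentication."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        # Assumed header names (standard HTTP); this page shows only the values.
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


# Replace the placeholder with a real API key before sending a request.
headers = build_headers("YOUR_API_KEY")
```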
Request Body
Text to convert to speech. S2-Pro multi-speaker text can use tags such as <|speaker:0|>Hello<|speaker:1|>Hi.
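The speaker-tag format shown above can be assembled programmatically. The sketch below is one way to do it; `tag_dialogue` is a hypothetical helper name, not part of the API.

```python
def tag_dialogue(turns: list[tuple[int, str]]) -> str:
    """Join (speaker_index, line) pairs into S2-Pro multi-speaker text."""
    parts = []
    for speaker, line in turns:
        if speaker < 0:
            raise ValueError("speaker index must be non-negative")
        # Each turn is prefixed with a tag of the form <|speaker:N|>.
        parts.append(f"<|speaker:{speaker}|>{line}")
    return "".join(parts)


text = tag_dialogue([(0, "Hello"), (1, "Hi")])
print(text)  # <|speaker:0|>Hello<|speaker:1|>Hi
```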
Nucleus sampling diversity control. Value range: [0, 1]
Output audio format. Optional values: wav, pcm, mp3, opus
Latency profile. Optional values: low, normal, balanced
Prosody controls.
Normalize English and Chinese text before synthesis.
Reference audio samples for zero-shot voice cloning.
MP3 bitrate in kbps. Optional values: 64, 128, 192
Output sample rate in Hz. Null uses the format default: 44100 Hz, or 48000 Hz for opus.
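The sample-rate default can be mirrored on the client, for example to label or post-process the returned audio. The sketch below assumes one reading of the rule above: a null sample rate resolves to 48000 Hz for opus and 44100 Hz for every other format. `effective_sample_rate` is a hypothetical helper name.

```python
from typing import Optional


def effective_sample_rate(audio_format: str, sample_rate: Optional[int] = None) -> int:
    """Return the sample rate in Hz that the output is expected to use."""
    if sample_rate is not None:
        # An explicit value takes precedence over the format default.
        return sample_rate
    # Assumed defaults: 48000 Hz for opus, 44100 Hz otherwise.
    return 48000 if audio_format == "opus" else 44100
```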
Expressiveness control. Value range: [0, 1]
Text segment size. Value range: [100, 300]
Opus bitrate in bps. -1000 means automatic. Optional values: -1000, 24000, 32000, 48000, 64000
Voice model ID. For multi-speaker use, pass an array matching speaker indices.
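For multi-speaker requests, it can help to confirm before sending that every speaker index used in the text has a voice model ID. The sketch below assumes that position i in the array corresponds to `<|speaker:i|>`; the helper names and the voice IDs in the usage example are hypothetical.

```python
import re

# Matches tags such as <|speaker:0|> and captures the index.
SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")


def speaker_indices(text: str) -> list[int]:
    """Return the distinct speaker indices used in the text, sorted."""
    return sorted({int(n) for n in SPEAKER_TAG.findall(text)})


def missing_voices(text: str, voice_ids: list[str]) -> list[int]:
    """Return speaker indices that have no entry in voice_ids."""
    return [i for i in speaker_indices(text) if i >= len(voice_ids)]
```

For example, `missing_voices("<|speaker:0|>Hello<|speaker:1|>Hi", ["voice-a"])` returns `[1]`, signalling that a second voice model ID is needed.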
Maximum audio tokens per chunk.
Minimum characters before splitting text. Value range: [0, 100]
Penalty to reduce repeated audio patterns.
Early stopping threshold. Value range: [0, 1]
Use previous audio chunks as context.
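The value ranges and optional values listed above can be checked on the client before a request is sent. The sketch below keys each constraint by a descriptive label, not by a real field name, because the field names are not shown on this page; `find_violations` is a hypothetical helper.

```python
# Descriptive labels only; these are NOT the API's field names.
RANGES = {
    "nucleus sampling": (0, 1),
    "expressiveness": (0, 1),
    "text segment size": (100, 300),
    "minimum characters before splitting": (0, 100),
    "early stopping threshold": (0, 1),
}
CHOICES = {
    "output format": {"wav", "pcm", "mp3", "opus"},
    "latency profile": {"low", "normal", "balanced"},
    "mp3 bitrate (kbps)": {64, 128, 192},
    "opus bitrate (bps)": {-1000, 24000, 32000, 48000, 64000},
}


def find_violations(settings: dict) -> list[str]:
    """Return one message per setting that breaks a documented constraint."""
    problems = []
    for label, value in settings.items():
        if label in RANGES:
            low, high = RANGES[label]
            if not low <= value <= high:
                problems.append(f"{label}: {value} is outside [{low}, {high}]")
        elif label in CHOICES and value not in CHOICES[label]:
            problems.append(f"{label}: {value!r} is not an allowed value")
    return problems
```

Settings with labels that appear in neither table are passed through unchecked, since this page gives no range for them.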
Response
Generated audio. Format: binary
Last modified on June 12, 2026