POST /v3/fish-audio-s2-pro-text-to-speech
Fish Audio S2 Pro Text to Speech
curl --request POST \
  --url https://api.novita.ai/v3/fish-audio-s2-pro-text-to-speech \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "top_p": 123,
  "format": "<string>",
  "latency": "<string>",
  "prosody": {
    "speed": 123,
    "volume": 123,
    "normalize_loudness": true
  },
  "normalize": true,
  "references": [
    {
      "text": "<string>",
      "audio": "<string>"
    }
  ],
  "mp3_bitrate": 123,
  "sample_rate": 123,
  "temperature": 123,
  "chunk_length": 123,
  "opus_bitrate": 123,
  "reference_id": "<string>",
  "max_new_tokens": 123,
  "min_chunk_length": 123,
  "repetition_penalty": 123,
  "early_stop_threshold": 123,
  "condition_on_previous_chunks": true
}
'

The Fish Audio S2 Pro text-to-speech model converts text into natural-sounding speech. It supports reference voices, sampling controls, text chunking, multiple output audio formats, and prosody controls.

Request Headers

Content-Type
string
required
Supports: application/json
Authorization
string
required
Bearer authentication format, for example: Bearer {{API Key}}.

Request Body

text
string
required
Text to convert to speech. S2-Pro multi-speaker text can use tags such as <|speaker:0|>Hello<|speaker:1|>Hi.
top_p
number
Nucleus sampling diversity control. Value range: [0, 1]
format
string
default:"mp3"
Output audio format. Optional values: wav, pcm, mp3, opus
latency
string
default:"normal"
Latency profile. Optional values: low, normal, balanced
prosody
object
Prosody controls.
normalize
boolean
default:true
Normalize English and Chinese text before synthesis.
references
array
Reference audio samples for zero-shot voice cloning.
mp3_bitrate
integer
default:128
MP3 bitrate in kbps. Optional values: 64, 128, 192
sample_rate
integer
Output sample rate in Hz. If null, the format default is used: 44100 Hz, or 48000 Hz for opus.
temperature
number
Expressiveness control. Value range: [0, 1]
chunk_length
integer
default:300
Text segment size. Value range: [100, 300]
opus_bitrate
integer
Opus bitrate in bps. -1000 means automatic. Optional values: -1000, 24000, 32000, 48000, 64000
reference_id
string
Voice model ID. For multi-speaker use, pass an array matching speaker indices.
max_new_tokens
integer
default:1024
Maximum audio tokens per chunk.
min_chunk_length
integer
default:50
Minimum characters before splitting text. Value range: [0, 100]
repetition_penalty
number
Penalty to reduce repeated audio patterns.
early_stop_threshold
number
default:1
Early stopping threshold. Value range: [0, 1]
condition_on_previous_chunks
boolean
default:true
Use previous audio chunks as context.
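
The value ranges and optional values listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the names `ALLOWED_VALUES`, `RANGES`, `speaker_text`, and `build_payload` are hypothetical helpers, and how the server itself reacts to out-of-range values is not documented here. The `speaker_text` helper only assembles the `<|speaker:N|>` tag syntax shown in the `text` description.

```python
# Constraints copied from the parameter list above; nothing here is sent to the API.
ALLOWED_VALUES = {
    "format": {"wav", "pcm", "mp3", "opus"},
    "latency": {"low", "normal", "balanced"},
    "mp3_bitrate": {64, 128, 192},
    "opus_bitrate": {-1000, 24000, 32000, 48000, 64000},
}
RANGES = {
    "top_p": (0, 1),
    "temperature": (0, 1),
    "chunk_length": (100, 300),
    "min_chunk_length": (0, 100),
    "early_stop_threshold": (0, 1),
}


def speaker_text(turns):
    """Join (speaker_index, sentence) pairs using the <|speaker:N|> tag syntax."""
    return "".join(f"<|speaker:{index}|>{sentence}" for index, sentence in turns)


def build_payload(text, **options):
    """Return a request body dict, raising ValueError for a documented constraint violation."""
    if not text:
        raise ValueError("text is required")
    for name, value in options.items():
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED_VALUES[name])}")
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be in [{low}, {high}]")
    # Options without a documented constraint are passed through unchanged.
    return {"text": text, **options}
```

For example, `build_payload(speaker_text([(0, "Hello"), (1, "Hi")]), format="opus", top_p=0.7)` returns a dict ready to be serialized as JSON, while `build_payload("Hello", chunk_length=50)` raises `ValueError` because 50 falls outside [100, 300].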

Response

Generated audio. Format: binary
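
Because the response body is binary audio rather than JSON, a client should write the raw bytes to a file. The sketch below uses only the Python standard library and assumes the endpoint URL and the two required headers described above. The function names `build_request` and `synthesize` are illustrative, and the shape of error responses is not documented here, so errors are simply left to surface as `urllib.error.HTTPError`.

```python
import json
import urllib.request

API_URL = "https://api.novita.ai/v3/fish-audio-s2-pro-text-to-speech"


def build_request(api_key, payload):
    """Create a POST request carrying the two required headers and a JSON body."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key, payload, out_path):
    """Send the request and write the binary audio response to out_path."""
    request = build_request(api_key, payload)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx status codes.
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)  # number of bytes written
```

A call such as `synthesize("YOUR_API_KEY", {"text": "Hello", "format": "mp3"}, "hello.mp3")` would save the result as an MP3 file. The file extension should match the `format` value in the request, since the bytes are written exactly as received.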