POST https://api.novita.ai/v4beta/txt2speech
Fish Audio Text to Speech
curl --request POST \
  --url https://api.novita.ai/v4beta/txt2speech \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "temperature": 0.9,
  "top_p": 0.9,
  "references": [
    {
      "text": "<string>"
    }
  ],
  "reference_id": "<string>",
  "prosody": {
    "speed": 123,
    "volume": 123
  },
  "chunk_length": 200,
  "normalize": true,
  "format": "mp3",
  "sample_rate": null,
  "mp3_bitrate": 128,
  "opus_bitrate": 32,
  "latency": "normal"
}
'
For best results, upload reference audio with the create model API before calling this endpoint. This improves speech quality and reduces latency.
Fish Audio converts text into speech. Audio formats supported:
  • WAV / PCM
    • Sample Rate: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
    • Default Sample Rate: 44.1kHz
    • 16-bit, mono
  • MP3
    • Sample Rate: 32kHz, 44.1kHz
    • Default Sample Rate: 44.1kHz
    • mono
    • Bitrate: 64kbps, 128kbps (default), 192kbps
  • Opus
    • Sample Rate: 48kHz
    • Default Sample Rate: 48kHz
    • mono
    • Bitrate: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
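The format constraints listed above can be captured in a small lookup table and checked before a request is sent. The following is a minimal Python sketch, not part of the Novita API: the names `SUPPORTED` and `check_audio_options` are hypothetical, and it assumes that sample rates are passed in Hz (for example 44100 for 44.1kHz), which this page does not state explicitly.

```python
# Sample rates are assumed to be in Hz; bitrates are in kbps.
# Values are transcribed from the supported-format list above.
PCM_RATES = (8000, 16000, 24000, 32000, 44100)

SUPPORTED = {
    "wav": {"rates": PCM_RATES, "default_rate": 44100,
            "bitrates": (), "default_bitrate": None},
    "pcm": {"rates": PCM_RATES, "default_rate": 44100,
            "bitrates": (), "default_bitrate": None},
    "mp3": {"rates": (32000, 44100), "default_rate": 44100,
            "bitrates": (64, 128, 192), "default_bitrate": 128},
    "opus": {"rates": (48000,), "default_rate": 48000,
             "bitrates": (-1000, 24, 32, 48, 64), "default_bitrate": 32},
}


def check_audio_options(fmt, sample_rate=None, bitrate=None):
    """Return (sample_rate, bitrate) with defaults filled in.

    Raises ValueError if the combination is not in the list above.
    """
    if fmt not in SUPPORTED:
        raise ValueError(f"unsupported format: {fmt!r}")
    spec = SUPPORTED[fmt]

    rate = spec["default_rate"] if sample_rate is None else sample_rate
    if rate not in spec["rates"]:
        raise ValueError(f"{fmt} does not support sample rate {rate}")

    # WAV and PCM have no bitrate setting.
    if not spec["bitrates"]:
        if bitrate is not None:
            raise ValueError(f"{fmt} does not take a bitrate")
        return rate, None

    kbps = spec["default_bitrate"] if bitrate is None else bitrate
    if kbps not in spec["bitrates"]:
        raise ValueError(f"{fmt} does not support bitrate {kbps}")
    return rate, kbps
```

For example, `check_audio_options("opus", bitrate=-1000)` selects the automatic Opus bitrate at the only supported Opus sample rate.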

Request Headers

Content-Type
string
required
Enum: application/json
Authorization
string
required
Bearer authentication format, for example: Bearer {{API Key}}.
model
enum<string>
default:"s1"
Specifies which TTS model to use. Only s1 is supported.

Request Body

text
string
required
Text to be converted to speech.
temperature
number
Controls randomness in the speech generation. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.1) make it more deterministic. We recommend 0.9 for the s1 model. Required range: 0 <= x <= 1
top_p
number
Controls diversity via nucleus sampling. Lower values (e.g., 0.1) make the output more focused, while higher values (e.g., 1.0) allow more diversity. We recommend 0.9 for the s1 model. Required range: 0 <= x <= 1
references
ReferenceAudio · object[] | null
References to be used for the speech. This requires MessagePack serialization and overrides reference_voices and reference_texts.
reference_id
string | null
ID of the reference model to be used for the speech.
prosody
ProsodyControl · object
Prosody to be used for the speech.
chunk_length
integer
default:200
Chunk length to be used for the speech. Required range: 100 <= x <= 300
normalize
boolean
default:true
Whether to normalize the speech. Normalization reduces latency but may reduce performance on numbers and dates.
format
enum<string>
default:"mp3"
Format to be used for the speech. Available options: wav, pcm, mp3, opus
sample_rate
integer | null
Sample rate to be used for the speech.
mp3_bitrate
enum<integer>
default:128
MP3 bitrate to be used for the speech. Available options: 64, 128, 192
opus_bitrate
enum<integer>
default:32
Opus bitrate to be used for the speech. Available options: -1000, 24, 32, 48, 64
latency
enum<string>
default:"normal"
Latency mode to be used for the speech. balanced reduces latency but may degrade performance. Available options: normal, balanced
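The ranges and enumerations documented for the request body can be checked on the client before a request is sent. The following is a minimal Python sketch under stated assumptions: `build_tts_payload` is a hypothetical helper, not part of any Novita SDK, and the 0.9 defaults for `temperature` and `top_p` are the values this page recommends for s1, not documented server defaults.

```python
def build_tts_payload(text, temperature=0.9, top_p=0.9, chunk_length=200,
                      normalize=True, fmt="mp3", latency="normal",
                      reference_id=None):
    """Build a JSON-serializable request body, enforcing documented limits."""
    if not text:
        raise ValueError("text is required")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must satisfy 0 <= x <= 1")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must satisfy 0 <= x <= 1")
    if not 100 <= chunk_length <= 300:
        raise ValueError("chunk_length must satisfy 100 <= x <= 300")
    if fmt not in ("wav", "pcm", "mp3", "opus"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if latency not in ("normal", "balanced"):
        raise ValueError(f"unsupported latency: {latency!r}")

    payload = {
        "text": text,
        "temperature": temperature,
        "top_p": top_p,
        "chunk_length": chunk_length,
        "normalize": normalize,
        "format": fmt,
        "latency": latency,
    }
    # reference_id is optional; include it only when a reference model is used.
    if reference_id is not None:
        payload["reference_id"] = reference_id
    return payload
```

The sketch leaves out `references`, since this page says that field requires MessagePack serialization rather than plain JSON.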

Response

The API returns the audio stream directly, in the format specified by the format parameter (default: mp3).
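Because the response body is the audio itself, a client can write it straight to a file. The following is a minimal Python sketch using only the standard library. The helper names `build_request` and `synthesize_to_file` and the 8192-byte read size are illustrative choices, not part of the API, and the sketch assumes that a non-2xx response should surface as the `urllib.error.HTTPError` that `urlopen` raises.

```python
import json
import urllib.request

API_URL = "https://api.novita.ai/v4beta/txt2speech"


def build_request(api_key, payload):
    """Create the POST request with the two required headers."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def synthesize_to_file(api_key, payload, out_path, chunk_size=8192):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_request(api_key, payload)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return out_path
```

A call such as `synthesize_to_file(api_key, {"text": "Hello"}, "hello.mp3")` would save the default MP3 output. The file extension should match the `format` value sent in the payload.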