Streaming Text To Speech
Generate speech from text using real-time streaming, allowing audio playback to begin before the entire synthesis is complete.
This endpoint streams audio data in chunks, enabling low-latency audio playback for applications requiring immediate feedback.
Streaming Format:
- WAV format: First chunk contains WAV header (size=0xFFFFFFFF for streaming) followed by raw PCM data. Subsequent chunks contain only PCM data.
- MP3 format: Each chunk contains post-processed MP3 data that can be decoded independently.
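For the WAV case, a client has to separate the header from the audio in the first chunk before handing PCM to a player. The following is a minimal sketch, assuming the header is a conventional RIFF/WAVE layout with a `fmt ` subchunk ahead of the `data` subchunk and that the whole header arrives inside the first chunk. The function name `split_first_chunk` is hypothetical and not part of the API.

```python
import struct

# Size value the endpoint writes into the header to signal an open-ended stream.
STREAMING_SIZE = 0xFFFFFFFF


def split_first_chunk(chunk: bytes):
    """Return (format_info, pcm_bytes) from the first chunk of a streamed WAV.

    Assumes a conventional RIFF/WAVE header that fits entirely in this chunk.
    """
    if chunk[:4] != b"RIFF" or chunk[8:12] != b"WAVE":
        raise ValueError("first chunk does not start with a RIFF/WAVE header")

    pos = 12  # skip "RIFF", the 4-byte size field, and "WAVE"
    fmt = None
    while pos + 8 <= len(chunk):
        chunk_id = chunk[pos:pos + 4]
        (size,) = struct.unpack_from("<I", chunk, pos + 4)
        body = pos + 8
        if chunk_id == b"fmt ":
            _tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", chunk, body
            )
            fmt = {"channels": channels, "sample_rate": rate, "bits_per_sample": bits}
        elif chunk_id == b"data":
            if fmt is None:
                raise ValueError("data subchunk appeared before fmt subchunk")
            # The declared size is a placeholder when streaming, so everything
            # after the 8-byte subchunk header is treated as PCM.
            return fmt, chunk[body:]
        pos = body + size + (size & 1)  # RIFF subchunks are padded to even length
    raise ValueError("no data subchunk found in the first chunk")
```

Every later chunk can be passed to the audio sink unchanged, since it contains only PCM data.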
Use Cases:
- Conversational AI, chatbots and real-time voice assistants
- Interactive applications requiring immediate audio feedback
- Long-form content where waiting for full synthesis is impractical
Request Parameters:
Uses the same TTSRequest schema as the standard TTS endpoint. Set output.audio_format to “wav” or “mp3” to control the streaming format.
Authorizations
API key for authentication. You can obtain an API key from the Typecast API Console.
Body
Text-to-speech streaming request parameters
Voice identifier. Two prefixes are supported:
- tc_: Built-in Typecast voices (e.g., tc_60e5426de8b95f1d3000d7b5). See Listing all voices for available IDs.
- uc_: Custom voices created via Instant cloning (e.g., uc_64a1b2c3d4e5f6a7b8c9d0e1). Only the owner of a cloned voice can use it.
Case-sensitive: must use lowercase prefix.
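A client can reject malformed identifiers before sending a request. The sketch below checks only the lowercase prefix rule stated above. It makes no assumption about the characters that follow the prefix, and `voice_kind` is a hypothetical helper name.

```python
def voice_kind(voice_id: str) -> str:
    """Classify a voice ID by its prefix; the prefix check is case-sensitive."""
    if voice_id.startswith("tc_"):
        return "built-in"
    if voice_id.startswith("uc_"):
        return "custom"
    raise ValueError(f"voice ID must start with lowercase 'tc_' or 'uc_': {voice_id!r}")
```

An uppercase prefix such as `TC_` raises `ValueError` here rather than being silently lowercased, which mirrors the case-sensitivity rule.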
"tc_60e5426de8b95f1d3000d7b5"
Text to convert to speech. Minimum 1 character, maximum 2000 characters. Credits are consumed based on text length. Supports multiple languages including English, Korean, Japanese, and Chinese. Special characters and punctuation are handled automatically.
1 - 2000
"Everything is so incredibly perfect that I feel like I'm dreaming."
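Input longer than 2000 characters has to be split across several requests. One possible approach, sketched below, breaks the text at sentence boundaries and falls back to a hard cut for any single sentence that exceeds the limit. It assumes the limit counts Unicode code points as Python's `len` does, which the source does not confirm, and `split_text` is a hypothetical helper.

```python
import re

MAX_CHARS = 2000  # per-request maximum stated above


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut a single sentence that is longer than the limit on its own.
        while len(sentence) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece can then be sent as its own request and the resulting audio played back in order.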
Voice model to use for speech synthesis.
- ssfm-v30: Latest model with improved prosody and additional emotion presets (recommended)
- ssfm-v21: Stable production model with reliable quality
ssfm-v30, ssfm-v21
"ssfm-v30"
Language code following the ISO 639-3 standard. Case-insensitive (both "ENG" and "eng" are accepted). If omitted, the language is auto-detected from the text content.
ssfm-v30 Supported Languages (37)
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ARA | Arabic | IND | Indonesian | POR | Portuguese |
| BEN | Bengali | ITA | Italian | RON | Romanian |
| BUL | Bulgarian | JPN | Japanese | RUS | Russian |
| CES | Czech | KOR | Korean | SLK | Slovak |
| DAN | Danish | MSA | Malay | SPA | Spanish |
| DEU | German | NAN | Min Nan | SWE | Swedish |
| ELL | Greek | NLD | Dutch | TAM | Tamil |
| ENG | English | NOR | Norwegian | TGL | Tagalog |
| FIN | Finnish | PAN | Punjabi | THA | Thai |
| FRA | French | POL | Polish | TUR | Turkish |
| HIN | Hindi | UKR | Ukrainian | VIE | Vietnamese |
| HRV | Croatian | YUE | Cantonese | ZHO | Chinese |
| HUN | Hungarian | | | | |
ssfm-v21 Supported Languages (27)
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ARA | Arabic | IND | Indonesian | RON | Romanian |
| BUL | Bulgarian | ITA | Italian | RUS | Russian |
| CES | Czech | JPN | Japanese | SLK | Slovak |
| DAN | Danish | KOR | Korean | SPA | Spanish |
| DEU | German | MSA | Malay | SWE | Swedish |
| ELL | Greek | NLD | Dutch | TAM | Tamil |
| ENG | English | POL | Polish | TGL | Tagalog |
| FIN | Finnish | POR | Portuguese | UKR | Ukrainian |
| FRA | French | HRV | Croatian | ZHO | Chinese |
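The two tables can be encoded as a lookup so that a client catches an unsupported model and language pair before making a request. This is a sketch built only from the codes listed above; the names `SUPPORTED` and `is_supported` are hypothetical.

```python
# Codes from the ssfm-v21 table (27 languages).
V21_CODES = frozenset(
    "ARA BUL CES DAN DEU ELL ENG FIN FRA HRV IND ITA JPN KOR MSA "
    "NLD POL POR RON RUS SLK SPA SWE TAM TGL UKR ZHO".split()
)

# Codes that appear only in the ssfm-v30 table (10 more, 37 in total).
V30_ONLY_CODES = frozenset("BEN HIN HUN NAN NOR PAN THA TUR VIE YUE".split())

SUPPORTED = {
    "ssfm-v21": V21_CODES,
    "ssfm-v30": V21_CODES | V30_ONLY_CODES,
}


def is_supported(model: str, code: str) -> bool:
    """Case-insensitive check of an ISO 639-3 code against a model's table."""
    return code.upper() in SUPPORTED[model]
```

For example, `is_supported("ssfm-v21", "tha")` is false because Thai appears only in the ssfm-v30 table.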
"eng"
Emotion and style settings for the generated speech.
- SmartPrompt (ssfm-v30)
- PresetPrompt (ssfm-v30)
- Prompt (ssfm-v21)
{
"emotion_type": "smart",
"previous_text": "I feel like I'm walking on air and I just want to scream with joy!",
"next_text": "I am literally bursting with happiness and I never want this feeling to end!"
}
Streaming audio output settings including pitch (-12 to +12 semitones), tempo (0.5x to 2.0x), format (wav/mp3), and target_lufs (-70 to 0 LUFS). Note: volume is not available in streaming mode.
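The stated ranges can be checked on the client before a request is sent. The sketch below uses plain function parameters rather than request field names, because apart from output.audio_format and target_lufs the exact keys of the output object are not shown here. `check_output_settings` is a hypothetical helper.

```python
def check_output_settings(pitch=0, tempo=1.0, audio_format="wav",
                          target_lufs=None, volume=None):
    """Return a list of problems with the given streaming output settings."""
    problems = []
    if not -12 <= pitch <= 12:
        problems.append("pitch must be between -12 and +12 semitones")
    if not 0.5 <= tempo <= 2.0:
        problems.append("tempo must be between 0.5x and 2.0x")
    if audio_format not in ("wav", "mp3"):
        problems.append("format must be 'wav' or 'mp3'")
    if target_lufs is not None and not -70 <= target_lufs <= 0:
        problems.append("target_lufs must be between -70 and 0 LUFS")
    if volume is not None:
        problems.append("volume is not available in streaming mode")
    return problems
```

An empty list means every supplied value falls inside the documented range.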
Unsigned integer seed for reproducible speech generation. The same seed with the same input parameters will produce identical audio output.
- Must be a non-negative integer (≥ 0). Negative values are not accepted.
- If omitted, the server generates a random seed each time, producing slight variations.
0 <= x <= 4294967295
42
Response
Success - Returns streaming audio data in chunks
Chunked WAV audio stream (16-bit, mono, 32000 Hz). First chunk includes WAV header with size 0xFFFFFFFF (indicating streaming), followed by raw PCM data. Subsequent chunks contain only PCM data.
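Because the stream is 16-bit mono at 32000 Hz, one second of audio is 32000 × 2 × 1 = 64000 bytes of PCM. A player can use this to estimate how much audio it has buffered. The sketch below counts PCM bytes only, so the WAV header in the first chunk should be removed first; the function names are hypothetical.

```python
SAMPLE_RATE = 32000      # Hz, from the response description
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHANNELS = 1             # mono
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS  # 64000


def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration of `num_bytes` of raw PCM in this stream's format."""
    return num_bytes / BYTES_PER_SECOND


def buffered_seconds(pcm_chunks) -> float:
    """Total duration of an iterable of PCM-only chunks."""
    return pcm_duration_seconds(sum(len(chunk) for chunk in pcm_chunks))
```

For instance, two 32000-byte chunks together hold one second of audio.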