Text To Speech with Timestamps
Generate speech from text and return word/character-level timestamps aligned with the audio. Useful for subtitle sync, per-character highlight animation, and speech-region visualization.
The request body matches the standard /v1/text-to-speech endpoint (voice_id, text, model, language, prompt, output, seed). Instead of raw audio bytes, this endpoint returns a JSON object containing base64-encoded audio plus words and characters arrays.
Use the optional granularity query parameter to return only word-level or only character-level timestamps and reduce payload size.
Language note. For languages that do not use whitespace between words, such as Japanese (jpn) and Chinese (zho), word-level alignment collapses the entire sentence into a single “word”. For those languages, always request granularity=char to receive usable per-character timestamps.
See Listing all voices for available voices.
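A minimal sketch of building a request for this endpoint with the Python standard library. The base URL and the `X-API-KEY` header name are assumptions not stated on this page, and `build_request` is a hypothetical helper. It defaults to `granularity=char` for `jpn` and `zho`, as the language note above recommends.

```python
import json
import urllib.parse
import urllib.request

# Assumptions (not stated on this page): base URL and auth header name.
BASE_URL = "https://api.typecast.ai"
API_KEY_HEADER = "X-API-KEY"

# Languages named above as lacking inter-word whitespace.
NO_WHITESPACE_LANGS = {"jpn", "zho"}


def build_request(api_key, voice_id, text, model="ssfm-v30",
                  language=None, granularity=None):
    """Build (but do not send) a POST to /v1/text-to-speech/with-timestamps."""
    body = {"voice_id": voice_id, "text": text, "model": model}
    if language is not None:
        body["language"] = language
        # Language codes are case-insensitive, so normalise before comparing.
        if granularity is None and language.lower() in NO_WHITESPACE_LANGS:
            granularity = "char"

    url = BASE_URL + "/v1/text-to-speech/with-timestamps"
    if granularity is not None:
        url += "?" + urllib.parse.urlencode({"granularity": granularity})

    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
        method="POST",
    )


req = build_request("YOUR_API_KEY", "tc_60e5426de8b95f1d3000d7b5",
                    "こんにちは", language="jpn")
print(req.full_url)
```

To send it, pass the request to `urllib.request.urlopen(req)` and parse the response body with `json.load`.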
Authorizations
API key for authentication. You can obtain an API key from the Typecast API Console.
Query Parameters
Filter for which timestamp arrays to return.
- Omitted: returns both words and characters.
- word: returns words only (characters is null).
- char: returns characters only (words is null).
Languages without whitespace (e.g., jpn, zho): word alignment yields a single segment covering the whole sentence, so use char to obtain meaningful timestamps.
word, char
Body
Text-to-speech request parameters
Voice identifier. Two prefixes are supported:
- tc_: Built-in Typecast voices (e.g., tc_60e5426de8b95f1d3000d7b5). See Listing all voices for available IDs.
- uc_: Custom voices created via Instant cloning (e.g., uc_64a1b2c3d4e5f6a7b8c9d0e1). Only the owner of a cloned voice can use it.
Case-sensitive: must use lowercase prefix.
"tc_60e5426de8b95f1d3000d7b5"
Text to convert to speech. Minimum 1 character, maximum 2000 characters. Credits are consumed based on text length. Supports multiple languages, including English, Korean, Japanese, and Chinese. Special characters and punctuation are handled automatically.
1 - 2000
"Everything is so incredibly perfect that I feel like I'm dreaming."
Voice model to use for speech synthesis.
- ssfm-v30: Latest model with improved prosody and additional emotion presets (recommended)
- ssfm-v21: Stable production model with reliable quality
ssfm-v30, ssfm-v21
"ssfm-v30"
Language code following the ISO 639-3 standard. Case-insensitive (both "ENG" and "eng" are accepted). If not provided, the language is auto-detected from the text content.
ssfm-v30 Supported Languages (37)
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ARA | Arabic | IND | Indonesian | POR | Portuguese |
| BEN | Bengali | ITA | Italian | RON | Romanian |
| BUL | Bulgarian | JPN | Japanese | RUS | Russian |
| CES | Czech | KOR | Korean | SLK | Slovak |
| DAN | Danish | MSA | Malay | SPA | Spanish |
| DEU | German | NAN | Min Nan | SWE | Swedish |
| ELL | Greek | NLD | Dutch | TAM | Tamil |
| ENG | English | NOR | Norwegian | TGL | Tagalog |
| FIN | Finnish | PAN | Punjabi | THA | Thai |
| FRA | French | POL | Polish | TUR | Turkish |
| HIN | Hindi | UKR | Ukrainian | VIE | Vietnamese |
| HRV | Croatian | YUE | Cantonese | ZHO | Chinese |
| HUN | Hungarian | | | | |
ssfm-v21 Supported Languages (27)
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ARA | Arabic | IND | Indonesian | RON | Romanian |
| BUL | Bulgarian | ITA | Italian | RUS | Russian |
| CES | Czech | JPN | Japanese | SLK | Slovak |
| DAN | Danish | KOR | Korean | SPA | Spanish |
| DEU | German | MSA | Malay | SWE | Swedish |
| ELL | Greek | NLD | Dutch | TAM | Tamil |
| ENG | English | POL | Polish | TGL | Tagalog |
| FIN | Finnish | POR | Portuguese | UKR | Ukrainian |
| FRA | French | HRV | Croatian | ZHO | Chinese |
Timestamp endpoint note. For languages without inter-word whitespace, namely Japanese (jpn) and Chinese (zho), word-level alignment collapses the whole sentence into a single segment. Always pair these languages with granularity=char to receive usable per-character timestamps.
"eng"
Emotion and style settings for the generated speech.
- SmartPrompt (ssfm-v30)
- PresetPrompt (ssfm-v30)
- Prompt (ssfm-v21)
{
"emotion_type": "smart",
"previous_text": "I feel like I'm walking on air and I just want to scream with joy!",
"next_text": "I am literally bursting with happiness and I never want this feeling to end!"
}
Audio output settings: volume (0-200), pitch (-12 to +12 semitones), tempo (0.5x to 2.0x), and format (wav or mp3). These control the final audio characteristics.
Unsigned integer seed for reproducible speech generation. The same seed with the same input parameters will produce identical audio output.
- Must be a non-negative integer (≥ 0). Negative values are not accepted.
- If omitted, the server generates a random seed each time, producing slight variations.
0 <= x <= 4294967295
42
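The constraints stated for the body fields above can be checked client-side before spending credits on a request. The following is a rough sketch, not the server's own validation: `validate_body` is a hypothetical helper, it treats `voice_id` and `text` as required (an assumption), and it counts text length in Unicode code points, which may differ from how the server counts.

```python
MAX_SEED = 4294967295  # 2**32 - 1, the upper bound stated above
VOICE_PREFIXES = ("tc_", "uc_")
MODELS = ("ssfm-v30", "ssfm-v21")
AUDIO_FORMATS = ("wav", "mp3")


def validate_body(body):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []

    voice_id = body.get("voice_id")
    # Prefixes are case-sensitive, so "TC_..." is rejected.
    if not isinstance(voice_id, str) or not voice_id.startswith(VOICE_PREFIXES):
        problems.append("voice_id must start with lowercase 'tc_' or 'uc_'")

    text = body.get("text")
    # len() counts code points; the server's counting rule is an assumption.
    if not isinstance(text, str) or not 1 <= len(text) <= 2000:
        problems.append("text must be 1 to 2000 characters")

    model = body.get("model")
    if model is not None and model not in MODELS:
        problems.append("model must be 'ssfm-v30' or 'ssfm-v21'")

    seed = body.get("seed")
    if seed is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (isinstance(seed, bool) or not isinstance(seed, int)
                or not 0 <= seed <= MAX_SEED):
            problems.append("seed must be an integer from 0 to 4294967295")

    audio_format = (body.get("output") or {}).get("audio_format")
    if audio_format is not None and audio_format not in AUDIO_FORMATS:
        problems.append("output.audio_format must be 'wav' or 'mp3'")

    return problems


for problem in validate_body({"voice_id": "TC_60e5426de8b95f1d3000d7b5",
                              "text": "", "seed": -1}):
    print(problem)
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem in one pass.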
Response
Success - Returns base64 audio and timestamps
Response payload for POST /v1/text-to-speech/with-timestamps — base64-encoded audio plus per-word and per-character timestamps aligned with the generated speech.
Base64-encoded audio bytes. Decode and write to a file using the audio_format extension.
Audio encoding format of the bytes in audio — either wav or mp3, mirroring the request's output.audio_format.
wav, mp3
Length of the generated audio in seconds.
Word-level timestamps (with attached punctuation). null when the request uses granularity=char.
Character-level timestamps (including punctuation and whitespace). null when the request uses granularity=word.
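A small sketch of handling the response described above: decode the base64 `audio` field, write it to a file named with the `audio_format` extension, and treat a null `words` or `characters` array as empty. `save_audio` and `timestamp_counts` are hypothetical helpers. The sample payload is a stand-in, not real API output, and it uses empty objects for timestamp entries because the per-entry fields are not described in this section.

```python
import base64
import json
import tempfile
from pathlib import Path


def save_audio(payload, stem="speech", out_dir="."):
    """Decode the base64 `audio` field and write <stem>.<audio_format>."""
    audio_bytes = base64.b64decode(payload["audio"])
    path = Path(out_dir) / f"{stem}.{payload['audio_format']}"
    path.write_bytes(audio_bytes)
    return path


def timestamp_counts(payload):
    """Count entries per array; a null (None) array counts as zero."""
    return {
        "words": len(payload.get("words") or []),
        "characters": len(payload.get("characters") or []),
    }


# Stand-in payload shaped like a granularity=char response (words is null).
sample = json.loads(
    '{"audio": "UklGRg==", "audio_format": "wav",'
    ' "words": null, "characters": [{}, {}]}'
)

# Write into a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_audio(sample, out_dir=tmp)
    print(saved.name, saved.stat().st_size, "bytes")

print(timestamp_counts(sample))
```

Checking for null before iterating matters because a request with `granularity=word` or `granularity=char` leaves the other array as null, not as an empty list.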