Generate speech from text and return word/character-level timestamps aligned with the audio. Useful for subtitle sync, per-character highlight animation, and speech-region visualization.
The request body matches the standard /v1/text-to-speech endpoint (voice_id, text, model, language, prompt, output, seed). Instead of raw audio bytes, this endpoint returns a JSON object containing base64-encoded audio plus words and characters arrays.
Use the optional granularity query parameter to return only word-level or only character-level timestamps and reduce payload size.
Language note. For languages that do not use whitespace between words, such as Japanese (`jpn`) and Chinese (`zho`), word-level alignment collapses the entire sentence into a single "word". For those languages, always request `granularity=char` to receive usable per-character timestamps.
See Listing all voices for available voices.
API key for authentication. You can obtain an API key from the Typecast API Console.
Filter for which timestamp arrays to return. Available values: `word`, `char`.
- Omitted: returns both `words` and `characters`.
- `word`: returns words only (`characters` is null).
- `char`: returns characters only (`words` is null).
Languages without whitespace (e.g., `jpn`, `zho`): word alignment yields a single segment covering the whole sentence, so use `char` to obtain meaningful timestamps.

Text-to-speech request parameters
Voice ID in format 'tc_' followed by a unique identifier (e.g., 'tc_60e5426de8b95f1d3000d7b5'). Case-sensitive: must use lowercase (tc_xxx). See Listing all voices for available voices.
"tc_60e5426de8b95f1d3000d7b5"
Text to convert to speech. Minimum 1 character, maximum 2000 characters. Credits consumed based on text length. Supports multiple languages including English, Korean, Japanese, and Chinese. Special characters and punctuation are handled automatically.
1 - 2000
"Everything is so incredibly perfect that I feel like I'm dreaming."
Voice model to use for speech synthesis.
ssfm-v30, ssfm-v21
"ssfm-v30"
Language code following ISO 639-3 standard. Case-insensitive (both "ENG" and "eng" are accepted). If not provided, will be auto-detected based on text content.
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ARA | Arabic | IND | Indonesian | POR | Portuguese |
| BEN | Bengali | ITA | Italian | RON | Romanian |
| BUL | Bulgarian | JPN | Japanese | RUS | Russian |
| CES | Czech | KOR | Korean | SLK | Slovak |
| DAN | Danish | MSA | Malay | SPA | Spanish |
| DEU | German | NAN | Min Nan | SWE | Swedish |
| ELL | Greek | NLD | Dutch | TAM | Tamil |
| ENG | English | NOR | Norwegian | TGL | Tagalog |
| FIN | Finnish | PAN | Punjabi | THA | Thai |
| FRA | French | POL | Polish | TUR | Turkish |
| HIN | Hindi | UKR | Ukrainian | VIE | Vietnamese |
| HRV | Croatian | YUE | Cantonese | ZHO | Chinese |
| HUN | Hungarian | | | | |
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ARA | Arabic | IND | Indonesian | RON | Romanian |
| BUL | Bulgarian | ITA | Italian | RUS | Russian |
| CES | Czech | JPN | Japanese | SLK | Slovak |
| DAN | Danish | KOR | Korean | SPA | Spanish |
| DEU | German | MSA | Malay | SWE | Swedish |
| ELL | Greek | NLD | Dutch | TAM | Tamil |
| ENG | English | POL | Polish | TGL | Tagalog |
| FIN | Finnish | POR | Portuguese | UKR | Ukrainian |
| FRA | French | HRV | Croatian | ZHO | Chinese |
Timestamp endpoint note. For languages without inter-word whitespace, Japanese (`jpn`) and Chinese (`zho`), word-level alignment collapses the whole sentence into a single segment. Always pair these languages with `granularity=char` to receive usable per-character timestamps.
"eng"
Emotion and style settings for the generated speech.
{
"emotion_type": "smart",
"previous_text": "I feel like I'm walking on air and I just want to scream with joy!",
"next_text": "I am literally bursting with happiness and I never want this feeling to end!"
}

Audio output settings including volume (0-200), pitch (-12 to +12 semitones), tempo (0.5x to 2.0x), and format (wav/mp3) for controlling the final audio characteristics.
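A client can enforce the documented ranges before sending. A minimal sketch; only `audio_format` is named on this page, so the `volume`, `pitch`, and `tempo` key names are assumptions for illustration:

```python
# Clamp audio output settings to the documented ranges:
# volume 0-200, pitch -12..+12 semitones, tempo 0.5x-2.0x.
# Key names other than "audio_format" are assumed, not confirmed by the docs.
def make_output(volume: int = 100, pitch: int = 0,
                tempo: float = 1.0, audio_format: str = "wav") -> dict:
    if audio_format not in ("wav", "mp3"):
        raise ValueError("audio_format must be 'wav' or 'mp3'")
    return {
        "volume": max(0, min(200, volume)),
        "pitch": max(-12, min(12, pitch)),
        "tempo": max(0.5, min(2.0, tempo)),
        "audio_format": audio_format,
    }

# An out-of-range volume is clamped rather than rejected.
settings = make_output(volume=250, pitch=-3, tempo=1.25, audio_format="mp3")
```

Clamping client-side is a design choice; the API may instead reject out-of-range values, so validating before sending avoids wasted requests either way.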
Unsigned integer seed for reproducible speech generation. The same seed with the same input parameters will produce identical audio output.
0 <= x <= 4294967295
42
Success - Returns base64 audio and timestamps
Response payload for POST /v1/text-to-speech/with-timestamps — base64-encoded audio plus per-word and per-character timestamps aligned with the generated speech.
Base64-encoded audio bytes. Decode and write to a file using the audio_format extension.
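Decoding the `audio` field is a one-liner with the standard library. A sketch using a stand-in payload (the bytes below are fake, not real audio):

```python
import base64

# In practice, `payload` is the parsed JSON body returned by the endpoint.
payload = {
    "audio": base64.b64encode(b"RIFF....fake-wav-bytes").decode("ascii"),
    "audio_format": "wav",
}

# Decode the base64 audio and write it out with the matching extension.
audio_bytes = base64.b64decode(payload["audio"])
with open(f"speech.{payload['audio_format']}", "wb") as f:
    f.write(audio_bytes)
```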
Audio encoding format of the bytes in audio — either wav or mp3, mirroring the request's output.audio_format.
wav, mp3

Length of the generated audio in seconds.
Word-level timestamps (with attached punctuation). null when the request uses granularity=char.
Character-level timestamps (including punctuation and whitespace). null when the request uses granularity=word.
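The `words` array can drive subtitle generation directly. A sketch assuming each entry carries the spoken token plus start/end times in seconds; the field names `word`, `start`, and `end` are assumptions, so check them against an actual response:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[dict]) -> str:
    """Emit one SRT cue per word (field names assumed: word/start/end)."""
    cues = []
    for i, w in enumerate(words, start=1):
        cues.append(
            f"{i}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{w['word']}\n"
        )
    return "\n".join(cues)

# Stand-in data shaped like a plausible `words` array.
sample = [{"word": "Hello,", "start": 0.0, "end": 0.42},
          {"word": "world.", "start": 0.5, "end": 1.1}]
srt = words_to_srt(sample)
```

For languages requested with `granularity=char`, the same function works on the `characters` array, though grouping several characters per cue usually reads better than one cue per character.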