Text To Speech with Timestamps

curl --request POST \
  --url https://api.typecast.ai/v1/text-to-speech/with-timestamps \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <api-key>' \
  --data @- <<EOF
{
  "voice_id": "tc_60e5426de8b95f1d3000d7b5",
  "text": "Try a 5-minute stretch when you lose focus.",
  "model": "ssfm-v30",
  "language": "eng",
  "prompt": {
    "emotion_type": "preset",
    "emotion_preset": "normal",
    "emotion_intensity": 1.0
  }
}
EOF

{
  "audio": "UklGRs...(base64-encoded audio omitted)",
  "audio_format": "wav",
  "audio_duration": 3.38,
  "words": [
    { "text": "Try", "start": 0.08, "end": 0.38 },
    { "text": "a", "start": 0.42, "end": 0.52 },
    { "text": "5-minute", "start": 0.56, "end": 1.26 },
    { "text": "stretch", "start": 1.3, "end": 1.88 },
    { "text": "when", "start": 1.92, "end": 2.18 },
    { "text": "you", "start": 2.22, "end": 2.42 },
    { "text": "lose", "start": 2.46, "end": 2.78 },
    { "text": "focus.", "start": 2.82, "end": 3.38 }
  ],
  "characters": [
    { "text": "T", "start": 0.08, "end": 0.18 },
    { "text": "r", "start": 0.18, "end": 0.27 },
    { "text": "y", "start": 0.27, "end": 0.38 },
    { "text": " ", "start": 0.38, "end": 0.42 },
    { "text": "a", "start": 0.42, "end": 0.52 },
    { "text": " ", "start": 0.52, "end": 0.56 },
    { "text": "5", "start": 0.56, "end": 0.7 },
    { "text": "-", "start": 0.7, "end": 0.76 },
    { "text": "m", "start": 0.76, "end": 0.86 },
    { "text": "i", "start": 0.86, "end": 0.94 },
    { "text": "n", "start": 0.94, "end": 1.04 },
    { "text": "u", "start": 1.04, "end": 1.12 },
    { "text": "t", "start": 1.12, "end": 1.2 },
    { "text": "e", "start": 1.2, "end": 1.26 },
    { "text": " ", "start": 1.26, "end": 1.3 },
    { "text": "s", "start": 1.3, "end": 1.4 },
    { "text": "t", "start": 1.4, "end": 1.47 },
    { "text": "r", "start": 1.47, "end": 1.56 },
    { "text": "e", "start": 1.56, "end": 1.64 },
    { "text": "t", "start": 1.64, "end": 1.72 },
    { "text": "c", "start": 1.72, "end": 1.8 },
    { "text": "h", "start": 1.8, "end": 1.88 },
    { "text": " ", "start": 1.88, "end": 1.92 },
    { "text": "w", "start": 1.92, "end": 2.02 },
    { "text": "h", "start": 2.02, "end": 2.08 },
    { "text": "e", "start": 2.08, "end": 2.14 },
    { "text": "n", "start": 2.14, "end": 2.18 },
    { "text": " ", "start": 2.18, "end": 2.22 },
    { "text": "y", "start": 2.22, "end": 2.3 },
    { "text": "o", "start": 2.3, "end": 2.38 },
    { "text": "u", "start": 2.38, "end": 2.42 },
    { "text": " ", "start": 2.42, "end": 2.46 },
    { "text": "l", "start": 2.46, "end": 2.56 },
    { "text": "o", "start": 2.56, "end": 2.64 },
    { "text": "s", "start": 2.64, "end": 2.72 },
    { "text": "e", "start": 2.72, "end": 2.78 },
    { "text": " ", "start": 2.78, "end": 2.82 },
    { "text": "f", "start": 2.82, "end": 2.92 },
    { "text": "o", "start": 2.92, "end": 3.02 },
    { "text": "c", "start": 3.02, "end": 3.12 },
    { "text": "u", "start": 3.12, "end": 3.22 },
    { "text": "s", "start": 3.22, "end": 3.32 },
    { "text": ".", "start": 3.32, "end": 3.38 }
  ]
}

Authorizations

X-API-KEY

string

header

required

API key for authentication. You can obtain an API key from the Typecast API Console.
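For readers calling the endpoint from Python rather than curl, the sketch below shows one way to attach the X-API-KEY header using only the standard library. The URL and header names come from the curl example on this page; the helper names build_request and send are hypothetical, and error handling is left out.

```python
import json
import urllib.request

# Endpoint URL copied from the curl example on this page.
ENDPOINT = "https://api.typecast.ai/v1/text-to-speech/with-timestamps"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request. Hypothetical helper."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-API-KEY": api_key,  # authentication header described above
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Keeping construction separate from sending makes the request easy to inspect in tests. urllib normalizes header capitalization internally, which is harmless because HTTP header names are case-insensitive.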

Query Parameters

granularity

enum<string>

Filter for which timestamp arrays to return.

Omitted: returns both words and characters.
word: returns words only (characters is null).
char: returns characters only (words is null).

Languages without whitespace (e.g., jpn, zho): word alignment yields a single segment covering the whole sentence, so use char to obtain meaningful timestamps.

Available options:

word,

char
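The granularity rules above can be sketched as a small helper that picks char for the whitespace-free languages this page names and otherwise omits the parameter. The function names are hypothetical, and the language set contains only jpn and zho because those are the only examples given here; other languages may behave the same way.

```python
from typing import Optional
from urllib.parse import urlencode

ENDPOINT = "https://api.typecast.ai/v1/text-to-speech/with-timestamps"

# Only the languages this page names as lacking inter-word whitespace.
NO_WHITESPACE_LANGS = {"jpn", "zho"}


def pick_granularity(language: Optional[str]) -> Optional[str]:
    """Return "char" for whitespace-free languages, else None (both arrays)."""
    if language is not None and language.lower() in NO_WHITESPACE_LANGS:
        return "char"
    return None


def endpoint_url(granularity: Optional[str] = None) -> str:
    """Append the granularity query parameter only when it is set."""
    if granularity is None:
        return ENDPOINT
    if granularity not in ("word", "char"):
        raise ValueError("granularity must be 'word' or 'char'")
    return ENDPOINT + "?" + urlencode({"granularity": granularity})
```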

Body

application/json

Text-to-speech request parameters

voice_id

string

required

Voice identifier. Two prefixes are supported:

tc_ — Built-in Typecast voices (e.g., tc_60e5426de8b95f1d3000d7b5). See Listing all voices for available IDs.
uc_ — Custom voices created via Instant cloning (e.g., uc_64a1b2c3d4e5f6a7b8c9d0e1). Only the owner of a cloned voice can use it.

Case-sensitive: must use lowercase prefix.

Example:

"tc_60e5426de8b95f1d3000d7b5"
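A client-side check of the prefix rule might look like the sketch below. It validates only the lowercase prefix described above; the rest of the identifier format is not specified on this page, so it is not checked. The name voice_kind is hypothetical.

```python
def voice_kind(voice_id: str) -> str:
    """Classify a voice ID by its case-sensitive prefix."""
    if voice_id.startswith("tc_"):
        return "built-in"
    if voice_id.startswith("uc_"):
        return "custom"
    raise ValueError(
        f"voice_id must start with lowercase 'tc_' or 'uc_': {voice_id!r}"
    )
```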

text

string

required

Text to convert to speech. Minimum 1 character, maximum 2000 characters. Credits are consumed based on text length. Multiple languages are supported, including English, Korean, Japanese, and Chinese. Special characters and punctuation are handled automatically.

Required string length: 1 - 2000

Example:

"Everything is so incredibly perfect that I feel like I'm dreaming."
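Input longer than 2000 characters has to be split across several requests. The sketch below is one possible approach, not part of the API: it breaks after sentence-ending punctuation where it can and hard-splits any single sentence that is still too long. It assumes the limit is counted the way Python's len counts characters, which this page does not confirm.

```python
import re

MAX_CHARS = 2000  # upper bound on `text` stated above


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split a single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk produces its own audio and its own timestamps starting from zero, so a caller stitching results together would need to offset later chunks by the accumulated audio_duration.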

model

enum<string>

required

Voice model to use for speech synthesis.

ssfm-v30: Latest model with improved prosody and additional emotion presets (recommended)
ssfm-v21: Stable production model with reliable quality

Available options:

ssfm-v30,

ssfm-v21

Example:

"ssfm-v30"

language

string

Language code following the ISO 639-3 standard. Case-insensitive (both "ENG" and "eng" are accepted). If omitted, the language is auto-detected from the text content.

ssfm-v30 Supported Languages (37)

Code	Language	Code	Language	Code	Language
ARA	Arabic	IND	Indonesian	POR	Portuguese
BEN	Bengali	ITA	Italian	RON	Romanian
BUL	Bulgarian	JPN	Japanese	RUS	Russian
CES	Czech	KOR	Korean	SLK	Slovak
DAN	Danish	MSA	Malay	SPA	Spanish
DEU	German	NAN	Min Nan	SWE	Swedish
ELL	Greek	NLD	Dutch	TAM	Tamil
ENG	English	NOR	Norwegian	TGL	Tagalog
FIN	Finnish	PAN	Punjabi	THA	Thai
FRA	French	POL	Polish	TUR	Turkish
HIN	Hindi	UKR	Ukrainian	VIE	Vietnamese
HRV	Croatian	YUE	Cantonese	ZHO	Chinese
HUN	Hungarian

ssfm-v21 Supported Languages (27)

Code	Language	Code	Language	Code	Language
ARA	Arabic	IND	Indonesian	RON	Romanian
BUL	Bulgarian	ITA	Italian	RUS	Russian
CES	Czech	JPN	Japanese	SLK	Slovak
DAN	Danish	KOR	Korean	SPA	Spanish
DEU	German	MSA	Malay	SWE	Swedish
ELL	Greek	NLD	Dutch	TAM	Tamil
ENG	English	POL	Polish	TGL	Tagalog
FIN	Finnish	POR	Portuguese	UKR	Ukrainian
FRA	French	HRV	Croatian	ZHO	Chinese

Timestamp endpoint note: Japanese (jpn) and Chinese (zho) have no inter-word whitespace, so word-level alignment collapses the whole sentence into a single segment. Pair these languages with granularity=char to receive usable per-character timestamps.

Example:

"eng"
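The two language tables above can be encoded as a lookup so that an unsupported model and language pair is caught before a request is sent. This is a sketch transcribed from the tables on this page; the name is_supported is hypothetical, and the sets would need updating whenever the tables change.

```python
# Codes transcribed from the tables above (37 and 27 entries).
SUPPORTED = {
    "ssfm-v30": frozenset(
        "ARA BEN BUL CES DAN DEU ELL ENG FIN FRA HIN HRV HUN "
        "IND ITA JPN KOR MSA NAN NLD NOR PAN POL UKR YUE "
        "POR RON RUS SLK SPA SWE TAM TGL THA TUR VIE ZHO".split()
    ),
    "ssfm-v21": frozenset(
        "ARA BUL CES DAN DEU ELL ENG FIN FRA "
        "IND ITA JPN KOR MSA NLD POL POR HRV "
        "RON RUS SLK SPA SWE TAM TGL UKR ZHO".split()
    ),
}


def is_supported(model: str, language: str) -> bool:
    """Case-insensitive check of a language code against a model's table."""
    return language.upper() in SUPPORTED[model]
```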

prompt

SmartPrompt (ssfm-v30) · object

Emotion and style settings for the generated speech.

SmartPrompt (ssfm-v30)
PresetPrompt (ssfm-v30)
Prompt (ssfm-v21)

Example:

{
  "emotion_type": "smart",
  "previous_text": "I feel like I'm walking on air and I just want to scream with joy!",
  "next_text": "I am literally bursting with happiness and I never want this feeling to end!"
}

output

Output · object

Audio output settings that control the final audio: volume (0-200), pitch (-12 to +12 semitones), tempo (0.5x to 2.0x), and format (wav or mp3).

seed

integer<uint32>

Unsigned integer seed for reproducible speech generation. The same seed with the same input parameters will produce identical audio output.

Must be a non-negative integer (≥ 0). Negative values are not accepted.
If omitted, the server generates a random seed each time, producing slight variations.

Required range: 0 <= x <= 4294967295

Example:

42
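The seed range can be checked client-side before sending. The sketch below uses a hypothetical helper, with_seed, that adds the field only when a seed is given, so that omitting it leaves the server to pick a random one as described above.

```python
from typing import Optional

UINT32_MAX = 2**32 - 1  # 4294967295, the upper bound stated above


def with_seed(payload: dict, seed: Optional[int]) -> dict:
    """Return a copy of payload with a validated seed, or unchanged if None."""
    if seed is None:
        return dict(payload)
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer")
    if not 0 <= seed <= UINT32_MAX:
        raise ValueError(f"seed must be in 0..{UINT32_MAX}")
    return {**payload, "seed": seed}
```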

Response

Success - Returns base64 audio and timestamps

Response payload for POST /v1/text-to-speech/with-timestamps — base64-encoded audio plus per-word and per-character timestamps aligned with the generated speech.

audio

string

required

Base64-encoded audio bytes. Decode and write to a file using the audio_format extension.
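Decoding and saving the audio might look like the following sketch. It relies only on the audio and audio_format response fields documented here; the helper names and the default file stem "speech" are hypothetical.

```python
import base64
from pathlib import Path


def decode_audio(response: dict, stem: str = "speech") -> tuple[str, bytes]:
    """Return (file name, raw audio bytes) for a parsed response body."""
    audio_format = response["audio_format"]
    if audio_format not in ("wav", "mp3"):
        raise ValueError(f"unexpected audio_format: {audio_format!r}")
    return f"{stem}.{audio_format}", base64.b64decode(response["audio"])


def save_audio(response: dict, directory: str = ".", stem: str = "speech") -> Path:
    """Write the decoded audio to disk, using audio_format as the extension."""
    name, data = decode_audio(response, stem)
    path = Path(directory) / name
    path.write_bytes(data)
    return path
```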

audio_format

enum<string>

required

Audio encoding format of the bytes in audio — either wav or mp3, mirroring the request's output.audio_format.

Available options:

wav,

mp3

audio_duration

number

required

Length of the generated audio in seconds.

words

AlignmentSegmentWord · object[] | null

required

Word-level timestamps (with attached punctuation). null when the request uses granularity=char.
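One common use of word-level timestamps is subtitle generation. The sketch below groups words into SRT cues, assuming each entry has the text, start, and end fields shown in the example response, with times in seconds. The grouping size max_words and both function names are illustrative choices, not part of the API.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 4) -> str:
    """Group word timestamps into SRT cues of up to max_words words."""
    if words is None:
        # words is null when the request used granularity=char.
        raise ValueError("no word timestamps in response")
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["text"] for w in group)
        start = srt_time(group[0]["start"])
        end = srt_time(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Joining words with a space suits whitespace-delimited languages; for jpn or zho the character-level array is the better source, as noted under granularity.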

characters

AlignmentSegmentCharacter · object[] | null

required

Character-level timestamps (including punctuation and whitespace). null when the request uses granularity=word.
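Character-level timestamps lend themselves to karaoke-style highlighting, where a player needs to know which character is being spoken at a given playback time. The sketch below uses a binary search and assumes the array is sorted by start time, as it is in the example response; the function name is hypothetical.

```python
from bisect import bisect_right
from typing import Optional


def char_index_at(characters: list, t: float) -> Optional[int]:
    """Return the index of the character spoken at time t (seconds), or None."""
    starts = [c["start"] for c in characters]
    # Last character whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < characters[i]["end"]:
        return i
    return None
```

For repeated lookups during playback, the starts list could be computed once and reused instead of being rebuilt on every call.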
