POST /v1/text-to-speech/with-timestamps
cURL
curl --request POST \
  --url https://api.typecast.ai/v1/text-to-speech/with-timestamps \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <api-key>' \
  --data @- <<EOF
{
  "voice_id": "tc_60e5426de8b95f1d3000d7b5",
  "text": "Try a 5-minute stretch when you lose focus.",
  "model": "ssfm-v30",
  "language": "eng",
  "prompt": {
    "emotion_type": "preset",
    "emotion_preset": "normal",
    "emotion_intensity": 1.0
  }
}
EOF

Example response
{
  "audio": "UklGRs...(base64-encoded audio omitted)",
  "audio_format": "wav",
  "audio_duration": 3.38,
  "words": [
    {
      "text": "Try",
      "start": 0.08,
      "end": 0.38
    },
    {
      "text": "a",
      "start": 0.42,
      "end": 0.52
    },
    {
      "text": "5-minute",
      "start": 0.56,
      "end": 1.26
    },
    {
      "text": "stretch",
      "start": 1.3,
      "end": 1.88
    },
    {
      "text": "when",
      "start": 1.92,
      "end": 2.18
    },
    {
      "text": "you",
      "start": 2.22,
      "end": 2.42
    },
    {
      "text": "lose",
      "start": 2.46,
      "end": 2.78
    },
    {
      "text": "focus.",
      "start": 2.82,
      "end": 3.38
    }
  ],
  "characters": [
    {
      "text": "T",
      "start": 0.08,
      "end": 0.18
    },
    {
      "text": "r",
      "start": 0.18,
      "end": 0.27
    },
    {
      "text": "y",
      "start": 0.27,
      "end": 0.38
    },
    {
      "text": " ",
      "start": 0.38,
      "end": 0.42
    },
    {
      "text": "a",
      "start": 0.42,
      "end": 0.52
    },
    {
      "text": " ",
      "start": 0.52,
      "end": 0.56
    },
    {
      "text": "5",
      "start": 0.56,
      "end": 0.7
    },
    {
      "text": "-",
      "start": 0.7,
      "end": 0.76
    },
    {
      "text": "m",
      "start": 0.76,
      "end": 0.86
    },
    {
      "text": "i",
      "start": 0.86,
      "end": 0.94
    },
    {
      "text": "n",
      "start": 0.94,
      "end": 1.04
    },
    {
      "text": "u",
      "start": 1.04,
      "end": 1.12
    },
    {
      "text": "t",
      "start": 1.12,
      "end": 1.2
    },
    {
      "text": "e",
      "start": 1.2,
      "end": 1.26
    },
    {
      "text": " ",
      "start": 1.26,
      "end": 1.3
    },
    {
      "text": "s",
      "start": 1.3,
      "end": 1.4
    },
    {
      "text": "t",
      "start": 1.4,
      "end": 1.47
    },
    {
      "text": "r",
      "start": 1.47,
      "end": 1.56
    },
    {
      "text": "e",
      "start": 1.56,
      "end": 1.64
    },
    {
      "text": "t",
      "start": 1.64,
      "end": 1.72
    },
    {
      "text": "c",
      "start": 1.72,
      "end": 1.8
    },
    {
      "text": "h",
      "start": 1.8,
      "end": 1.88
    },
    {
      "text": " ",
      "start": 1.88,
      "end": 1.92
    },
    {
      "text": "w",
      "start": 1.92,
      "end": 2.02
    },
    {
      "text": "h",
      "start": 2.02,
      "end": 2.08
    },
    {
      "text": "e",
      "start": 2.08,
      "end": 2.14
    },
    {
      "text": "n",
      "start": 2.14,
      "end": 2.18
    },
    {
      "text": " ",
      "start": 2.18,
      "end": 2.22
    },
    {
      "text": "y",
      "start": 2.22,
      "end": 2.3
    },
    {
      "text": "o",
      "start": 2.3,
      "end": 2.38
    },
    {
      "text": "u",
      "start": 2.38,
      "end": 2.42
    },
    {
      "text": " ",
      "start": 2.42,
      "end": 2.46
    },
    {
      "text": "l",
      "start": 2.46,
      "end": 2.56
    },
    {
      "text": "o",
      "start": 2.56,
      "end": 2.64
    },
    {
      "text": "s",
      "start": 2.64,
      "end": 2.72
    },
    {
      "text": "e",
      "start": 2.72,
      "end": 2.78
    },
    {
      "text": " ",
      "start": 2.78,
      "end": 2.82
    },
    {
      "text": "f",
      "start": 2.82,
      "end": 2.92
    },
    {
      "text": "o",
      "start": 2.92,
      "end": 3.02
    },
    {
      "text": "c",
      "start": 3.02,
      "end": 3.12
    },
    {
      "text": "u",
      "start": 3.12,
      "end": 3.22
    },
    {
      "text": "s",
      "start": 3.22,
      "end": 3.32
    },
    {
      "text": ".",
      "start": 3.32,
      "end": 3.38
    }
  ]
}

Authorizations

X-API-KEY
string
header
required

API key for authentication. You can obtain an API key from the Typecast API Console.
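
As a minimal sketch of how the key is attached, the Python snippet below builds the same request as the cURL sample using only the standard library. The helper name `build_request` and the `TYPECAST_API_KEY` environment variable are illustrative assumptions, not part of the API. The request is only constructed here, not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.typecast.ai/v1/text-to-speech/with-timestamps"


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the X-API-KEY header.

    `build_request` is a hypothetical helper for illustration only.
    """
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )


if __name__ == "__main__":
    # TYPECAST_API_KEY is an assumed variable name; use whatever your setup provides.
    key = os.environ.get("TYPECAST_API_KEY", "<api-key>")
    req = build_request(
        {
            "voice_id": "tc_60e5426de8b95f1d3000d7b5",
            "text": "Try a 5-minute stretch when you lose focus.",
            "model": "ssfm-v30",
        },
        key,
    )
    print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Reading the key from the environment keeps it out of source control.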

Query Parameters

granularity
enum<string>

Filter for which timestamp arrays to return.

  • Omitted: returns both words and characters.
  • word: returns words only (characters is null).
  • char: returns characters only (words is null).

Languages without whitespace (e.g., jpn, zho): word alignment yields a single segment covering the whole sentence, so use char to obtain meaningful timestamps.

Available options:
word,
char
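
The rule above can be sketched as a small URL builder that falls back to `char` for languages without whitespace. The function name `timestamps_url` and the choice to default automatically are assumptions made for illustration. The API itself only defines the `granularity` query parameter.

```python
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://api.typecast.ai/v1/text-to-speech/with-timestamps"

# Languages the documentation names as lacking inter-word whitespace.
NO_WHITESPACE_LANGS = {"jpn", "zho"}


def timestamps_url(language: Optional[str], granularity: Optional[str] = None) -> str:
    """Return the endpoint URL with a suitable granularity query parameter.

    Hypothetical helper: if no granularity is given and the language has no
    whitespace, request character-level timestamps only.
    """
    if granularity is None and language and language.lower() in NO_WHITESPACE_LANGS:
        granularity = "char"
    if granularity is None:
        # Omitting the parameter returns both words and characters.
        return API_URL
    if granularity not in ("word", "char"):
        raise ValueError("granularity must be 'word' or 'char'")
    return f"{API_URL}?{urlencode({'granularity': granularity})}"
```

For example, `timestamps_url("eng")` leaves the parameter off, while `timestamps_url("JPN")` appends `?granularity=char`.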

Body

application/json

Text-to-speech request parameters

voice_id
string
required

Voice ID in the format 'tc_' followed by a unique identifier (e.g., 'tc_60e5426de8b95f1d3000d7b5'). Case-sensitive: must use lowercase (tc_xxx). See Listing all voices for available voices.

Example:

"tc_60e5426de8b95f1d3000d7b5"

text
string
required

Text to convert to speech. Minimum 1 character, maximum 2000 characters. Credits are consumed based on text length. Supports multiple languages including English, Korean, Japanese, and Chinese. Special characters and punctuation are handled automatically.
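
Longer input has to be split across several requests. The sketch below is one possible client-side approach, not something the API provides: it packs whitespace-separated words greedily into chunks of at most 2000 characters. It assumes the limit counts Unicode code points as Python's `len` does, which the documentation does not confirm. The name `split_for_tts` is hypothetical.

```python
from typing import List

MAX_TEXT_CHARS = 2000  # documented upper bound for the `text` field


def split_for_tts(text: str, limit: int = MAX_TEXT_CHARS) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of <= limit characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        if len(word) > limit:
            # No whitespace to split on; the caller needs another strategy.
            raise ValueError("a single token exceeds the character limit")
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

An empty or whitespace-only string yields an empty list, which matches the 1-character minimum: there is nothing valid to send. Text in languages without whitespace (such as Japanese or Chinese) would need a different splitting rule, for example on sentence punctuation.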

Required string length: 1 - 2000
Example:

"Everything is so incredibly perfect that I feel like I'm dreaming."

model
enum<string>
required

Voice model to use for speech synthesis.

  • ssfm-v30: Latest model with improved prosody and additional emotion presets (recommended)
  • ssfm-v21: Stable production model with reliable quality
Available options:
ssfm-v30,
ssfm-v21
Example:

"ssfm-v30"

language
string

Language code following the ISO 639-3 standard. Case-insensitive (both "ENG" and "eng" are accepted). If omitted, the language is auto-detected from the text content.

ssfm-v30 Supported Languages (37)

Code  Language    Code  Language     Code  Language
ARA   Arabic      IND   Indonesian   POR   Portuguese
BEN   Bengali     ITA   Italian      RON   Romanian
BUL   Bulgarian   JPN   Japanese     RUS   Russian
CES   Czech       KOR   Korean       SLK   Slovak
DAN   Danish      MSA   Malay        SPA   Spanish
DEU   German      NAN   Min Nan      SWE   Swedish
ELL   Greek       NLD   Dutch        TAM   Tamil
ENG   English     NOR   Norwegian    TGL   Tagalog
FIN   Finnish     PAN   Punjabi      THA   Thai
FRA   French      POL   Polish       TUR   Turkish
HIN   Hindi       UKR   Ukrainian    VIE   Vietnamese
HRV   Croatian    YUE   Cantonese    ZHO   Chinese
HUN   Hungarian

ssfm-v21 Supported Languages (27)

Code  Language    Code  Language     Code  Language
ARA   Arabic      IND   Indonesian   RON   Romanian
BUL   Bulgarian   ITA   Italian      RUS   Russian
CES   Czech       JPN   Japanese     SLK   Slovak
DAN   Danish      KOR   Korean       SPA   Spanish
DEU   German      MSA   Malay        SWE   Swedish
ELL   Greek       NLD   Dutch        TAM   Tamil
ENG   English     POL   Polish       TGL   Tagalog
FIN   Finnish     POR   Portuguese   UKR   Ukrainian
FRA   French      HRV   Croatian     ZHO   Chinese

Timestamp endpoint note. For languages without inter-word whitespace — Japanese (jpn) and Chinese (zho) — word-level alignment collapses the whole sentence into a single segment. Always pair these languages with granularity=char to receive usable per-character timestamps.
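
A client can check a language code against the selected model before sending a request. The sketch below copies the codes from the two lists in this section. The names `SUPPORTED_LANGUAGES` and `is_supported` are illustrative assumptions, and the server remains the authority on what it accepts.

```python
# Codes transcribed from the ssfm-v30 (37) and ssfm-v21 (27) language lists.
SUPPORTED_LANGUAGES = {
    "ssfm-v30": frozenset(
        "ara ben bul ces dan deu ell eng fin fra hin hrv hun "
        "ind ita jpn kor msa nan nld nor pan pol ukr yue "
        "por ron rus slk spa swe tam tgl tha tur vie zho".split()
    ),
    "ssfm-v21": frozenset(
        "ara bul ces dan deu ell eng fin fra "
        "ind ita jpn kor msa nld pol por hrv "
        "ron rus slk spa swe tam tgl ukr zho".split()
    ),
}


def is_supported(language: str, model: str) -> bool:
    """Case-insensitive membership check; raises KeyError for an unknown model."""
    return language.lower() in SUPPORTED_LANGUAGES[model]
```

For instance, `is_supported("THA", "ssfm-v30")` is true while `is_supported("tha", "ssfm-v21")` is false, since Thai appears only in the ssfm-v30 list.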

Example:

"eng"

prompt
SmartPrompt (ssfm-v30) · object

Emotion and style settings for the generated speech.

Example:
{
"emotion_type": "smart",
"previous_text": "I feel like I'm walking on air and I just want to scream with joy!",
"next_text": "I am literally bursting with happiness and I never want this feeling to end!"
}
output
Output · object

Audio output settings: volume (0-200), pitch (-12 to +12 semitones), tempo (0.5x to 2.0x), and format (wav or mp3).

seed
integer<uint32>

Unsigned integer seed for reproducible speech generation. The same seed with the same input parameters will produce identical audio output.

  • Must be a non-negative integer (≥ 0). Negative values are not accepted.
  • If omitted, the server generates a random seed each time, producing slight variations.
Required range: 0 <= x <= 4294967295
Example:

42
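
The range constraint can be enforced before a request is built. This is a minimal sketch: the helper name `with_seed` is hypothetical, and only the `seed` field name and its uint32 range come from the documentation above.

```python
UINT32_MAX = 4294967295  # documented upper bound for `seed`


def with_seed(payload: dict, seed: int) -> dict:
    """Return a copy of the request body with a validated `seed` added."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer")
    if not 0 <= seed <= UINT32_MAX:
        raise ValueError("seed must satisfy 0 <= seed <= 4294967295")
    return {**payload, "seed": seed}
```

Reusing the returned body unchanged across calls is what makes the output reproducible, since the seed only guarantees identical audio when the other input parameters are also identical.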

Response

Success - Returns base64 audio and timestamps

Response payload for POST /v1/text-to-speech/with-timestamps — base64-encoded audio plus per-word and per-character timestamps aligned with the generated speech.

audio
string
required

Base64-encoded audio bytes. Decode and write to a file using the audio_format extension.
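
A minimal sketch of that decode-and-write step, assuming the response body has already been parsed into a Python dict. The helper name `save_audio` and the default file stem are illustrative choices.

```python
import base64
from pathlib import Path


def save_audio(response: dict, stem: str = "speech") -> Path:
    """Decode the base64 `audio` field and write it as <stem>.<audio_format>."""
    audio_bytes = base64.b64decode(response["audio"])
    path = Path(f"{stem}.{response['audio_format']}")
    path.write_bytes(audio_bytes)
    return path
```

With the example response above, `save_audio(response)` would write `speech.wav` in the current directory.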

audio_format
enum<string>
required

Audio encoding format of the bytes in audio — either wav or mp3, mirroring the request's output.audio_format.

Available options:
wav,
mp3
audio_duration
number
required

Length of the generated audio in seconds.

words
AlignmentSegmentWord · object[] | null
required

Word-level timestamps (with attached punctuation). null when the request uses granularity=char.

characters
AlignmentSegmentCharacter · object[] | null
required

Character-level timestamps (including punctuation and whitespace). null when the request uses granularity=word.
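
One common use of the `words` array is caption generation. The sketch below groups words into SubRip (SRT) cues. The grouping size and the function names are assumptions for illustration, and the code treats a `null` (None) array as producing no cues.

```python
from typing import List, Optional


def to_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: Optional[List[dict]], max_words: int = 4) -> str:
    """Group word timestamps into SRT cues of at most `max_words` words each."""
    if not words:
        # `words` is null when the request used granularity=char.
        return ""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i : i + max_words]
        start = to_srt_time(group[0]["start"])
        end = to_srt_time(group[-1]["end"])
        text = " ".join(w["text"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Each cue spans from the first word's `start` to the last word's `end`, so the short gaps between words inside a group are covered by the caption.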