# Text To Speech with Timestamps

> Generate speech from text **and** return word/character-level timestamps aligned with the audio. Useful for subtitle sync, per-character highlight animation, and speech-region visualization.

The request body matches the standard `/v1/text-to-speech` endpoint (voice_id, text, model, language, prompt, output, seed). Instead of raw audio bytes, this endpoint returns a JSON object containing base64-encoded audio plus `words` and `characters` arrays.
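
As a minimal sketch of consuming that JSON object, the helper below decodes the `audio` field and writes it to a file whose extension follows `audio_format`. The field names come from the response schema on this page; the function name `save_audio` and the `stem` parameter are hypothetical and not part of the Typecast API.

```python
import base64
from pathlib import Path


def save_audio(data: dict, stem: str = "output") -> Path:
    """Decode the base64 `audio` field and write it as <stem>.<audio_format>.

    `data` is the parsed JSON body of a successful response.
    `save_audio` and `stem` are illustrative names, not part of the API.
    """
    audio_bytes = base64.b64decode(data["audio"])
    path = Path(f"{stem}.{data['audio_format']}")  # "wav" or "mp3"
    path.write_bytes(audio_bytes)
    return path
```

Deriving the extension from `audio_format` keeps the file name correct when the request asks for `mp3` output instead of the default `wav`.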

Use the optional `granularity` query parameter to return only word-level or only character-level timestamps and reduce payload size.
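
The sketch below shows one way to attach the `granularity` query parameter to the endpoint URL. The URL and the accepted values (`word`, `char`) are taken from this page; `build_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.typecast.ai/v1/text-to-speech/with-timestamps"


def build_url(granularity: Optional[str] = None) -> str:
    """Return the endpoint URL, adding `granularity` only when requested."""
    if granularity is None:
        # Omitted: the response carries both `words` and `characters`.
        return BASE_URL
    if granularity not in ("word", "char"):
        raise ValueError("granularity must be 'word' or 'char'")
    return f"{BASE_URL}?{urlencode({'granularity': granularity})}"
```

With the `requests` library, passing `params={"granularity": "char"}` to `requests.post` should produce the same query string without building the URL by hand.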

> **Language note.** For languages that do not use whitespace between words — such as Japanese (`jpn`) and Chinese (`zho`) — word-level alignment collapses the entire sentence into a single "word". For those languages, always request `granularity=char` to receive usable per-character timestamps.
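
One way to apply the language note in client code is to pick the granularity from the language code before sending the request. This is a sketch under the assumption that only the two languages named above need special handling; `pick_granularity` and `NO_WHITESPACE_LANGUAGES` are hypothetical names.

```python
from typing import Optional

# Languages this page names as lacking whitespace between words.
NO_WHITESPACE_LANGUAGES = {"jpn", "zho"}


def pick_granularity(language: Optional[str],
                     default: Optional[str] = None) -> Optional[str]:
    """Return "char" for jpn/zho, otherwise the caller's default.

    Language codes are case-insensitive, so the input is lowercased first.
    A `None` result means the query parameter is left out entirely.
    """
    if language is not None and language.lower() in NO_WHITESPACE_LANGUAGES:
        return "char"
    return default
```

When `language` is omitted and the server auto-detects it, this helper cannot know the language in advance, so requesting `char` explicitly may be the safer choice for mixed-language input.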

See [Listing all voices](/docs/api-reference/voices/list-voices) for available voices.
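
For the subtitle-sync use case, the `words` array can be grouped into SubRip (SRT) cues. The sketch below is one possible approach, not an official Typecast utility; the function names and the fixed `words_per_cue` grouping are assumptions made for illustration.

```python
from typing import List, Optional


def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: Optional[List[dict]], words_per_cue: int = 4) -> str:
    """Group word segments into numbered SRT cues.

    `words` is the `words` array from the response; it is null (None)
    when the request used granularity=char, which yields an empty string.
    """
    words = words or []
    cues = []
    for i in range(0, len(words), words_per_cue):
        chunk = words[i:i + words_per_cue]
        start, end = chunk[0]["start"], chunk[-1]["end"]
        text = " ".join(w["text"] for w in chunk)
        cues.append(
            f"{len(cues) + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)
```

Each cue spans from the first word's `start` to the last word's `end`, so cue boundaries follow the alignment returned with the audio.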



## OpenAPI

````yaml /api-reference/openapi.json post /v1/text-to-speech/with-timestamps
openapi: 3.1.0
info:
  title: Typecast API
  version: 0.1.2
  x-logo:
    url: https://typecast.ai/_ipx/_/image/logo/tc_logo.webp
servers:
  - url: https://api.typecast.ai
    description: Production server
security:
  - ApiKeyAuth: []
paths:
  /v1/text-to-speech/with-timestamps:
    post:
      tags:
        - Text-to-Speech
      summary: Text To Speech with Timestamps
      description: >-
        Generate speech from text **and** return word/character-level timestamps
        aligned with the audio. Useful for subtitle sync, per-character
        highlight animation, and speech-region visualization.


        The request body matches the standard `/v1/text-to-speech` endpoint
        (voice_id, text, model, language, prompt, output, seed). Instead of raw
        audio bytes, this endpoint returns a JSON object containing
        base64-encoded audio plus `words` and `characters` arrays.


        Use the optional `granularity` query parameter to return only word-level
        or only character-level timestamps and reduce payload size.


        > **Language note.** For languages that do not use whitespace between
        words — such as Japanese (`jpn`) and Chinese (`zho`) — word-level
        alignment collapses the entire sentence into a single "word". For those
        languages, always request `granularity=char` to receive usable
        per-character timestamps.


        See [Listing all voices](/docs/api-reference/voices/list-voices) for
        available voices.
      operationId: text_to_speech_with_timestamps_v1_text_to_speech_with_timestamps_post
      parameters:
        - name: granularity
          in: query
          required: false
          schema:
            type: string
            enum:
              - word
              - char
          description: >-
            Filter for which timestamp arrays to return.

            - Omitted: returns both `words` and `characters`.

            - `word`: returns `words` only (`characters` is null).

            - `char`: returns `characters` only (`words` is null).


            **Languages without whitespace (e.g., `jpn`, `zho`):** `word`
            alignment yields a single segment covering the whole sentence, so
            use `char` to obtain meaningful timestamps.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/TTSRequestWith-timestamps'
      responses:
        '200':
          description: Success - Returns base64 audio and timestamps
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/TTSWithTimestampsResponse'
              example:
                audio: UklGRs...(base64-encoded audio omitted)
                audio_format: wav
                audio_duration: 3.38
                words:
                  - text: Try
                    start: 0.08
                    end: 0.38
                  - text: a
                    start: 0.42
                    end: 0.52
                  - text: 5-minute
                    start: 0.56
                    end: 1.26
                  - text: stretch
                    start: 1.3
                    end: 1.88
                  - text: when
                    start: 1.92
                    end: 2.18
                  - text: you
                    start: 2.22
                    end: 2.42
                  - text: lose
                    start: 2.46
                    end: 2.78
                  - text: focus.
                    start: 2.82
                    end: 3.38
                characters:
                  - text: T
                    start: 0.08
                    end: 0.18
                  - text: r
                    start: 0.18
                    end: 0.27
                  - text: 'y'
                    start: 0.27
                    end: 0.38
                  - text: ' '
                    start: 0.38
                    end: 0.42
                  - text: a
                    start: 0.42
                    end: 0.52
                  - text: ' '
                    start: 0.52
                    end: 0.56
                  - text: '5'
                    start: 0.56
                    end: 0.7
                  - text: '-'
                    start: 0.7
                    end: 0.76
                  - text: m
                    start: 0.76
                    end: 0.86
                  - text: i
                    start: 0.86
                    end: 0.94
                  - text: 'n'
                    start: 0.94
                    end: 1.04
                  - text: u
                    start: 1.04
                    end: 1.12
                  - text: t
                    start: 1.12
                    end: 1.2
                  - text: e
                    start: 1.2
                    end: 1.26
                  - text: ' '
                    start: 1.26
                    end: 1.3
                  - text: s
                    start: 1.3
                    end: 1.4
                  - text: t
                    start: 1.4
                    end: 1.47
                  - text: r
                    start: 1.47
                    end: 1.56
                  - text: e
                    start: 1.56
                    end: 1.64
                  - text: t
                    start: 1.64
                    end: 1.72
                  - text: c
                    start: 1.72
                    end: 1.8
                  - text: h
                    start: 1.8
                    end: 1.88
                  - text: ' '
                    start: 1.88
                    end: 1.92
                  - text: w
                    start: 1.92
                    end: 2.02
                  - text: h
                    start: 2.02
                    end: 2.08
                  - text: e
                    start: 2.08
                    end: 2.14
                  - text: 'n'
                    start: 2.14
                    end: 2.18
                  - text: ' '
                    start: 2.18
                    end: 2.22
                  - text: 'y'
                    start: 2.22
                    end: 2.3
                  - text: o
                    start: 2.3
                    end: 2.38
                  - text: u
                    start: 2.38
                    end: 2.42
                  - text: ' '
                    start: 2.42
                    end: 2.46
                  - text: l
                    start: 2.46
                    end: 2.56
                  - text: o
                    start: 2.56
                    end: 2.64
                  - text: s
                    start: 2.64
                    end: 2.72
                  - text: e
                    start: 2.72
                    end: 2.78
                  - text: ' '
                    start: 2.78
                    end: 2.82
                  - text: f
                    start: 2.82
                    end: 2.92
                  - text: o
                    start: 2.92
                    end: 3.02
                  - text: c
                    start: 3.02
                    end: 3.12
                  - text: u
                    start: 3.12
                    end: 3.22
                  - text: s
                    start: 3.22
                    end: 3.32
                  - text: .
                    start: 3.32
                    end: 3.38
        '400':
          description: Bad Request - Invalid parameters
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Invalid voice_id
        '401':
          description: Unauthorized - Authentication failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Invalid API key
        '402':
          description: Payment Required - Insufficient credits
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Insufficient credit
        '404':
          description: Not Found - Voice model not available
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Voice not found
        '422':
          description: Validation Error - Request validation failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Invalid request format
        '429':
          description: Too Many Requests - Rate limit exceeded
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Too many requests
        '500':
          description: Internal Server Error - TTS generation or timestamp alignment failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: An unexpected error occurred
      x-codeSamples:
        - lang: cURL
          label: cURL
          source: |
            curl --request POST \
              --url https://api.typecast.ai/v1/text-to-speech/with-timestamps \
              --header 'Content-Type: application/json' \
              --header 'X-API-KEY: <api-key>' \
              --data @- <<EOF
            {
              "voice_id": "tc_60e5426de8b95f1d3000d7b5",
              "text": "Try a 5-minute stretch when you lose focus.",
              "model": "ssfm-v30",
              "language": "eng",
              "prompt": {
                "emotion_type": "preset",
                "emotion_preset": "normal",
                "emotion_intensity": 1.0
              }
            }
            EOF
        - lang: Python
          label: Python (requests)
          source: |
            import base64

            import requests

            API_HOST = "https://api.typecast.ai"

            headers = {
                "X-API-KEY": "<api-key>",
                "Content-Type": "application/json",
            }

            payload = {
                "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                "text": "Try a 5-minute stretch when you lose focus.",
                "model": "ssfm-v30",
                "language": "eng",
                "prompt": {
                    "emotion_type": "preset",
                    "emotion_preset": "normal",
                    "emotion_intensity": 1.0,
                },
            }

            response = requests.post(
                f"{API_HOST}/v1/text-to-speech/with-timestamps",
                headers=headers,
                json=payload,
                timeout=60,
            )
            response.raise_for_status()
            data = response.json()

            # Name the file after the returned format ("wav" or "mp3").
            out_path = f"output.{data['audio_format']}"
            with open(out_path, "wb") as f:
                f.write(base64.b64decode(data["audio"]))
            print(f"Saved {out_path}; duration={data['audio_duration']}s")

            # `words` is null when the request uses granularity=char.
            for w in (data.get("words") or [])[:3]:
                print(f"  word: {w['text']!r} {w['start']:.3f}s - {w['end']:.3f}s")
components:
  schemas:
    TTSRequestWith-timestamps:
      type: object
      properties:
        voice_id:
          type: string
          title: Voice Id
          description: >-
            Voice identifier. Two prefixes are supported:


            - `tc_` — Built-in Typecast voices (e.g.,
            `tc_60e5426de8b95f1d3000d7b5`). See [Listing all
            voices](/docs/api-reference/voices/list-voices) for available IDs.

            - `uc_` — Custom voices created via [Instant
            cloning](/docs/api-reference/voices/instant-cloning) (e.g.,
            `uc_64a1b2c3d4e5f6a7b8c9d0e1`). Only the owner of a cloned voice can
            use it.


            The prefix is case-sensitive and must be lowercase.
          example: tc_60e5426de8b95f1d3000d7b5
        text:
          type: string
          title: Text
          description: >-
            Text to convert to speech. Minimum 1 character, maximum 2000
            characters. Credits are consumed based on text length. Supports
            multiple languages, including English, Korean, Japanese, and
            Chinese. Special characters and punctuation are handled
            automatically.
          example: Everything is so incredibly perfect that I feel like I'm dreaming.
          minLength: 1
          maxLength: 2000
        model:
          $ref: '#/components/schemas/TTSModel'
          description: >
            Voice model to use for speech synthesis.


            - **ssfm-v30**: Latest model with improved prosody and additional
            emotion presets (recommended)

            - **ssfm-v21**: Stable production model with reliable quality
          example: ssfm-v30
        language:
          type: string
          title: Language
          description: >
            Language code following the ISO 639-3 standard. Case-insensitive
            (both "ENG" and "eng" are accepted). If omitted, the language is
            auto-detected from the text content.


            <details>

            <summary><strong>ssfm-v30 Supported Languages
            (37)</strong></summary>


            | Code | Language | Code | Language | Code | Language |

            |------|----------|------|----------|------|----------|

            | ARA | Arabic | IND | Indonesian | POR | Portuguese |

            | BEN | Bengali | ITA | Italian | RON | Romanian |

            | BUL | Bulgarian | JPN | Japanese | RUS | Russian |

            | CES | Czech | KOR | Korean | SLK | Slovak |

            | DAN | Danish | MSA | Malay | SPA | Spanish |

            | DEU | German | NAN | Min Nan | SWE | Swedish |

            | ELL | Greek | NLD | Dutch | TAM | Tamil |

            | ENG | English | NOR | Norwegian | TGL | Tagalog |

            | FIN | Finnish | PAN | Punjabi | THA | Thai |

            | FRA | French | POL | Polish | TUR | Turkish |

            | HIN | Hindi | UKR | Ukrainian | VIE | Vietnamese |

            | HRV | Croatian | YUE | Cantonese | ZHO | Chinese |

            | HUN | Hungarian | | | | |


            </details>


            <details>

            <summary><strong>ssfm-v21 Supported Languages
            (27)</strong></summary>


            | Code | Language | Code | Language | Code | Language |

            |------|----------|------|----------|------|----------|

            | ARA | Arabic | IND | Indonesian | RON | Romanian |

            | BUL | Bulgarian | ITA | Italian | RUS | Russian |

            | CES | Czech | JPN | Japanese | SLK | Slovak |

            | DAN | Danish | KOR | Korean | SPA | Spanish |

            | DEU | German | MSA | Malay | SWE | Swedish |

            | ELL | Greek | NLD | Dutch | TAM | Tamil |

            | ENG | English | POL | Polish | TGL | Tagalog |

            | FIN | Finnish | POR | Portuguese | UKR | Ukrainian |

            | FRA | French | HRV | Croatian | ZHO | Chinese |


            </details>


            > **Timestamp endpoint note.** For languages without inter-word
            whitespace — Japanese (`jpn`) and Chinese (`zho`) — word-level
            alignment collapses the whole sentence into a single segment. Always
            pair these languages with `granularity=char` to receive usable
            per-character timestamps.
          example: eng
        prompt:
          title: Prompt
          description: >-
            Emotion and style settings for the generated speech. Depending on
            the prompt type, these include an emotion preset (for example
            `normal`, `happy`, `sad`, or `angry`) with an intensity from 0.0 to
            2.0, or surrounding text for context-aware emotion inference.
          oneOf:
            - $ref: '#/components/schemas/SmartPrompt'
            - $ref: '#/components/schemas/PresetPrompt'
            - $ref: '#/components/schemas/Prompt'
          discriminator:
            propertyName: emotion_type
            mapping:
              preset:
                $ref: '#/components/schemas/PresetPrompt'
              smart:
                $ref: '#/components/schemas/SmartPrompt'
        output:
          $ref: '#/components/schemas/Output'
          description: >-
            Audio output settings including volume (0-200), pitch (-12 to +12
            semitones), tempo (0.5x to 2.0x), and format (wav/mp3) for
            controlling the final audio characteristics
        seed:
          type: integer
          minimum: 0
          title: Seed
          description: >-
            Unsigned integer seed for reproducible speech generation. The same
            seed with the same input parameters will produce identical audio
            output.


            - Must be a non-negative integer (≥ 0). Negative values are not
            accepted.

            - If omitted, the server generates a random seed each time,
            producing slight variations.
          example: 42
          anyOf:
            - type: integer
              maximum: 4294967295
              minimum: 0
            - type: 'null'
          format: uint32
      required:
        - voice_id
        - text
        - model
      title: TTSRequestWith-timestamps
      description: Text-to-speech request parameters
    TTSWithTimestampsResponse:
      type: object
      properties:
        audio:
          type: string
          title: Audio
          description: >-
            Base64-encoded audio bytes. Decode and write to a file using the
            `audio_format` extension.
        audio_format:
          type: string
          enum:
            - wav
            - mp3
          title: Audio Format
          description: >-
            Audio encoding format of the bytes in `audio` — either `wav` or
            `mp3`, mirroring the request's `output.audio_format`.
        audio_duration:
          type: number
          title: Audio Duration
          description: Length of the generated audio in seconds.
        words:
          title: Words
          description: >-
            Word-level timestamps (with attached punctuation). `null` when the
            request uses `granularity=char`.
          anyOf:
            - type: array
              items:
                $ref: '#/components/schemas/AlignmentSegmentWord'
            - type: 'null'
        characters:
          title: Characters
          description: >-
            Character-level timestamps (including punctuation and whitespace).
            `null` when the request uses `granularity=word`.
          anyOf:
            - type: array
              items:
                $ref: '#/components/schemas/AlignmentSegmentCharacter'
            - type: 'null'
      required:
        - audio
        - audio_format
        - audio_duration
        - words
        - characters
      title: TTSWithTimestampsResponse
      description: >-
        Response payload for POST /v1/text-to-speech/with-timestamps —
        base64-encoded audio plus per-word and per-character timestamps aligned
        with the generated speech.
    ErrorResponse:
      type: object
      properties:
        detail:
          type: string
          description: Error message describing the issue
      required:
        - detail
      example:
        detail: An error occurred processing the request
    TTSModel:
      type: string
      enum:
        - ssfm-v30
        - ssfm-v21
      title: TTSModel
      description: >
        TTS model version to use for speech synthesis. Different models offer
        varying capabilities and quality levels.


        Available models:

        - **ssfm-v30**: Latest model with improved prosody and additional
        emotion presets (recommended)

        - **ssfm-v21**: Stable production model with proven reliability and
        consistent quality
    SmartPrompt:
      type: object
      properties:
        emotion_type:
          type: string
          title: Emotion Type
          description: >
            Discriminator field to identify the prompt type. Must be set to
            "smart" for context-aware emotion inference.
          default: smart
          const: smart
        previous_text:
          type: string
          title: Previous Text
          description: >
            Text that comes BEFORE the `text` field in TTSRequest. Provides
            backward context for emotion inference.


            The model analyzes the flow: `previous_text` → `text` (synthesized)
            → `next_text`


            - Maximum 2000 characters

            - Helps the model understand emotional build-up and context

            - Leave empty if no preceding context is available
          default: ''
          example: I feel like I'm walking on air and I just want to scream with joy!
        next_text:
          type: string
          title: Next Text
          description: >
            Text that comes AFTER the `text` field in TTSRequest. Provides
            forward context for emotion inference.


            The model analyzes the flow: `previous_text` → `text` (synthesized)
            → `next_text`


            - Maximum 2000 characters

            - Helps the model anticipate emotional transitions

            - Leave empty if no following context is available
          default: ''
          example: >-
            I am literally bursting with happiness and I never want this feeling
            to end!
      title: SmartPrompt (ssfm-v30)
      description: Emotion and style settings for the generated speech.
      example:
        emotion_type: smart
        previous_text: I feel like I'm walking on air and I just want to scream with joy!
        next_text: >-
          I am literally bursting with happiness and I never want this feeling
          to end!
      additionalProperties: false
    PresetPrompt:
      type: object
      properties:
        emotion_type:
          type: string
          title: Emotion Type
          description: >
            Discriminator field to identify the prompt type. Must be set to
            "preset" for preset-based emotion control.
          default: preset
          const: preset
        emotion_preset:
          $ref: '#/components/schemas/EmotionEnum'
          description: >
            Emotion preset to apply to the generated speech.


            Supported emotions: normal, happy, sad, angry, whisper, toneup,
            tonedown


            Check available emotions for each voice through the /v2/voices API.
          default: normal
          example: normal
        emotion_intensity:
          type: number
          maximum: 2
          minimum: 0
          title: Emotion Intensity
          description: >
            Controls the strength of emotional expression in the generated
            speech.


            - 0.0: Completely neutral, no emotional coloring

            - 0.5: Subtle emotional hints

            - 1.0: Standard emotional expression (default)

            - 1.5: Strong emotional emphasis

            - 2.0: Maximum intensity, highly expressive
          default: 1
          example: 1
      title: PresetPrompt (ssfm-v30)
      description: Emotion and style settings for the generated speech.
      additionalProperties: false
    Prompt:
      properties:
        emotion_preset:
          description: |
            Emotion preset to apply.

            Supported emotions for ssfm-v21: normal, happy, sad, angry

            Check available emotions for each voice through the /v2/voices API.
          example: normal
        emotion_intensity:
          description: |
            Controls the strength of emotional expression (0.0 to 2.0).

            - 0.0: Completely neutral
            - 1.0: Standard expression (default)
            - 2.0: Maximum intensity
          example: 1
      title: Prompt (ssfm-v21)
      description: Emotion and style settings for the generated speech.
    Output:
      type: object
      properties:
        target_lufs:
          type: number
          title: Target Lufs
          description: >
            Sets the target absolute loudness (LUFS) for the output audio. This
            normalizes all generated voices to a consistent volume level,
            regardless of the original source's loudness. Values closer to 0 are
            louder, while values closer to -70 are quieter.


            - Required range: -70 <= x <= 0

            - Recommended values: -14 (common streaming standard), -23
            (broadcast standard)

            - **Note:** This parameter cannot be used simultaneously with the
            `volume` parameter. Use `target_lufs` for consistent absolute
            loudness across different clips, or use `volume` for traditional
            relative scaling.
          example: -14
          anyOf:
            - type: number
              maximum: 0
              minimum: -70
            - type: 'null'
        volume:
          title: Volume
          description: >
            Adjusts the relative volume of the output audio: 0 (completely
            silent), 50 (half volume), 100 (standard volume, default), 150 (50%
            louder than standard), 200 (maximum volume, twice as loud as
            standard).


            Since this only scales the existing volume, using `volume` can
            amplify the loudness differences between voices if they have
            different baseline levels. For consistent output across all clips,
            use `target_lufs` instead.


            - **Note:** This parameter cannot be used simultaneously with the
            `target_lufs` parameter.


            Required range: 0 <= x <= 200
          example: 100
          anyOf:
            - type: integer
              maximum: 200
              minimum: 0
            - type: 'null'
        audio_pitch:
          type: integer
          maximum: 12
          minimum: -12
          title: Audio Pitch
          description: >-
            Adjusts the pitch in semitones to affect perceived gender and age:
            -12 (one octave lower, deeper voice), -6 (half octave lower), 0
            (original pitch, default), +6 (half octave higher), +12 (one octave
            higher, higher voice)
          default: 0
          example: 0
        audio_tempo:
          type: number
          maximum: 2
          minimum: 0.5
          title: Audio Tempo
          description: >-
            Controls speech speed: 0.5 (half speed, very slow and clear), 0.75
            (slightly slower than normal), 1.0 (normal speaking speed, default),
            1.5 (50% faster than normal), 2.0 (double speed, very fast speech)
          default: 1
          example: 1
        audio_format:
          type: string
          enum:
            - wav
            - mp3
          title: Audio Format
          description: |
            Output audio format.

            **WAV format:**
            - Uncompressed PCM audio
            - 16-bit depth, mono channel, 44100 Hz sample rate
            - Higher quality, larger file size
            - Recommended for professional audio production

            **MP3 format:**
            - Compressed MPEG Layer III audio
            - 320 kbps bitrate, 44100 Hz sample rate
            - Smaller file size
            - Recommended for web streaming and distribution
          default: wav
          example: wav
      title: Output
      description: Audio output settings for controlling the final audio characteristics
    AlignmentSegmentWord:
      type: object
      properties:
        text:
          type: string
          title: Text
          description: >-
            The text fragment from the original transcript (includes any
            attached punctuation).
        start:
          type: number
          title: Start
          description: >-
            Start time of this segment, in seconds from the beginning of the
            audio.
        end:
          type: number
          title: End
          description: >-
            End time of this segment, in seconds from the beginning of the
            audio.
      required:
        - text
        - start
        - end
      title: AlignmentSegmentWord
      description: >-
        A single word-level alignment segment between the original transcript
        and the generated audio.
    AlignmentSegmentCharacter:
      type: object
      properties:
        text:
          type: string
          title: Text
          description: >-
            A single character from the original transcript. Punctuation and
            whitespace are returned as their own segments.
        start:
          type: number
          title: Start
          description: >-
            Start time of this segment, in seconds from the beginning of the
            audio.
        end:
          type: number
          title: End
          description: >-
            End time of this segment, in seconds from the beginning of the
            audio.
      required:
        - text
        - start
        - end
      title: AlignmentSegmentCharacter
      description: >-
        A single character-level alignment segment between the original
        transcript and the generated audio.
    EmotionEnum:
      type: string
      enum:
        - normal
        - sad
        - happy
        - angry
        - whisper
        - toneup
        - tonedown
      title: EmotionEnum
      description: >
        Available emotion presets for speech synthesis. Each emotion affects the
        tone, pace, and expressiveness of the generated speech.


        **ssfm-v21 Supported Emotions (4 types):**

        - normal: Neutral, balanced tone

        - happy: Bright, cheerful expression

        - sad: Melancholic, subdued tone

        - angry: Strong, intense delivery


        **ssfm-v30 Supported Emotions (7 types):**

        - normal: Neutral, balanced tone

        - happy: Bright, cheerful expression

        - sad: Melancholic, subdued tone

        - angry: Strong, intense delivery

        - whisper: Soft, quiet speech

        - toneup: Higher tonal emphasis

        - tonedown: Lower tonal emphasis


        Check available emotions for each voice through the /v2/voices API
        response.
  securitySchemes:
    ApiKeyAuth:
      type: apiKey
      in: header
      name: X-API-KEY
      description: >-
        API key for authentication. You can obtain an API key from the Typecast
        API Console.

````
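
Several constraints in the request schema above can be checked on the client before a request is sent. The sketch below covers the required fields, the `text` length, the `voice_id` prefixes, the `seed` range, and the rule that `volume` and `target_lufs` cannot be combined. `check_request` is a hypothetical helper and the server remains the authority on validation; in particular, how the server counts characters may differ from Python's `len`.

```python
from typing import List


def check_request(payload: dict) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in ("voice_id", "text", "model"):
        if field not in payload:
            problems.append(f"missing required field: {field}")

    text = payload.get("text")
    if isinstance(text, str) and not 1 <= len(text) <= 2000:
        problems.append("text must be 1 to 2000 characters")

    voice_id = payload.get("voice_id")
    if isinstance(voice_id, str) and not voice_id.startswith(("tc_", "uc_")):
        problems.append("voice_id must start with 'tc_' or 'uc_'")

    seed = payload.get("seed")
    if seed is not None and not 0 <= seed <= 4294967295:
        problems.append("seed must be between 0 and 4294967295")

    output = payload.get("output") or {}
    if output.get("volume") is not None and output.get("target_lufs") is not None:
        problems.append("output.volume and output.target_lufs cannot be combined")
    return problems
```

Catching these cases locally avoids a round trip that would otherwise end in a `400` or `422` response.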