# Streaming Text To Speech

> Generate speech from text using real-time streaming, allowing audio playback to begin before the entire synthesis is complete.

This endpoint streams audio data in chunks, enabling low-latency audio playback for applications requiring immediate feedback.

**Streaming Format:**
- **WAV format**: The first chunk contains the WAV header, whose size field is set to `0xFFFFFFFF` to indicate streaming, followed by raw PCM data (16-bit, mono, 32000 Hz). Subsequent chunks contain only PCM data.
- **MP3 format**: Each chunk contains post-processed MP3 frames that can be decoded and played independently.
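
As a minimal sketch of what the streaming WAV header looks like to a client, the Python snippet below builds and parses a canonical 44-byte PCM header. It assumes the server sends the canonical 44-byte layout (the Python sample on this page makes the same assumption when it skips 44 bytes) and that the `0xFFFFFFFF` placeholder may appear in the RIFF size field, the data size field, or both. The helper names `build_streaming_wav_header` and `parse_wav_header` are hypothetical and not part of the Typecast API.

```python
import struct

# Placeholder size used when the total length is unknown (streaming).
STREAMING_SIZE = 0xFFFFFFFF
# Canonical 44-byte PCM WAV header layout, little-endian, no padding.
HEADER_FORMAT = "<4sI4s4sIHHIIHH4sI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 44 bytes


def build_streaming_wav_header(sample_rate=32000, channels=1, bits=16):
    """Build a canonical PCM WAV header with placeholder sizes (assumed layout)."""
    block_align = channels * bits // 8
    byte_rate = sample_rate * block_align
    return struct.pack(
        HEADER_FORMAT,
        b"RIFF", STREAMING_SIZE, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits,
        b"data", STREAMING_SIZE,
    )


def parse_wav_header(header):
    """Parse the first 44 bytes of a stream into basic audio parameters."""
    if len(header) < HEADER_SIZE:
        raise ValueError("need at least 44 bytes to parse the header")
    (riff, riff_size, wave, fmt, _fmt_size, _audio_format, channels,
     sample_rate, _byte_rate, _block_align, bits, data, data_size) = struct.unpack(
        HEADER_FORMAT, header[:HEADER_SIZE])
    if (riff, wave, fmt, data) != (b"RIFF", b"WAVE", b"fmt ", b"data"):
        raise ValueError("not a canonical 44-byte PCM WAV header")
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "bits_per_sample": bits,
        # Assumption: either size field may carry the streaming placeholder.
        "streaming": STREAMING_SIZE in (riff_size, data_size),
    }


if __name__ == "__main__":
    print(parse_wav_header(build_streaming_wav_header()))
```

A player could call `parse_wav_header` on the first 44 bytes it receives, configure its output device from the result, and treat every byte after the header as raw PCM.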

**Use Cases:**
- Conversational AI, chatbots and real-time voice assistants
- Interactive applications requiring immediate audio feedback
- Long-form content where waiting for full synthesis is impractical

**Request Parameters:**
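The request body uses the `TTSRequestStream` schema, which mirrors the standard TTS request except that the `volume` output setting is not available in streaming mode. Set `output.audio_format` to `"wav"` or `"mp3"` to select the streaming format.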
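
The following sketch shows one way to assemble a request body that selects the streaming format. It only uses fields documented on this page (`voice_id`, `text`, `model`, `output.audio_format`) and the documented limit of 1 to 2000 characters on `text`. The helper name `build_stream_payload` is hypothetical, and the client-side checks are a convenience rather than a substitute for server validation.

```python
import json

ALLOWED_FORMATS = ("wav", "mp3")  # values documented for output.audio_format
MAX_TEXT_LENGTH = 2000            # documented maximum for the text field


def build_stream_payload(voice_id, text, model="ssfm-v30", audio_format="wav"):
    """Return a JSON-serializable body for POST /v1/text-to-speech/stream."""
    if audio_format not in ALLOWED_FORMATS:
        raise ValueError(f"audio_format must be one of {ALLOWED_FORMATS}")
    if not 1 <= len(text) <= MAX_TEXT_LENGTH:
        raise ValueError(f"text must be 1 to {MAX_TEXT_LENGTH} characters long")
    return {
        "voice_id": voice_id,
        "text": text,
        "model": model,
        "output": {"audio_format": audio_format},
    }


if __name__ == "__main__":
    payload = build_stream_payload(
        "tc_60e5426de8b95f1d3000d7b5",
        "Your reservation has been confirmed for Friday at 7 PM.",
        audio_format="mp3",
    )
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the JSON body of the request, for example through the `json=` argument of `requests.post`.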


## OpenAPI

````yaml /api-reference/openapi.json post /v1/text-to-speech/stream
openapi: 3.1.0
info:
  title: Typecast API
  version: 0.1.2
  x-logo:
    url: https://typecast.ai/_ipx/_/image/logo/tc_logo.webp
servers:
  - url: https://api.typecast.ai
    description: Production server
security:
  - ApiKeyAuth: []
paths:
  /v1/text-to-speech/stream:
    post:
      tags:
        - Text-to-Speech
      summary: Streaming Text To Speech
      description: >-
        Generate speech from text using real-time streaming, allowing audio
        playback to begin before the entire synthesis is complete.


        This endpoint streams audio data in chunks, enabling low-latency audio
        playback for applications requiring immediate feedback.


        **Streaming Format:**

        - **WAV format**: The first chunk contains the WAV header, whose size
        field is set to `0xFFFFFFFF` to indicate streaming, followed by raw PCM
        data (16-bit, mono, 32000 Hz). Subsequent chunks contain only PCM data.

        - **MP3 format**: Each chunk contains post-processed MP3 frames that
        can be decoded and played independently.


        **Use Cases:**

        - Conversational AI, chatbots and real-time voice assistants

        - Interactive applications requiring immediate audio feedback

        - Long-form content where waiting for full synthesis is impractical


        **Request Parameters:**

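        The request body uses the `TTSRequestStream` schema, which mirrors the
        standard TTS request except that the `volume` output setting is not
        available in streaming mode. Set `output.audio_format` to "wav" or
        "mp3" to select the streaming format.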

      operationId: text_to_speech_stream_v1_text_to_speech_stream_post
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/TTSRequestStream'
        required: true
      responses:
        '200':
          description: Success - Returns streaming audio data in chunks
          content:
            audio/wav:
              schema:
                type: string
                format: binary
                description: >-
                  Chunked WAV audio stream (16-bit, mono, 32000 Hz). First chunk
                  includes WAV header with size 0xFFFFFFFF (indicating
                  streaming), followed by raw PCM data. Subsequent chunks
                  contain only PCM data.
              example: '[Binary audio stream - WAV chunks]'
            audio/mpeg:
              schema:
                type: string
                format: binary
                description: >-
                  Chunked MP3 audio stream. Each chunk contains valid MP3 frames
                  that can be decoded and played independently.
              example: '[Binary audio stream - MP3 chunks]'
        '400':
          description: Bad Request - Invalid parameters
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Invalid voice_id
        '401':
          description: Unauthorized - Authentication failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Invalid API key
        '402':
          description: Payment Required - Insufficient credits
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Insufficient credit
        '404':
          description: Not Found - Voice model not available
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Voice not found
        '422':
          description: Validation Error - Request validation failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Invalid request format
        '429':
          description: Too Many Requests - Rate limit exceeded
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: Too many requests
        '500':
          description: Internal Server Error - Server processing failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                detail: An unexpected error occurred
      x-codeSamples:
        - lang: cURL
          label: cURL (stream + play)
          source: |
            # Pipe streaming audio directly into ffplay for real-time playback.
            # Requires: ffmpeg (brew/choco/apt install ffmpeg)
            curl -N -s --request POST \
              --url https://api.typecast.ai/v1/text-to-speech/stream \
              --header 'Content-Type: application/json' \
              --header 'X-API-KEY: <api-key>' \
              --data @- <<EOF | ffplay -autoexit -nodisp -loglevel error -i pipe:0
            {
              "voice_id": "tc_60e5426de8b95f1d3000d7b5",
              "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
              "model": "ssfm-v30"
            }
            EOF
        - lang: Python
          label: Python (requests + sounddevice)
          source: >
            # Real-time playback using sounddevice (pip install requests
            sounddevice).

            # Streaming WAV format: 32000 Hz, 16-bit, mono — skip the 44-byte

            # WAV header and feed raw PCM samples to the audio output.

            import requests

            import sounddevice as sd


            API_HOST = "https://api.typecast.ai"

            headers = {"X-API-KEY": "<api-key>", "Content-Type":
            "application/json"}

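            payload = {
                "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                "model": "ssfm-v30",
                # Request WAV explicitly; the playback loop below expects raw PCM.
                "output": {"audio_format": "wav"},
            }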


            resp = requests.post(
                f"{API_HOST}/v1/text-to-speech/stream",
                headers=headers, json=payload, stream=True, timeout=60,
            )

            resp.raise_for_status()


            with sd.RawOutputStream(samplerate=32000, channels=1, dtype="int16")
            as player:
                buf, skip = bytearray(), 44  # WAV header bytes left to drop
                for chunk in resp.iter_content(chunk_size=4096):
                    if not chunk:
                        continue
                    if skip:
                        # The header may be split across chunks.
                        drop = min(skip, len(chunk))
                        chunk, skip = chunk[drop:], skip - drop
                    buf.extend(chunk)
                    # Write 2-byte-aligned slices (int16 samples).
                    n = len(buf) - (len(buf) % 2)
                    if n:
                        player.write(bytes(buf[:n]))
                        del buf[:n]

            print("Playback completed")
        - lang: C#
          label: C# (HttpClient + ffplay)
          source: >
            // Real-time playback by piping the stream into ffplay.

            // Requires: ffmpeg (brew/choco/apt install ffmpeg)

            using System;

            using System.Diagnostics;

            using System.Net.Http;

            using System.Text;

            using System.Threading.Tasks;


            var client = new HttpClient();

            client.DefaultRequestHeaders.Add("X-API-KEY", "<api-key>");


            var requestBody = @"{
              ""voice_id"": ""tc_60e5426de8b95f1d3000d7b5"",
              ""text"": ""Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM."",
              ""model"": ""ssfm-v30""
            }";


            var ffplay = new Process

            {
                StartInfo = new ProcessStartInfo
                {
                    FileName = "ffplay",
                    Arguments = "-autoexit -nodisp -loglevel error -i pipe:0",
                    RedirectStandardInput = true,
                    UseShellExecute = false,
                }
            };

            ffplay.Start();


            var request = new HttpRequestMessage(HttpMethod.Post,
            "https://api.typecast.ai/v1/text-to-speech/stream")

            {
                Content = new StringContent(requestBody, Encoding.UTF8, "application/json")
            };


            // ResponseHeadersRead enables true streaming (avoids full
            buffering).

            using var response = await client.SendAsync(request,
            HttpCompletionOption.ResponseHeadersRead);

            response.EnsureSuccessStatusCode();

            using var stream = await response.Content.ReadAsStreamAsync();

            await stream.CopyToAsync(ffplay.StandardInput.BaseStream);

            ffplay.StandardInput.Close();

            await ffplay.WaitForExitAsync();
        - lang: Kotlin
          label: Kotlin (OkHttp + ffplay)
          source: >
            // Real-time playback by piping the OkHttp response stream into
            ffplay.

            // Requires: ffmpeg (brew/choco/apt install ffmpeg)

            // For Android, replace the ffplay Process with AudioTrack + raw PCM
            feed.

            import okhttp3.MediaType.Companion.toMediaType

            import okhttp3.OkHttpClient

            import okhttp3.Request

            import okhttp3.RequestBody.Companion.toRequestBody


            val ffplay = ProcessBuilder(
                "ffplay", "-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0"
            ).redirectError(ProcessBuilder.Redirect.DISCARD).start()


            val client = OkHttpClient()

            val body = """

            {
              "voice_id": "tc_60e5426de8b95f1d3000d7b5",
              "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
              "model": "ssfm-v30"
            }

            """.trimIndent().toRequestBody("application/json".toMediaType())


            val request = Request.Builder()
                .url("https://api.typecast.ai/v1/text-to-speech/stream")
                .addHeader("X-API-KEY", "<api-key>")
                .post(body)
                .build()

            client.newCall(request).execute().use { response ->
                response.body?.byteStream()?.use { input -> input.copyTo(ffplay.outputStream) }
            }

            ffplay.outputStream.close()

            ffplay.waitFor()
        - lang: C++
          label: C++ (libcurl + ffplay)
          source: >
            // Real-time playback: libcurl write callback pipes each chunk into

            // ffplay via popen. Requires: ffmpeg (brew/choco/apt install
            ffmpeg)

            #include <curl/curl.h>

            #include <cstdio>

            #include <string>


            static FILE* player = nullptr;


            size_t cb(void* ptr, size_t size, size_t nmemb, void*) {
                return fwrite(ptr, size, nmemb, player);
            }


            int main() {
                player = popen("ffplay -autoexit -nodisp -loglevel error -i pipe:0", "w");

                CURL* curl = curl_easy_init();
                struct curl_slist* headers = nullptr;
                headers = curl_slist_append(headers, "Content-Type: application/json");
                headers = curl_slist_append(headers, "X-API-KEY: <api-key>");

                std::string body = R"({
                    "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                    "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                    "model": "ssfm-v30"
                })";

                curl_easy_setopt(curl, CURLOPT_URL, "https://api.typecast.ai/v1/text-to-speech/stream");
                curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
                curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
                curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, cb);

                curl_easy_perform(curl);

                curl_slist_free_all(headers);
                curl_easy_cleanup(curl);
                pclose(player);
                return 0;
            }
        - lang: C
          label: C (libcurl + ffplay)
          source: |
            /* Real-time playback: libcurl write callback pipes each chunk into
             * ffplay via popen. Requires: ffmpeg (brew/choco/apt install ffmpeg) */
            #include <stdio.h>
            #include <curl/curl.h>

            static FILE* player = NULL;

            size_t cb(void* ptr, size_t size, size_t nmemb, void* ud) {
                (void)ud;
                return fwrite(ptr, size, nmemb, player);
            }

            int main(void) {
                player = popen("ffplay -autoexit -nodisp -loglevel error -i pipe:0", "w");

                curl_global_init(CURL_GLOBAL_ALL);
                CURL* curl = curl_easy_init();

                struct curl_slist* headers = NULL;
                headers = curl_slist_append(headers, "Content-Type: application/json");
                headers = curl_slist_append(headers, "X-API-KEY: <api-key>");

                const char* body =
                    "{"
                    "\"voice_id\":\"tc_60e5426de8b95f1d3000d7b5\","
                    "\"text\":\"Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.\","
                    "\"model\":\"ssfm-v30\""
                    "}";

                curl_easy_setopt(curl, CURLOPT_URL, "https://api.typecast.ai/v1/text-to-speech/stream");
                curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
                curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
                curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, cb);

                curl_easy_perform(curl);

                curl_slist_free_all(headers);
                curl_easy_cleanup(curl);
                curl_global_cleanup();
                pclose(player);
                return 0;
            }
        - lang: Swift
          label: Swift (URLSession + ffplay)
          source: >
            // Real-time playback (macOS): pipe URLSession bytes into ffplay via

            // Process. Requires: ffmpeg (brew install ffmpeg).

            // Requires iOS 15 / macOS 12 for URLSession.bytes(for:).

            // Compile with: swiftc -parse-as-library main.swift -o
            streaming_tts

            // For iOS, replace Process/ffplay with AVAudioEngine + scheduled
            PCM buffers.

            import Foundation


            @main

            struct StreamingTTS {
                static func main() async throws {
                    let ffplay = Process()
                    ffplay.executableURL = URL(fileURLWithPath: "/usr/bin/env")
                    ffplay.arguments = ["ffplay", "-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0"]
                    let pipe = Pipe()
                    ffplay.standardInput = pipe
                    try ffplay.run()

                    var request = URLRequest(url: URL(string: "https://api.typecast.ai/v1/text-to-speech/stream")!)
                    request.httpMethod = "POST"
                    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
                    request.setValue("<api-key>", forHTTPHeaderField: "X-API-KEY")
                    request.httpBody = try JSONSerialization.data(withJSONObject: [
                        "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                        "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                        "model": "ssfm-v30",
                    ])

                    let (bytes, _) = try await URLSession.shared.bytes(for: request)
                    var buffer = Data()
                    buffer.reserveCapacity(4096)
                    for try await byte in bytes {
                        buffer.append(byte)
                        if buffer.count >= 4096 {
                            try pipe.fileHandleForWriting.write(contentsOf: buffer)
                            buffer.removeAll(keepingCapacity: true)
                        }
                    }
                    if !buffer.isEmpty {
                        try pipe.fileHandleForWriting.write(contentsOf: buffer)
                    }
                    try pipe.fileHandleForWriting.close()
                    ffplay.waitUntilExit()
                }
            }
        - lang: Rust
          label: Rust (reqwest + ffplay)
          source: >
            // Real-time playback: pipe reqwest stream into ffplay via tokio
            Command.

            // Requires: ffmpeg (brew/choco/apt install ffmpeg)

            // Cargo.toml:

            //   reqwest = { version = "0.12", features = ["json", "stream"] }

            //   tokio   = { version = "1", features = ["full"] }

            //   serde_json = "1"

            use serde_json::json;

            use std::process::Stdio;

            use tokio::io::AsyncWriteExt;

            use tokio::process::Command;


            #[tokio::main]

            async fn main() -> Result<(), Box<dyn std::error::Error>> {
                let mut ffplay = Command::new("ffplay")
                    .args(["-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0"])
                    .stdin(Stdio::piped())
                    .spawn()?;
                let mut stdin = ffplay.stdin.take().expect("failed to open ffplay stdin");

                let client = reqwest::Client::new();
                let body = json!({
                    "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                    "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                    "model": "ssfm-v30"
                });

                let mut response = client
                    .post("https://api.typecast.ai/v1/text-to-speech/stream")
                    .header("X-API-KEY", "<api-key>")
                    .header("Content-Type", "application/json")
                    .json(&body)
                    .send()
                    .await?;

                while let Some(chunk) = response.chunk().await? {
                    stdin.write_all(&chunk).await?;
                }
                drop(stdin);
                ffplay.wait().await?;
                Ok(())
            }
        - lang: JavaScript
          label: JavaScript (Node.js + ffplay)
          source: >
            // Node 18+ (built-in fetch). Pipe streamed audio into ffplay.

            // Requires: ffmpeg (brew/choco/apt install ffmpeg)

            import { spawn } from "node:child_process";


            const ffplay = spawn(
                "ffplay",
                ["-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0"],
                { stdio: ["pipe", "ignore", "ignore"] },
            );


            const response = await
            fetch("https://api.typecast.ai/v1/text-to-speech/stream", {
                method: "POST",
                headers: {
                    "Content-Type": "application/json",
                    "X-API-KEY": "<api-key>",
                },
                body: JSON.stringify({
                    voice_id: "tc_60e5426de8b95f1d3000d7b5",
                    text: "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                    model: "ssfm-v30",
                }),
            });

            if (!response.ok) throw new Error(`HTTP ${response.status}`);


            // fetch().body is a Web ReadableStream — read chunks as they
            arrive.

            const reader = response.body.getReader();

            while (true) {
                const { value, done } = await reader.read();
                if (done) break;
                ffplay.stdin.write(value);
            }

            ffplay.stdin.end();

            await new Promise((resolve) => ffplay.on("close", resolve));
        - lang: PHP
          label: PHP (curl + ffplay)
          source: >
            <?php

            // Pipes the libcurl write callback straight into ffplay's stdin.

            // Requires: ffmpeg (brew/choco/apt install ffmpeg)

            $ffplay = popen("ffplay -autoexit -nodisp -loglevel error -i
            pipe:0", "w");


            $payload = json_encode([
                "voice_id" => "tc_60e5426de8b95f1d3000d7b5",
                "text" => "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                "model" => "ssfm-v30",
            ]);


            $ch = curl_init("https://api.typecast.ai/v1/text-to-speech/stream");

            curl_setopt_array($ch, [
                CURLOPT_POST => true,
                CURLOPT_HTTPHEADER => [
                    "Content-Type: application/json",
                    "X-API-KEY: <api-key>",
                ],
                CURLOPT_POSTFIELDS => $payload,
                CURLOPT_WRITEFUNCTION => function ($ch, $data) use ($ffplay) {
                    fwrite($ffplay, $data);
                    return strlen($data);
                },
            ]);

            curl_exec($ch);

            pclose($ffplay);
        - lang: Go
          label: Go (net/http + ffplay)
          source: |
            // Pipes the streaming response body into ffplay's stdin.
            // Requires: ffmpeg (brew/choco/apt install ffmpeg)
            package main

            import (
                "bytes"
                "io"
                "net/http"
                "os/exec"
            )

            func main() {
                ffplay := exec.Command("ffplay", "-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0")
                stdin, _ := ffplay.StdinPipe()
                if err := ffplay.Start(); err != nil {
                    panic(err)
                }

                body := []byte(`{
                    "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                    "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                    "model": "ssfm-v30"
                }`)

                req, _ := http.NewRequest("POST", "https://api.typecast.ai/v1/text-to-speech/stream", bytes.NewReader(body))
                req.Header.Set("Content-Type", "application/json")
                req.Header.Set("X-API-KEY", "<api-key>")

                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    panic(err)
                }
                defer resp.Body.Close()

                io.Copy(stdin, resp.Body)
                stdin.Close()
                ffplay.Wait()
            }
        - lang: Java
          label: Java (HttpClient + ffplay)
          source: |
            // Java 15+ (text blocks). Uses the Java 11+ HttpClient with an InputStream body handler.
            // Pipes the streaming response into ffplay's stdin.
            // Requires: ffmpeg (brew/choco/apt install ffmpeg)
            import java.net.URI;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;
            import java.io.InputStream;
            import java.io.OutputStream;

            public class StreamingTTS {
                public static void main(String[] args) throws Exception {
                    Process ffplay = new ProcessBuilder(
                            "ffplay", "-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0")
                            .redirectError(ProcessBuilder.Redirect.DISCARD)
                            .start();

                    String body = """
                        {
                          "voice_id": "tc_60e5426de8b95f1d3000d7b5",
                          "text": "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
                          "model": "ssfm-v30"
                        }
                        """;

                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("https://api.typecast.ai/v1/text-to-speech/stream"))
                            .header("Content-Type", "application/json")
                            .header("X-API-KEY", "<api-key>")
                            .POST(HttpRequest.BodyPublishers.ofString(body))
                            .build();

                    HttpResponse<InputStream> response = HttpClient.newHttpClient()
                            .send(request, HttpResponse.BodyHandlers.ofInputStream());

                    try (InputStream in = response.body();
                         OutputStream out = ffplay.getOutputStream()) {
                        in.transferTo(out);
                    }
                    ffplay.waitFor();
                }
            }
        - lang: Ruby
          label: Ruby (net/http + ffplay)
          source: |
            # Pipes the streaming response into ffplay via IO.popen.
            # Requires: ffmpeg (brew/choco/apt install ffmpeg)
            require "net/http"
            require "uri"
            require "json"

            ffplay = IO.popen(
              ["ffplay", "-autoexit", "-nodisp", "-loglevel", "error", "-i", "pipe:0"],
              "wb",
            )

            uri = URI("https://api.typecast.ai/v1/text-to-speech/stream")
            http = Net::HTTP.new(uri.host, uri.port)
            http.use_ssl = true

            req = Net::HTTP::Post.new(uri)
            req["Content-Type"] = "application/json"
            req["X-API-KEY"] = "<api-key>"
            req.body = {
              voice_id: "tc_60e5426de8b95f1d3000d7b5",
              text: "Thanks for reaching out. Your reservation has been confirmed for Friday at 7 PM.",
              model: "ssfm-v30",
            }.to_json

            http.request(req) do |response|
              response.read_body { |chunk| ffplay.write(chunk) }
            end

            ffplay.close
components:
  schemas:
    TTSRequestStream:
      type: object
      properties:
        voice_id:
          type: string
          title: Voice Id
          description: >-
            Voice identifier. Two prefixes are supported:


            - `tc_` — Built-in Typecast voices (e.g.,
            `tc_60e5426de8b95f1d3000d7b5`). See [Listing all
            voices](/docs/api-reference/voices/list-voices) for available IDs.

            - `uc_` — Custom voices created via [Instant
            cloning](/docs/api-reference/voices/instant-cloning) (e.g.,
            `uc_64a1b2c3d4e5f6a7b8c9d0e1`). Only the owner of a cloned voice can
            use it.


            The prefix is case-sensitive and must be lowercase.
          example: tc_60e5426de8b95f1d3000d7b5
        text:
          type: string
          title: Text
          description: >-
            Text to convert to speech. Minimum 1 character, maximum 2000
            characters. Credits are consumed based on text length. Supports
            multiple languages including English, Korean, Japanese, and Chinese.
            Special characters and punctuation are handled automatically.
          example: Everything is so incredibly perfect that I feel like I'm dreaming.
          minLength: 1
          maxLength: 2000
        model:
          $ref: '#/components/schemas/TTSModel'
          description: >
            Voice model to use for speech synthesis.


            - **ssfm-v30**: Latest model with improved prosody and additional
            emotion presets (recommended)

            - **ssfm-v21**: Stable production model with reliable quality
          example: ssfm-v30
        language:
          type: string
          title: Language
          description: >
            Language code following the ISO 639-3 standard. Case-insensitive
            (both "ENG" and "eng" are accepted). If omitted, the language is
            auto-detected from the text content.


            <details>

            <summary><strong>ssfm-v30 Supported Languages
            (37)</strong></summary>


            | Code | Language | Code | Language | Code | Language |

            |------|----------|------|----------|------|----------|

            | ARA | Arabic | IND | Indonesian | POR | Portuguese |

            | BEN | Bengali | ITA | Italian | RON | Romanian |

            | BUL | Bulgarian | JPN | Japanese | RUS | Russian |

            | CES | Czech | KOR | Korean | SLK | Slovak |

            | DAN | Danish | MSA | Malay | SPA | Spanish |

            | DEU | German | NAN | Min Nan | SWE | Swedish |

            | ELL | Greek | NLD | Dutch | TAM | Tamil |

            | ENG | English | NOR | Norwegian | TGL | Tagalog |

            | FIN | Finnish | PAN | Punjabi | THA | Thai |

            | FRA | French | POL | Polish | TUR | Turkish |

            | HIN | Hindi | UKR | Ukrainian | VIE | Vietnamese |

            | HRV | Croatian | YUE | Cantonese | ZHO | Chinese |

            | HUN | Hungarian | | | | |


            </details>


            <details>

            <summary><strong>ssfm-v21 Supported Languages
            (27)</strong></summary>


            | Code | Language | Code | Language | Code | Language |

            |------|----------|------|----------|------|----------|

            | ARA | Arabic | IND | Indonesian | RON | Romanian |

            | BUL | Bulgarian | ITA | Italian | RUS | Russian |

            | CES | Czech | JPN | Japanese | SLK | Slovak |

            | DAN | Danish | KOR | Korean | SPA | Spanish |

            | DEU | German | MSA | Malay | SWE | Swedish |

            | ELL | Greek | NLD | Dutch | TAM | Tamil |

            | ENG | English | POL | Polish | TGL | Tagalog |

            | FIN | Finnish | POR | Portuguese | UKR | Ukrainian |

            | FRA | French | HRV | Croatian | ZHO | Chinese |


            </details>
          example: eng
        prompt:
          title: Prompt
          description: >-
            Emotion and style settings for the generated speech, including the
            emotion type (happy/sad/angry/normal) and an intensity (0.0 to 2.0)
            that controls how strongly the emotion is expressed.
          oneOf:
            - $ref: '#/components/schemas/SmartPrompt'
            - $ref: '#/components/schemas/PresetPrompt'
            - $ref: '#/components/schemas/Prompt'
          discriminator:
            propertyName: emotion_type
            mapping:
              preset:
                $ref: '#/components/schemas/PresetPrompt'
              smart:
                $ref: '#/components/schemas/SmartPrompt'
        output:
          $ref: '#/components/schemas/OutputStream'
          description: >-
            Streaming audio output settings including pitch (-12 to +12
            semitones), tempo (0.5x to 2.0x), format (wav/mp3), and target_lufs
            (-70 to 0 LUFS). Note: volume is not available in streaming mode.
        seed:
          type: integer
          minimum: 0
          title: Seed
          description: >-
            Unsigned integer seed for reproducible speech generation. The same
            seed with the same input parameters will produce identical audio
            output.


            - Must be a non-negative 32-bit integer (0 to 4294967295).
            Negative values are not accepted.

            - If omitted, the server generates a random seed for each request,
            so repeated requests produce slightly different audio.
          example: 42
          anyOf:
            - type: integer
              maximum: 4294967295
              minimum: 0
            - type: 'null'
          format: uint32
      required:
        - voice_id
        - text
        - model
      title: TTSRequestStream
      description: Text-to-speech streaming request parameters
    ErrorResponse:
      type: object
      properties:
        detail:
          type: string
          description: Error message describing the issue
      required:
        - detail
      example:
        detail: An error occurred processing the request
    TTSModel:
      type: string
      enum:
        - ssfm-v30
        - ssfm-v21
      title: TTSModel
      description: >
        TTS model version to use for speech synthesis. Different models offer
        varying capabilities and quality levels.


        Available models:

        - **ssfm-v30**: Latest model with improved prosody and additional
        emotion presets (recommended)

        - **ssfm-v21**: Stable production model with proven reliability and
        consistent quality
    SmartPrompt:
      type: object
      properties:
        emotion_type:
          type: string
          title: Emotion Type
          description: >
            Discriminator field to identify the prompt type. Must be set to
            "smart" for context-aware emotion inference.
          default: smart
          const: smart
        previous_text:
          type: string
          title: Previous Text
          description: >
            Text that comes BEFORE the `text` field in TTSRequest. Provides
            backward context for emotion inference.


            The model analyzes the flow: `previous_text` → `text` (synthesized)
            → `next_text`


            - Maximum 2000 characters

            - Helps the model understand emotional build-up and context

            - Leave empty if no preceding context is available
          default: ''
          example: I feel like I'm walking on air and I just want to scream with joy!
        next_text:
          type: string
          title: Next Text
          description: >
            Text that comes AFTER the `text` field in TTSRequest. Provides
            forward context for emotion inference.


            The model analyzes the flow: `previous_text` → `text` (synthesized)
            → `next_text`


            - Maximum 2000 characters

            - Helps the model anticipate emotional transitions

            - Leave empty if no following context is available
          default: ''
          example: >-
            I am literally bursting with happiness and I never want this feeling
            to end!
      title: SmartPrompt (ssfm-v30)
      description: >-
        Context-aware emotion settings for ssfm-v30. The emotion is inferred
        from `previous_text` and `next_text` rather than set explicitly.
      example:
        emotion_type: smart
        previous_text: I feel like I'm walking on air and I just want to scream with joy!
        next_text: >-
          I am literally bursting with happiness and I never want this feeling
          to end!
      additionalProperties: false
    PresetPrompt:
      type: object
      properties:
        emotion_type:
          type: string
          title: Emotion Type
          description: >
            Discriminator field to identify the prompt type. Must be set to
            "preset" for preset-based emotion control.
          default: preset
          const: preset
        emotion_preset:
          $ref: '#/components/schemas/EmotionEnum'
          description: >
            Emotion preset to apply to the generated speech.


            Supported emotions: normal, happy, sad, angry, whisper, toneup,
            tonedown


            Check available emotions for each voice through the /v2/voices API.
          default: normal
          example: normal
        emotion_intensity:
          type: number
          maximum: 2
          minimum: 0
          title: Emotion Intensity
          description: >
            Controls the strength of emotional expression in the generated
            speech.


            - 0.0: Completely neutral, no emotional coloring

            - 0.5: Subtle emotional hints

            - 1.0: Standard emotional expression (default)

            - 1.5: Strong emotional emphasis

            - 2.0: Maximum intensity, highly expressive
          default: 1
          example: 1
      title: PresetPrompt (ssfm-v30)
      description: >-
        Preset-based emotion settings for ssfm-v30. Select an emotion preset
        and its intensity.
      additionalProperties: false
    Prompt:
      properties:
        emotion_preset:
          description: |
            Emotion preset to apply.

            Supported emotions for ssfm-v21: normal, happy, sad, angry

            Check available emotions for each voice through the /v2/voices API.
          example: normal
        emotion_intensity:
          description: |
            Controls the strength of emotional expression (0.0 to 2.0).

            - 0.0: Completely neutral
            - 1.0: Standard expression (default)
            - 2.0: Maximum intensity
          example: 1
      title: Prompt (ssfm-v21)
      description: >-
        Emotion settings for ssfm-v21. Select one of the four supported
        presets and its intensity.
    OutputStream:
      type: object
      properties:
        target_lufs:
          type: number
          title: Target LUFS
          description: >
            Sets the target absolute loudness (LUFS) for streaming output audio.
            This normalizes generated audio to a consistent loudness regardless
            of the original source. Accepts values from -70 to 0. The `volume`
            parameter is not available in streaming mode.


            Recommended values: -14 (common streaming standard), -23 (broadcast
            standard).
          example: -14
          anyOf:
            - type: number
              maximum: 0
              minimum: -70
            - type: 'null'
        audio_pitch:
          type: integer
          maximum: 12
          minimum: -12
          title: Audio Pitch
          description: >-
            Adjusts the pitch in semitones to affect perceived gender and age:
            -12 (one octave lower, deeper voice), -6 (half octave lower), 0
            (original pitch, default), +6 (half octave higher), +12 (one octave
            higher, higher voice)
          default: 0
          example: 0
        audio_tempo:
          type: number
          maximum: 2
          minimum: 0.5
          title: Audio Tempo
          description: >-
            Controls speech speed: 0.5 (half speed, very slow and clear), 0.75
            (slightly slower than normal), 1.0 (normal speaking speed, default),
            1.5 (50% faster than normal), 2.0 (double speed, very fast speech)
          default: 1
          example: 1
        audio_format:
          type: string
          enum:
            - wav
            - mp3
          title: Audio Format
          description: >
            Output audio format for streaming.


            **WAV format:**

            - Uncompressed PCM audio

            - 16-bit depth, mono channel, **32000 Hz** sample rate

            - Chunked transfer: the first chunk contains the WAV header (size =
            0xFFFFFFFF) followed by raw PCM data; subsequent chunks contain
            only PCM data

            - Recommended when you want to play audio as it arrives


            **MP3 format:**

            - Compressed MPEG Layer III audio

            - 320 kbps bitrate, 44100 Hz sample rate

            - Chunked transfer: each chunk contains independently decodable MPEG
            frames

            - Recommended for bandwidth-constrained clients
          default: wav
          example: wav
      title: OutputStream
      description: >-
        Audio output settings for streaming. Use `target_lufs` for LUFS loudness
        normalization; `volume` is not available in streaming mode.
    EmotionEnum:
      type: string
      enum:
        - normal
        - sad
        - happy
        - angry
        - whisper
        - toneup
        - tonedown
      title: EmotionEnum
      description: >
        Available emotion presets for speech synthesis. Each emotion affects the
        tone, pace, and expressiveness of the generated speech.


        **ssfm-v21 Supported Emotions (4 types):**

        - normal: Neutral, balanced tone

        - happy: Bright, cheerful expression

        - sad: Melancholic, subdued tone

        - angry: Strong, intense delivery


        **ssfm-v30 Supported Emotions (7 types):**

        - normal: Neutral, balanced tone

        - happy: Bright, cheerful expression

        - sad: Melancholic, subdued tone

        - angry: Strong, intense delivery

        - whisper: Soft, quiet speech

        - toneup: Higher tonal emphasis

        - tonedown: Lower tonal emphasis


        Check available emotions for each voice through the /v2/voices API
        response.
  securitySchemes:
    ApiKeyAuth:
      type: apiKey
      in: header
      name: X-API-KEY
      description: >-
        API key for authentication. You can obtain an API key from the Typecast
        API Console.

````
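
The schemas above constrain several request fields (seed range, pitch, tempo, loudness, and the emotion presets allowed per model). The following is a minimal Python sketch of how a client might check those limits locally before sending a request. The helper name `build_stream_request` and the voice ID in the usage line are hypothetical and not part of the Typecast API. The sketch also assumes that ssfm-v21 requests omit `emotion_type`, since the `Prompt (ssfm-v21)` schema does not list it. Sending the body (a `POST` to `/v1/text-to-speech/stream` with the `X-API-KEY` header) is left out.

```python
from typing import Any, Optional

# Limits copied from the TTSRequestStream, OutputStream and EmotionEnum schemas.
MAX_SEED = 4294967295
AUDIO_FORMATS = {"wav", "mp3"}
V21_EMOTIONS = {"normal", "happy", "sad", "angry"}
V30_EMOTIONS = V21_EMOTIONS | {"whisper", "toneup", "tonedown"}
EMOTIONS_BY_MODEL = {"ssfm-v30": V30_EMOTIONS, "ssfm-v21": V21_EMOTIONS}


def _check_range(name: str, value: float, low: float, high: float) -> None:
    """Raise ValueError when value falls outside the inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_stream_request(  # hypothetical helper, not part of any SDK
    voice_id: str,
    text: str,
    model: str = "ssfm-v30",
    *,
    emotion_preset: str = "normal",
    emotion_intensity: float = 1.0,
    audio_format: str = "wav",
    audio_pitch: int = 0,
    audio_tempo: float = 1.0,
    target_lufs: Optional[float] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Return a JSON-serializable request body, validated against the schema limits."""
    if model not in EMOTIONS_BY_MODEL:
        raise ValueError(f"unknown model: {model}")
    if emotion_preset not in EMOTIONS_BY_MODEL[model]:
        raise ValueError(f"{emotion_preset!r} is not supported by {model}")
    if audio_format not in AUDIO_FORMATS:
        raise ValueError(f"audio_format must be one of {sorted(AUDIO_FORMATS)}")
    _check_range("emotion_intensity", emotion_intensity, 0.0, 2.0)
    _check_range("audio_pitch", audio_pitch, -12, 12)
    _check_range("audio_tempo", audio_tempo, 0.5, 2.0)

    prompt: dict[str, Any] = {
        "emotion_preset": emotion_preset,
        "emotion_intensity": emotion_intensity,
    }
    if model == "ssfm-v30":
        # PresetPrompt is selected through the emotion_type discriminator.
        # Assumption: ssfm-v21 requests leave this field out.
        prompt["emotion_type"] = "preset"

    output: dict[str, Any] = {
        "audio_format": audio_format,
        "audio_pitch": audio_pitch,
        "audio_tempo": audio_tempo,
    }
    if target_lufs is not None:
        _check_range("target_lufs", target_lufs, -70, 0)
        output["target_lufs"] = target_lufs

    body: dict[str, Any] = {
        "voice_id": voice_id,
        "text": text,
        "model": model,
        "prompt": prompt,
        "output": output,
    }
    if seed is not None:
        # Omitting the seed lets the server pick a random one.
        _check_range("seed", seed, 0, MAX_SEED)
        body["seed"] = seed
    return body


if __name__ == "__main__":
    # "your-voice-id" is a placeholder; real IDs come from the /v2/voices API.
    print(build_stream_request("your-voice-id", "Hello!", seed=42, target_lufs=-14))
```

Validating on the client only catches out-of-range values early. The server remains the authority on which emotions a given voice supports, so checking the `/v2/voices` response is still needed.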