The official Zig library for the Typecast API. Convert text to lifelike speech using AI-powered voices. Pure Zig implementation — no C dependencies. Uses only std.http.Client and std.json from the Zig standard library.

Source Code

Typecast Zig SDK Source Code

Package

Zig Package (via zig fetch)

Installation

Add the dependency using zig fetch:
zig fetch --save "https://github.com/neosapience/typecast-sdk/archive/refs/tags/typecast-zig/v0.1.0.tar.gz"
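This records the package in your build.zig.zon. The entry it writes looks roughly like the sketch below — the dependency name mirrors the `b.dependency("typecast_zig", ...)` call later in this guide, and the hash shown is a placeholder, since `zig fetch` computes the real value for you:

```zig
// build.zig.zon (excerpt) -- written automatically by `zig fetch --save`
.dependencies = .{
    .typecast_zig = .{
        .url = "https://github.com/neosapience/typecast-sdk/archive/refs/tags/typecast-zig/v0.1.0.tar.gz",
        .hash = "1220...", // placeholder: zig fetch fills in the real hash
    },
},
```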
Then add the import in your build.zig:
const typecast_dep = b.dependency("typecast_zig", .{
    .target = target,
    .optimize = optimize,
});
exe.root_module.addImport("typecast", typecast_dep.module("typecast"));

Quick Start

const std = @import("std");
const typecast = @import("typecast");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Initialize client (reads TYPECAST_API_KEY from environment)
    var client = typecast.Client.init(allocator, .{
        .api_key = std.posix.getenv("TYPECAST_API_KEY") orelse return error.MissingApiKey,
    });
    defer client.deinit();

    // Convert text to speech
    const response = try client.textToSpeech(.{
        .voice_id = "tc_672c5f5ce59fac2a48faeaee",
        .text = "Hello there! I'm your friendly text-to-speech agent.",
        .model = .ssfm_v30,
    });
    defer allocator.free(response.audio_data);

    // Save audio file
    const file = try std.fs.cwd().createFile("output.wav", .{});
    defer file.close();
    try file.writeAll(response.audio_data);

    std.debug.print("Saved {d} bytes, duration: {d:.1}s\n", .{
        response.audio_data.len, response.duration,
    });
}

Features

The Typecast Zig SDK provides the following features for text-to-speech conversion:
  • Multiple Voice Models: Support for ssfm-v30 (latest) and ssfm-v21 AI voice models
  • Multi-language Support: 37 languages including English, Korean, Spanish, Japanese, Chinese, and more
  • Emotion Control: Preset emotions (normal, happy, sad, angry, whisper, toneup, tonedown) or smart context-aware inference
  • Audio Customization: Control loudness (LUFS -70 to 0), pitch (-12 to +12 semitones), tempo (0.5x to 2.0x), and format (WAV/MP3)
  • Voice Discovery: V2 Voices API with filtering by model, gender, age, and use cases
  • Pure Zig: Zero external dependencies, uses only the standard library
  • Streaming: Real-time chunked audio delivery for low-latency playback via callback
  • Explicit Memory Management: Caller-supplied allocator with clear ownership semantics

Configuration

Set your API key via environment variable or pass directly:
const typecast = @import("typecast");

// Using environment variable (recommended)
// export TYPECAST_API_KEY="your-api-key-here"
var client = typecast.Client.init(allocator, .{
    .api_key = std.posix.getenv("TYPECAST_API_KEY") orelse return error.MissingApiKey,
});
defer client.deinit();
// Or pass directly
var client = typecast.Client.init(allocator, .{
    .api_key = "your-api-key-here",
});
defer client.deinit();
// Custom base URL
var client = typecast.Client.init(allocator, .{
    .api_key = "your-api-key-here",
    .base_url = "https://custom-api.example.com",
});
defer client.deinit();

Advanced Usage

Emotion Control (ssfm-v30)

ssfm-v30 offers two emotion control modes: Preset and Smart.
Smart mode lets the model infer the emotion from surrounding context:
const response = try client.textToSpeech(.{
    .voice_id = "tc_672c5f5ce59fac2a48faeaee",
    .text = "Everything is going to be okay.",
    .model = .ssfm_v30,
    .prompt = .{ .smart = .{
        .previous_text = "I just got the best news!",
        .next_text = "I can't wait to celebrate!",
    } },
});
defer allocator.free(response.audio_data);
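Preset mode pins a fixed emotion instead. The field names in this sketch (`preset`, `emotion_preset`, `emotion_intensity`) are assumptions extrapolated from the smart-mode example above and the preset emotions listed under Features (normal, happy, sad, angry, whisper, toneup, tonedown); check the SDK source for the exact request shape:

```zig
// Sketch: force a preset emotion (field names assumed, see note above).
const response = try client.textToSpeech(.{
    .voice_id = "tc_672c5f5ce59fac2a48faeaee",
    .text = "I can't believe we won!",
    .model = .ssfm_v30,
    .prompt = .{ .preset = .{
        .emotion_preset = .happy,
        .emotion_intensity = 1.5, // assumed knob; omit if unsupported
    } },
});
defer allocator.free(response.audio_data);
```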

Audio Customization

Control loudness, pitch, tempo, and output format:
const response = try client.textToSpeech(.{
    .voice_id = "tc_672c5f5ce59fac2a48faeaee",
    .text = "Customized audio output!",
    .model = .ssfm_v30,
    .output = .{
        .target_lufs = -14.0,
        .audio_pitch = 2,
        .audio_tempo = 1.2,
        .audio_format = .mp3,
    },
    .seed = 42,
});
defer allocator.free(response.audio_data);

Voice Discovery (V2 API)

List and filter available voices with enhanced metadata:
// Get all voices
const voices = try client.getVoicesV2(null);
defer allocator.free(voices);

// Filter by model
const filtered = try client.getVoicesV2(.{ .model = .ssfm_v30 });
defer allocator.free(filtered);

for (voices) |voice| {
    std.debug.print("ID: {s}, Name: {s}\n", .{ voice.voice_id, voice.voice_name });
}

// Get a specific voice
const voice = try client.getVoiceV2("tc_672c5f5ce59fac2a48faeaee", null);
std.debug.print("Voice: {s}\n", .{voice.voice_name});

Streaming

Stream audio chunks in real time for low-latency playback via a callback:
try client.textToSpeechStream(.{
    .voice_id = "tc_672c5f5ce59fac2a48faeaee",
    .text = "Stream this text as audio in real time.",
    .model = .ssfm_v30,
}, struct {
    var first = true;
    fn onChunk(chunk: []const u8) anyerror!void {
        var data = chunk;
        if (first) {
            data = chunk[44..]; // skip the 44-byte WAV header in the first chunk
            first = false;
        }
        // `data` is raw 16-bit mono PCM at 32000 Hz; feed it to your audio output.
        _ = data;
    }
}.onChunk);
Streaming format notes:
  • WAV: 32000 Hz, 16-bit, mono PCM. The first chunk includes a 44-byte WAV header (size field = 0xFFFFFFFF, since the total length is unknown up front); subsequent chunks are raw PCM only.
  • MP3: 320 kbps, 44100 Hz; each chunk is independently decodable.
  • The streaming endpoint does not support volume or target_lufs.
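Since the first chunk carries a canonical 44-byte RIFF/WAVE header, you can sanity-check the advertised format before skipping it. This helper relies only on the standard WAV header layout (sample rate is a little-endian u32 at byte offset 24), not on the SDK:

```zig
const std = @import("std");

/// Returns the sample rate from a canonical 44-byte WAV header,
/// or an error if the chunk does not start with RIFF/WAVE.
fn wavSampleRate(chunk: []const u8) !u32 {
    if (chunk.len < 44) return error.ShortChunk;
    if (!std.mem.eql(u8, chunk[0..4], "RIFF")) return error.NotWav;
    if (!std.mem.eql(u8, chunk[8..12], "WAVE")) return error.NotWav;
    return std.mem.readInt(u32, chunk[24..28], .little);
}
```

For the streaming endpoint described above, this should return 32000.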

Supported Languages

The SDK supports 37 languages with automatic language detection:
Code  Language     Code  Language     Code  Language
eng   English      jpn   Japanese     ukr   Ukrainian
kor   Korean       ell   Greek        ind   Indonesian
spa   Spanish      tam   Tamil        dan   Danish
deu   German       tgl   Tagalog      swe   Swedish
fra   French       fin   Finnish      msa   Malay
ita   Italian      zho   Chinese      ces   Czech
pol   Polish       slk   Slovak       por   Portuguese
nld   Dutch        ara   Arabic       bul   Bulgarian
rus   Russian      hrv   Croatian     ron   Romanian
ben   Bengali      hin   Hindi        hun   Hungarian
nan   Hokkien      nor   Norwegian    pan   Punjabi
tha   Thai         tur   Turkish      vie   Vietnamese
yue   Cantonese
If not specified, the language will be automatically detected from the input text.
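To pin the language rather than rely on detection, pass the ISO 639-3 code from the table above. The `language` field name in this sketch is an assumption; verify it against the SDK's request struct:

```zig
// Sketch: explicitly request Korean synthesis (`language` field assumed).
const response = try client.textToSpeech(.{
    .voice_id = "tc_672c5f5ce59fac2a48faeaee",
    .text = "안녕하세요!",
    .model = .ssfm_v30,
    .language = "kor",
});
defer allocator.free(response.audio_data);
```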

Error Handling

The SDK uses Zig’s error union for handling API errors:
const response = client.textToSpeech(.{
    .voice_id = "tc_672c5f5ce59fac2a48faeaee",
    .text = "Hello",
    .model = .ssfm_v30,
}) catch |err| switch (err) {
    error.Unauthorized => {
        std.debug.print("Invalid API key\n", .{});
        return err;
    },
    error.PaymentRequired => {
        std.debug.print("Insufficient credits\n", .{});
        return err;
    },
    error.RateLimited => {
        std.debug.print("Rate limit exceeded - please retry later\n", .{});
        return err;
    },
    error.NotFound => {
        std.debug.print("Voice not found\n", .{});
        return err;
    },
    else => return err,
};
defer allocator.free(response.audio_data);

Error Types

Error                      Status Code  Description
error.BadRequest           400          Invalid request parameters
error.Unauthorized         401          Invalid or missing API key
error.PaymentRequired      402          Insufficient credits
error.NotFound             404          Resource not found
error.UnprocessableEntity  422          Validation error
error.RateLimited          429          Rate limit exceeded
error.InternalServerError  500          Server error
error.JsonParseError       -            JSON parsing error

API Reference

Client Methods

Method                                 Description
init(allocator, config)                Create a client with the given configuration
deinit()                               Clean up client resources
textToSpeech(request)                  Convert text to speech audio
textToSpeechStream(request, callback)  Stream audio chunks via a callback
getMySubscription()                    Get subscription info
getVoices(model)                       Get available voices (V1)
getVoicesV2(filter)                    Get voices with enhanced metadata (V2)
getVoiceV2(voice_id, model)            Get a specific voice
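For example, checking your subscription before a large batch job might look like the sketch below. The `getMySubscription` method is listed above, but the shape of its return value is an assumption here; consult the SDK source for the actual field names:

```zig
// Sketch: inspect subscription state (return-value fields assumed).
const sub = try client.getMySubscription();
std.debug.print("credits remaining: {d}\n", .{sub.remaining});
```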