Access the Typecast API with our official Swift SDK.
The official Swift library for the Typecast API. Convert text to lifelike speech using AI-powered voices. The SDK is compatible with Swift 5.9+ and supports all Apple platforms: iOS, macOS, tvOS, watchOS, and visionOS.
Enter the repository URL: https://github.com/neosapience/typecast-sdk.git
Select version rules and click Add Package
Select the Typecast library and add it to your target
Latest registered version: typecast-swift/v0.3.1 (see the SDK Git tags).
Make sure you have Swift 5.9 or higher installed. The SDK uses Swift Concurrency (async/await), which requires this minimum version.
ssfm-v30 offers two emotion control modes: Preset and Smart.
Smart Mode
Let the AI infer emotion from context:
// `client` is an already-configured Typecast client; `audioPlayer` is a stored AVAudioPlayer? property.
let request = TTSRequest(
    voiceId: "tc_672c5f5ce59fac2a48faeaee",
    text: "Everything is going to be okay.",
    model: .ssfmV30,
    prompt: .smart(SmartPrompt(
        previousText: "I just got the best news!",  // Optional context
        nextText: "I can't wait to celebrate!"      // Optional context
    ))
)
let response = try await client.textToSpeech(request)
audioPlayer = try AVAudioPlayer(data: response.audioData)
audioPlayer?.play()

Preset Mode
Explicitly set emotion with preset values:
let request = TTSRequest(
    voiceId: "tc_672c5f5ce59fac2a48faeaee",
    text: "I am so excited to show you these features!",
    model: .ssfmV30,
    prompt: .preset(PresetPrompt(
        emotionPreset: .happy,   // normal, happy, sad, angry, whisper, toneup, tonedown
        emotionIntensity: 1.5    // Range: 0.0 to 2.0
    ))
)
let response = try await client.textToSpeech(request)
audioPlayer = try AVAudioPlayer(data: response.audioData)
audioPlayer?.play()

Convenience Method
Use the convenience method for quick emotion control:
let audio = try await client.speak(
    "I'm so excited!",
    voiceId: "tc_672c5f5ce59fac2a48faeaee",
    emotion: .happy,
    intensity: 1.5
)
audioPlayer = try AVAudioPlayer(data: audio.audioData)
audioPlayer?.play()
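The preset examples document an intensity range of 0.0 to 2.0. If the value comes from user input (a slider, a config file), it may be worth normalizing it before building a request. The following is a minimal sketch, not part of the SDK: `clampedIntensity` is a hypothetical helper, it assumes the intensity is a `Double`, and it assumes out-of-range values should be clamped rather than rejected.

```swift
import Foundation

/// Hypothetical helper (not part of the Typecast SDK): clamps an emotion
/// intensity into the documented 0.0...2.0 range.
func clampedIntensity(_ value: Double) -> Double {
    // NaN cannot be ordered, so fall back to 1.0 (an assumed neutral default).
    guard !value.isNaN else { return 1.0 }
    return min(max(value, 0.0), 2.0)
}
```

The result could then be passed wherever the examples use a literal such as `1.5`.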
Stream audio chunks in real time for low-latency playback:
import AVFoundation
import Typecast

// `client` and `request` are assumed to have been created beforehand.
let engine = AVAudioEngine()
let playerNode = AVAudioPlayerNode()
// Matches the streamed WAV format: 32000 Hz, 16-bit, mono PCM
let format = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 32000, channels: 1, interleaved: true)!
engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: format)
try engine.start()
playerNode.play()

let stream = try await client.textToSpeechStream(request)
var first = true
for try await chunk in stream {
    var pcmData = chunk
    if first {
        pcmData = chunk.dropFirst(44)  // Skip 44-byte WAV header
        first = false
    }
    // Skip chunks with no complete 16-bit sample to avoid creating an empty buffer
    guard pcmData.count >= 2 else { continue }
    let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(pcmData.count / 2))!
    buffer.frameLength = buffer.frameCapacity
    // Copy the raw PCM bytes into the buffer's single Int16 channel
    pcmData.withUnsafeBytes { ptr in
        buffer.int16ChannelData!.pointee.update(from: ptr.bindMemory(to: Int16.self).baseAddress!, count: Int(buffer.frameLength))
    }
    playerNode.scheduleBuffer(buffer)
}
WAV streaming format: 32000 Hz, 16-bit, mono PCM. The first chunk includes a 44-byte WAV header (size = 0xFFFFFFFF); subsequent chunks are raw PCM only.
MP3 streaming format: 320 kbps, 44100 Hz. Each chunk is independently decodable.
Use Typecast.OutputStream to avoid a name collision with Foundation.OutputStream.
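The WAV streaming layout described above (one 44-byte header, then raw 16-bit PCM) can also be decoded without AVFoundation, for example to compute levels or write samples elsewhere. The sketch below is not part of the SDK: `StreamingWAVDecoder` is a hypothetical type, and it assumes the canonical 44-byte RIFF/WAVE header layout with little-endian fields. Unlike a plain `dropFirst(44)`, it tolerates a header split across chunks and carries a trailing odd byte over to the next chunk.

```swift
import Foundation

/// Hypothetical helper (not part of the Typecast SDK): turns streamed WAV
/// chunks into Int16 samples, assuming a canonical 44-byte header.
struct StreamingWAVDecoder {
    static let headerSize = 44

    private var pending = Data()        // bytes received but not yet decoded
    private var headerParsed = false
    private(set) var channels: UInt16 = 0
    private(set) var sampleRate: UInt32 = 0
    private(set) var bitsPerSample: UInt16 = 0

    // Little-endian field readers.
    private static func le16(_ b: [UInt8], _ at: Int) -> UInt16 {
        UInt16(b[at]) | (UInt16(b[at + 1]) << 8)
    }
    private static func le32(_ b: [UInt8], _ at: Int) -> UInt32 {
        UInt32(le16(b, at)) | (UInt32(le16(b, at + 2)) << 16)
    }

    /// Feeds one chunk and returns the complete samples it made available.
    mutating func feed(_ chunk: Data) -> [Int16] {
        pending.append(chunk)

        if !headerParsed {
            // Wait until the whole header has arrived; it may span chunks.
            guard pending.count >= Self.headerSize else { return [] }
            let header = [UInt8](pending.prefix(Self.headerSize))
            channels = Self.le16(header, 22)
            sampleRate = Self.le32(header, 24)
            bitsPerSample = Self.le16(header, 34)
            // The data-size field at offset 40 is ignored: in a stream it is
            // a placeholder (0xFFFFFFFF), not a real length.
            pending = Data(pending.dropFirst(Self.headerSize))
            headerParsed = true
        }

        // Decode whole 16-bit samples; keep a trailing odd byte for later.
        let bytes = [UInt8](pending)
        let sampleCount = bytes.count / 2
        var samples = [Int16]()
        samples.reserveCapacity(sampleCount)
        for i in 0..<sampleCount {
            let raw = UInt16(bytes[2 * i]) | (UInt16(bytes[2 * i + 1]) << 8)
            samples.append(Int16(bitPattern: raw))
        }
        pending = Data(bytes[(sampleCount * 2)...])
        return samples
    }
}
```

In a streaming loop, each chunk would be passed to `feed(_:)` and the returned samples copied into an audio buffer. After the first successful call, `sampleRate`, `channels`, and `bitsPerSample` can be checked against the documented 32000 Hz, mono, 16-bit format.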
textToSpeechWithTimestamps() wraps POST /v1/text-to-speech/with-timestamps and returns the audio together with per-word and per-character alignment data — useful for karaoke highlights, subtitle generation, and lip-sync applications.
Pass granularity: .word (default) or granularity: .char to control the alignment unit.
let request = TTSRequestWithTimestamps(
    voiceId: "tc_60e5426de8b95f1d3000d7b5",
    text: "Hello. How are you?",
    model: .ssfmV30,
    granularity: .char  // required for Japanese / Chinese
)
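// Assumed call shape, mirroring client.textToSpeech(request)
let result = try await client.textToSpeechWithTimestamps(request)
let srt = result.toSrt()  // SRT subtitles
print(srt)
let vtt = result.toVtt()  // WebVTT subtitles
print(vtt)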
Japanese / Chinese: Word-level segmentation is not meaningful for languages without whitespace delimiters (jpn, zho). Use .char granularity for these languages to get character-level alignment.
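For karaoke-style highlighting, the alignment data has to be mapped from the current playback time to the word or character being spoken. The sketch below shows one way to do that with a binary search. It does not use the SDK's response type, whose field names are not shown here: `TimedUnit` and `activeUnitIndex` are hypothetical, and the code assumes times are in seconds and that units are sorted by start time without overlapping.

```swift
import Foundation

/// Hypothetical alignment entry; the SDK's real type and field names may differ.
struct TimedUnit {
    let text: String
    let start: TimeInterval  // seconds from the start of the audio
    let end: TimeInterval    // exclusive end time in seconds
}

/// Returns the index of the unit spoken at `time`, or nil during silence.
/// Assumes `units` is sorted by start time and non-overlapping.
func activeUnitIndex(at time: TimeInterval, in units: [TimedUnit]) -> Int? {
    var low = 0
    var high = units.count - 1
    while low <= high {
        let mid = (low + high) / 2
        if time < units[mid].start {
            high = mid - 1          // the unit starts later: search left half
        } else if time >= units[mid].end {
            low = mid + 1           // the unit already ended: search right half
        } else {
            return mid              // start <= time < end
        }
    }
    return nil
}
```

A player could call `activeUnitIndex(at:in:)` on each display refresh with the current playback time and highlight the returned unit, leaving nothing highlighted when the result is nil.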