Access the Typecast API with our official Ruby SDK.
Convert text to lifelike speech using AI-powered voices, generate timestamps, list voices, and create custom voices. The Ruby SDK uses only the Ruby standard library at runtime and supports Ruby 2.6+.
Set your API key via environment variable or constructor:
export TYPECAST_API_KEY="your-api-key-here"
When requests go through your own proxy, set base_url to the proxy endpoint and omit api_key. The SDK will not send the X-API-KEY header for empty or missing keys. Requests to the default Typecast host still require an API key.
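The header rule above can be pictured with a small sketch. `build_headers` is a hypothetical helper written for illustration, not a function from the SDK, and the `Content-Type` value is an assumption; only the "no `X-API-KEY` for empty or missing keys" behavior comes from the text above.

```ruby
# Hypothetical illustration, not SDK code: build request headers and
# leave out X-API-KEY when the key is nil or an empty string.
def build_headers(api_key)
  headers = { "Content-Type" => "application/json" } # assumed content type
  key = api_key.to_s
  headers["X-API-KEY"] = key unless key.empty?
  headers
end

# Key taken from the environment; may be nil when going through a proxy.
headers = build_headers(ENV["TYPECAST_API_KEY"])
```

With this shape, a proxy deployment simply never sets `TYPECAST_API_KEY`, and the proxy is free to attach its own credentials upstream.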
ssfm-v30 offers two emotion control modes: Preset and Smart.
Smart Mode
Let the AI infer emotion from context:
response = client.text_to_speech(
  Typecast::Models::TTSRequest.new(
    voice_id: "tc_672c5f5ce59fac2a48faeaee",
    text: "Everything is going to be okay.",
    model: Typecast::Models::TTS_MODEL_V30,
    prompt: Typecast::Models::SmartPrompt.new(
      previous_text: "I just got the best news!",
      next_text: "I can't wait to celebrate!"
    )
  )
)
Preset Mode
Explicitly set emotion with preset values:
response = client.text_to_speech(
  Typecast::Models::TTSRequest.new(
    voice_id: "tc_672c5f5ce59fac2a48faeaee",
    text: "I am so excited to show you these features!",
    model: Typecast::Models::TTS_MODEL_V30,
    prompt: Typecast::Models::PresetPrompt.new(
      emotion_preset: "happy",
      emotion_intensity: 1.5
    )
  )
)
Use generate_to_file when you want the SDK to synthesize speech and write the audio bytes directly to a local file. The model defaults to ssfm-v30, and .mp3 / .wav extensions infer the output format when no output format is set. Browse available voice IDs on the Voices page.
client.generate_to_file(
  'output.mp3',
  text: 'Hello from Typecast.',
  voice_id: 'tc_672c5f5ce59fac2a48faeaee' # Find voice IDs at https://typecast.ai/developers/api/voices
)
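The extension rule described above could look roughly like the following. `infer_audio_format` is a hypothetical helper, not the SDK's implementation; treating extensions case-insensitively and returning `nil` for unknown extensions are both assumptions.

```ruby
# Hypothetical illustration of the documented rule: an explicit format wins,
# otherwise .mp3 / .wav extensions select the output format.
def infer_audio_format(path, explicit_format = nil)
  return explicit_format if explicit_format

  case File.extname(path).downcase # case-insensitive match is an assumption
  when ".mp3" then "mp3"
  when ".wav" then "wav"
  end # any other extension yields nil in this sketch
end
```

Returning `nil` for anything else keeps the sketch neutral about what the SDK does with unrecognized extensions.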
Use text pause markup when you only need silent gaps inside one composed text segment. Put <|5s|>, <|1s|>, <|0.3s|>, or <|0.34413s|> directly in the text. The value is interpreted as seconds and must end with s. This keeps the pause expression visible in plain text without adding separate pause calls.
audio = client
  .compose_speech
  .defaults(voice_id: "tc_672c5f5ce59fac2a48faeaee", model: "ssfm-v30")
  .say("Hello<|5s|>Nice to meet you<|1s|>Today<|2s|>how does the weather feel?")
  .generate
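Because the pause markup lives in plain text, it is easy to inspect before sending a request. The sketch below is not part of the SDK: `PAUSE_MARKUP`, `pause_durations`, and `strip_pauses` are hypothetical names, and the pattern assumes a non-negative decimal number followed by `s`, which matches the examples above but may be narrower or wider than what the API accepts.

```ruby
# Assumed grammar: <| number s |> where number is digits with an optional
# fractional part, e.g. <|5s|> or <|0.34413s|>.
PAUSE_MARKUP = /<\|(\d+(?:\.\d+)?)s\|>/

# Returns the pause lengths in seconds, in the order they appear.
def pause_durations(text)
  text.scan(PAUSE_MARKUP).flatten.map(&:to_f)
end

# Returns the text with all pause markup removed.
def strip_pauses(text)
  text.gsub(PAUSE_MARKUP, "")
end

script = "Hello<|5s|>Nice to meet you<|1s|>Today<|2s|>how does the weather feel?"
total_pause = pause_durations(script).sum # 8.0 seconds of silence requested
```

A check like this can help estimate how much of a clip's length comes from silence before the audio is generated.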
Use the composer chaining API when one output file needs different voices or per-segment options such as pitch, tempo, prompt, or seed. The composer generates each segment as WAV, trims leading/trailing silent PCM samples, and concatenates the result. If you need MP3, generate WAV first and convert it in your app or server pipeline.
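To make the trim-and-concatenate step concrete, here is a minimal sketch operating on raw PCM. It is not the SDK's code: `trim_silence` and `concat_segments` are hypothetical names, the input is assumed to be headerless 16-bit little-endian mono PCM, and treating only exact-zero samples as silent (threshold 0) is an assumption.

```ruby
# Drops leading and trailing samples whose magnitude is at or below the
# threshold. Input and output are raw 16-bit little-endian PCM strings.
def trim_silence(pcm, threshold = 0)
  samples = pcm.unpack("s<*")
  first = samples.index { |s| s.abs > threshold }
  return "".b if first.nil? # the whole segment is silent

  last = samples.rindex { |s| s.abs > threshold }
  samples[first..last].pack("s<*")
end

# Trims every segment, then joins them into one PCM byte string.
def concat_segments(segments, threshold = 0)
  out = "".b
  segments.each { |pcm| out << trim_silence(pcm, threshold) }
  out
end
```

Silence inside a segment is left alone; only the edges are trimmed, so intentional pauses within a segment survive.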
Use text_to_speech_stream() to call the streaming endpoint:
client.text_to_speech_stream(
  Typecast::Models::TTSRequestStream.new(
    voice_id: "tc_672c5f5ce59fac2a48faeaee",
    text: "Stream this text as audio.",
    model: Typecast::Models::TTS_MODEL_V30,
    output: Typecast::Models::OutputStream.new(audio_format: "wav")
  )
) do |audio|
  File.binwrite("stream.wav", audio)
end
WAV streaming format: 32000 Hz, 16-bit, mono PCM. The first chunk includes a 44-byte WAV header (size = 0xFFFFFFFF); subsequent chunks are raw PCM only.
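Since the streamed header carries a placeholder size, some players or tools may want the real sizes filled in once the stream has ended. The sketch below is one way to do that and is not part of the SDK: `finalize_wav` and `pcm_duration_seconds` are hypothetical helpers, and the byte offsets assume the canonical 44-byte RIFF/WAVE PCM header layout (RIFF size at offset 4, data size at offset 40).

```ruby
WAV_HEADER_BYTES = 44
STREAM_SAMPLE_RATE = 32_000   # Hz, from the format described above
STREAM_BYTES_PER_SAMPLE = 2   # 16-bit mono

# Joins streamed chunks (header in the first one) and overwrites the
# placeholder sizes with the actual ones.
def finalize_wav(chunks)
  wav = "".b
  chunks.each { |chunk| wav << chunk.b }
  raise ArgumentError, "stream shorter than a WAV header" if wav.bytesize < WAV_HEADER_BYTES

  data_size = wav.bytesize - WAV_HEADER_BYTES
  wav[4, 4] = [36 + data_size].pack("V")  # RIFF chunk size
  wav[40, 4] = [data_size].pack("V")      # data chunk size
  wav
end

# Playback length of a given number of PCM bytes at the streaming format.
def pcm_duration_seconds(pcm_bytes)
  pcm_bytes.to_f / (STREAM_SAMPLE_RATE * STREAM_BYTES_PER_SAMPLE)
end
```

At this format one second of audio is 64,000 bytes, which is a handy figure when sizing buffers for playback.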
text_to_speech_with_timestamps() wraps POST /v1/text-to-speech/with-timestamps and returns audio together with word- or character-level alignment data.
result = client.text_to_speech_with_timestamps(
  Typecast::Models::TTSRequest.new(
    voice_id: "tc_60e5426de8b95f1d3000d7b5",
    text: "Hello. How are you?",
    model: Typecast::Models::TTS_MODEL_V30
  )
)
result.save_audio("output.wav")
puts "Duration: #{result.audio_duration}s"
result.words.each do |word|
  puts "[#{word.start_time}s - #{word.end_time}s] #{word.word}"
end
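One common use of word-level alignment is caption generation. The sketch below turns word timings into SubRip (SRT) text. It is an illustration, not an SDK feature: `TimedWord` is a hypothetical stand-in for the objects in `result.words`, reusing the `word`, `start_time`, and `end_time` fields shown above and assuming the times are numeric seconds.

```ruby
# Hypothetical stand-in for the SDK's word objects.
TimedWord = Struct.new(:word, :start_time, :end_time, keyword_init: true)

# Formats seconds as an SRT timestamp: HH:MM:SS,mmm
def srt_timestamp(seconds)
  total_ms = (seconds * 1000).round
  hours, rest = total_ms.divmod(3_600_000)
  minutes, rest = rest.divmod(60_000)
  secs, millis = rest.divmod(1000)
  format("%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
end

# Emits one numbered SRT cue per word.
def words_to_srt(words)
  cues = words.each_with_index.map do |w, i|
    "#{i + 1}\n#{srt_timestamp(w.start_time)} --> #{srt_timestamp(w.end_time)}\n#{w.word}\n"
  end
  cues.join("\n")
end

sample = [
  TimedWord.new(word: "Hello.", start_time: 0.0, end_time: 0.5),
  TimedWord.new(word: "How", start_time: 0.75, end_time: 1.0)
]
srt = words_to_srt(sample)
```

One cue per word is the simplest mapping; grouping several words into a cue would usually read better on screen.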