The Gemini API can transform text input into single-speaker or multi-speaker audio (a podcast-like experience, similar to NotebookLM). This notebook provides an example of how to control the Text-to-speech (TTS) capability of the Gemini model and guide its style, accent, pace, and tone.
Before diving into the code, you should try this capability in AI Studio.
Note that the TTS model can only do TTS; it does not have the reasoning capabilities of the other Gemini models. You can ask for things like “say this in that style”, but not “tell me why the sky is blue”. If that’s what you want, you should use the Live API instead.
The documentation is also a good place to start discovering the TTS capability.
Important
Audio-out is a preview feature. It is free to use for now with quota limitations, but is subject to change.
You can create your API key using Google AI Studio with a single click.
Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.
Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:
$ export GEMINI_API_KEY="<YOUR_API_KEY>"
Load the API key
To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.
$ npm install dotenv
Then, we can load the API key in our code:
const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({ path: "../.env" });

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note
In our particular case the .env file is one directory up from the notebook, hence we use ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether.
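In that case the call reduces to, for example:

// With no explicit path, dotenv looks for a .env file in the current working directory.
dotenv.config();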
With the new SDK, you now only need to initialize a client with your API key (or OAuth if using Vertex AI). The model is set in each call.
const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });
Select a model
Audio-out is only supported by the “tts” models, gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts. For extended information on all Gemini models, check the documentation.
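Store the chosen model name in a constant that the later calls reuse as MODEL_ID. As a minimal sketch, assuming the Flash preview model:

// Assumed choice: swap in "gemini-2.5-pro-preview-tts" if you prefer the Pro model.
const MODEL_ID = "gemini-2.5-flash-preview-tts";

The helper below extracts the returned audio, saves it to disk, and embeds an HTML player in the notebook output: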
import { GenerateContentResponse } from "@google/genai";

function playAudio(response: GenerateContentResponse, filePath: string) {
  if (response.candidates?.[0]?.content) {
    const response_content = response.candidates[0].content;
    if (response_content.parts) {
      const response_blob = response_content.parts[0].inlineData;
      if (response_blob?.data) {
        const response_filepath = path.join("../assets/tts", filePath);
        saveAudioToFile(base64ToInt16Array(response_blob.data), response_filepath);
        tslab.display.html(`
          <audio controls>
            <source src="${response_filepath}" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
        `);
      }
    }
  }
}
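playAudio relies on the notebook’s saveAudioToFile and base64ToInt16Array utilities. Here is a rough sketch of what they could look like, assuming the wavefile package and the 24 kHz, 16-bit mono PCM that the TTS model returns:

import * as fs from "fs";
import * as path from "path";
import { WaveFile } from "wavefile";

// Decode the base64 payload into 16-bit PCM samples.
function base64ToInt16Array(base64: string): Int16Array {
  const buffer = Buffer.from(base64, "base64");
  return new Int16Array(buffer.buffer, buffer.byteOffset, buffer.byteLength / 2);
}

// Wrap the raw PCM samples in a WAV container and write them to disk.
function saveAudioToFile(samples: Int16Array, filePath: string) {
  const wav = new WaveFile();
  wav.fromScratch(1, 24000, "16", samples); // 1 channel, 24 kHz, 16-bit
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, wav.toBuffer());
  console.log(`Audio saved to ${filePath}`);
}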
Generate a simple audio output
Let’s start with something simple:
const simple_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: ["Say 'hello, my name is Gemini!'"],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});
The generated output is in the response’s inlineData and, as you can see, it is indeed audio data.
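For example, you can peek at the first part’s mimeType (the value shown here is indicative only):

// Inspect the first returned part: it carries inline audio data rather than text.
const audioPart = simple_response.candidates?.[0]?.content?.parts?.[0];
console.log(audioPart?.inlineData?.mimeType); // e.g. "audio/L16;codec=pcm;rate=24000"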
playAudio(simple_response, `simple_response.wav`);
Audio saved to ../assets/tts/simple_response.wav
Note
The model can only do TTS, so you should always tell it to “say”, “read”, or “TTS” something; otherwise it won’t do anything.
Control how the model speaks
There are 30 different built-in voices and 24 supported languages, which gives you plenty of combinations to try.
Choose a voice
Choose a voice among the 30 different ones. You can find their characteristics in the documentation.
const VOICE_ID = "Leda";
const custom_voice_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `Say "I am a very knowledgeable model, especially when using grounding", wait 5 seconds then say "Don't you think?".`,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: {
          voiceName: VOICE_ID,
        },
      },
    },
  },
});

playAudio(custom_voice_response, `custom_voice_response.wav`);
Audio saved to ../assets/tts/custom_voice_response.wav
Change the language
Just tell the model to speak in a certain language and it will. The documentation lists all the supported ones.
const custom_language_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Read this in French:

    Les chaussettes de l'archiduchesse sont-elles sèches ? Archi-sèches ?
    Un chasseur sachant chasser doit savoir chasser sans son chien.
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});

playAudio(custom_language_response, `custom_language_response.wav`);
Audio saved to ../assets/tts/custom_language_response.wav
Prompt the model to speak in certain ways
You can control style, tone, accent, and pace using natural language prompts, for example:
const custom_style_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Say in a spooky whisper:
    "By the pricking of my thumbs...
    Something wicked this way comes!"
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});

playAudio(custom_style_response, `custom_style_response.wav`);
Audio saved to ../assets/tts/custom_style_response.wav
const custom_pace_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Read this disclaimer in as fast a voice as possible while remaining intelligible:

    [The author] assumes no responsibility or liability for any errors or omissions in the
    content of this site. The information contained in this site is provided on an 'as is'
    basis with no guarantees of completeness, accuracy, usefulness or timeliness
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});

playAudio(custom_pace_response, `custom_pace_response.wav`);
Audio saved to ../assets/tts/custom_pace_response.wav
Multi-speakers
The TTS model can also read a discussion between two speakers (like the NotebookLM podcast feature). You just need to tell it that there are two speakers:
const multi_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:

    Speaker1: So... what's on the agenda today?
    Speaker2: You're never going to guess!
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});

playAudio(multi_response, `multi_response.wav`);
Audio saved to ../assets/tts/multi_response.wav
You can also select the voice for each participant and pass their names to the model.
But first let’s generate a discussion between two scientists:
const multi_speaker_transcript = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-05-20",
  contents: [
    `
    Hi, please generate a short (like 100 words) transcript that reads like it was clipped
    from a podcast by excited herpetologists, Dr. Claire and her assistant, the young Aurora.
    `,
  ],
});

tslab.display.markdown(multi_speaker_transcript.text ?? "");
(Sound of distant jungle chirps fading, replaced by a slightly crackly podcast mic)
Dr. Claire: (Practically bouncing) I’m still buzzing, Aurora! Genuinely buzzing!
Aurora: (Squealing) Dr. Claire, my hands are still shaking! The Emerald Mossback! We actually saw one!
Dr. Claire: Not just saw it, Aurora! We documented its full iridescent display! The way it melded with the bromeliads… it was indistinguishable! The camouflage… it’s beyond anything in the textbooks!
Aurora: It was like watching magic! No wonder they were thought extinct for decades! Who could ever spot that? My heart was pounding!
Dr. Claire: Precisely! This changes our entire understanding of its crypsis. This episode is going to be legendary for ‘Reptile Revelations’!
Aurora: Totally! Best day ever, Dr. Claire! Best. Day. Ever!
const text = multi_speaker_transcript.text ?? "";

const transcript_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `TTS the following conversation between a very excited Dr. Claire and her assistant, the young Aurora: ${text}`,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: [
          {
            speaker: "Dr. Claire",
            voiceConfig: {
              prebuiltVoiceConfig: {
                voiceName: "Aoede",
              },
            },
          },
          {
            speaker: "Aurora",
            voiceConfig: {
              prebuiltVoiceConfig: {
                voiceName: "Leda",
              },
            },
          },
        ],
      },
    },
  },
});

playAudio(transcript_response, `transcript_response.wav`);
Audio saved to ../assets/tts/transcript_response.wav
Voice Gallery
Here is a gallery of the voices you can use with the TTS model. You can use them in the voiceConfig parameter of the model call.
const audioMap: Record<string, string> = {};
const voices = [
  "Zephyr", "Puck", "Charon", "Kore", "Fenrir", "Leda", "Orus", "Aoede",
  "Callirrhoe", "Autonoe", "Enceladus", "Iapetus", "Umbriel", "Algieba",
  "Despina", "Erinome", "Algenib", "Rasalgethi", "Laomedeia", "Achernar",
  "Alnilam", "Schedar", "Gacrux", "Pulcherrima", "Achird", "Zubenelgenubi",
  "Vindemiatrix", "Sadachbia", "Sadaltager", "Sulafat",
];

// Generate audio for all voices and directly pass the base64 data to the audio player.
for (const voice of voices) {
  const filePath = path.join("../assets/voices", `${voice}.wav`);
  if (audioMap[voice] || fs.existsSync(filePath)) {
    audioMap[voice] = audioMap[voice] ? audioMap[voice] : fs.readFileSync(filePath, "base64");
    console.debug(`Audio for voice ${voice} already generated, skipping...`);
    continue;
  }
  const voice_response = await ai.models.generateContent({
    model: MODEL_ID,
    contents: [`Say "Hello, I am a voice named ${voice}!"`],
    config: {
      responseModalities: [google.Modality.AUDIO],
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: {
            voiceName: voice,
          },
        },
      },
    },
  });
  audioMap[voice] =
    voice_response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
  saveAudioToFile(base64ToInt16Array(audioMap[voice]), filePath);
}
console.debug("Generated audio for all voices");
Audio for voice Zephyr already generated, skipping...
Audio for voice Puck already generated, skipping...
Audio for voice Charon already generated, skipping...
Audio for voice Kore already generated, skipping...
Audio for voice Fenrir already generated, skipping...
Audio for voice Leda already generated, skipping...
Audio for voice Orus already generated, skipping...
Audio for voice Aoede already generated, skipping...
Audio for voice Callirrhoe already generated, skipping...
Audio for voice Autonoe already generated, skipping...
Audio for voice Enceladus already generated, skipping...
Audio for voice Iapetus already generated, skipping...
Audio for voice Umbriel already generated, skipping...
Audio for voice Algieba already generated, skipping...
Audio for voice Despina already generated, skipping...
Audio for voice Erinome already generated, skipping...
Audio for voice Algenib already generated, skipping...
Audio for voice Rasalgethi already generated, skipping...
Audio for voice Laomedeia already generated, skipping...
Audio for voice Achernar already generated, skipping...
Audio for voice Alnilam already generated, skipping...
Audio for voice Schedar already generated, skipping...
Audio for voice Gacrux already generated, skipping...
Audio for voice Pulcherrima already generated, skipping...
Audio for voice Achird already generated, skipping...
Audio for voice Zubenelgenubi already generated, skipping...
Audio for voice Vindemiatrix already generated, skipping...
Audio for voice Sadachbia already generated, skipping...
Audio for voice Sadaltager already generated, skipping...
Audio for voice Sulafat already generated, skipping...
Generated audio for all voices
// Make an HTML grid gallery of the audio files.
tslab.display.html(`
  <div style="display: grid; grid-template-columns: repeat(auto-fill, minmax(290px, 1fr)); gap: 10px;">
    ${Object.keys(audioMap)
      .map(
        (voice) => `
      <div style="text-align: center;">
        <audio controls>
          <source src="../assets/voices/${voice}.wav" type="audio/wav">
          Your browser does not support the audio element.
        </audio>
        <p>${voice}</p>
      </div>
    `
      )
      .join("")}
  </div>
`);
Zephyr
Puck
Charon
Kore
Fenrir
Leda
Orus
Aoede
Callirrhoe
Autonoe
Enceladus
Iapetus
Umbriel
Algieba
Despina
Erinome
Algenib
Rasalgethi
Laomedeia
Achernar
Alnilam
Schedar
Gacrux
Pulcherrima
Achird
Zubenelgenubi
Vindemiatrix
Sadachbia
Sadaltager
Sulafat
What’s next?
Now that you know how to generate multi-speaker conversations, here are other cool things to try:
Instead of speech, learn how to generate music using Lyria RealTime.