Gemini API: Gemini Text-to-speech

The Gemini API can transform text input into single-speaker or multi-speaker audio (a podcast-like experience similar to NotebookLM). This notebook provides an example of how to control the Text-to-speech (TTS) capability of the Gemini model and guide its style, accent, pace, and tone.

Before diving into the code, you should try this capability in AI Studio.

Note that the TTS model can only do TTS; it does not have the reasoning capabilities of the Gemini models. You can ask it things like “say this in that style”, but not “tell me why the sky is blue”. If that’s what you want, you should use the Live API instead.

The documentation is also a good place to start discovering the TTS capability.

Important

Audio-out is a preview feature. It is free to use for now with quota limitations, but is subject to change.

Setup

Install the Google GenAI SDK

Install the Google GenAI SDK from npm.

$ npm install @google/genai

Setup your API key

You can create your API key using Google AI Studio with a single click.

Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.

Here’s how to set it up in a .env file:

$ touch .env
$ echo "GEMINI_API_KEY=<YOUR_API_KEY>" >> .env
Tip

Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:

$ export GEMINI_API_KEY="<YOUR_API_KEY>"

Load the API key

To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.

$ npm install dotenv

Then, we can load the API key in our code:

const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({
  path: "../.env",
});

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note

In our particular case the .env file is one directory up from the notebook, hence we need to use ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether (see the sketch after the directory tree below).

│
├── .env
└── quickstarts
    └── Get_started_TTS.ipynb
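
If the .env file were next to the notebook instead, a minimal sketch of the call would be the following, relying on dotenv's default behavior of looking for a .env file in the current working directory:

const dotenv = require("dotenv") as typeof import("dotenv");

// No `path` option: dotenv looks for a `.env` file in the current working directory.
dotenv.config();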

Initialize SDK Client

With the new SDK, you now only need to initialize a client with your API key (or OAuth if using Vertex AI). The model is set in each call.

const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });

Select a model

Audio-out is only supported by the “tts” models, gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts. Check the documentation for more information about all Gemini models.

const tslab = require("tslab") as typeof import("tslab");

const MODEL_ID = "gemini-2.5-flash-preview-tts";

Utilities

The simplest way to play back the audio in the notebook is to write it out to a .wav file, so here is a simple wave file writer:

const fs = require("fs") as typeof import("fs");
const path = require("path") as typeof import("path");
const wave = require("wavefile") as typeof import("wavefile");

// Wrap the raw PCM samples in a WAV container (mono, 24 kHz, 16-bit) and save it to disk.
function saveAudioToFile(audioData: Int16Array, filePath: string) {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const wav = new wave.WaveFile();
  wav.fromScratch(1, 24000, "16", audioData);
  fs.writeFileSync(filePath, wav.toBuffer());
  console.debug(`Audio saved to ${filePath}`);
}

// Decode the base64-encoded audio payload into 16-bit PCM samples.
function base64ToInt16Array(base64String: string): Int16Array {
  const buffer = Buffer.from(base64String, "base64");
  const int16Array = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / Int16Array.BYTES_PER_ELEMENT);
  return int16Array;
}
import { GenerateContentResponse } from "@google/genai";

// Extract the inline audio data from the response, save it as a WAV file,
// and render an HTML <audio> player in the notebook.
function playAudio(response: GenerateContentResponse, filePath: string) {
  if (response.candidates?.[0]?.content) {
    const response_content = response.candidates[0].content;
    if (response_content.parts) {
      const response_blob = response_content.parts[0].inlineData;
      if (response_blob?.data) {
        const response_filepath = path.join("../assets/tts", filePath);
        saveAudioToFile(base64ToInt16Array(response_blob.data), response_filepath);
        tslab.display.html(`
                  <audio controls>
                      <source src="${response_filepath}" type="audio/wav">
                      Your browser does not support the audio element.
                  </audio>
              `);
      }
    }
  }
}

Generate a simple audio output

Let’s start with something simple:

const simple_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: ["Say 'hello, my name is Gemini!'"],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});

The generated output is in the response’s inlineData, and as you can see it’s indeed audio data.

playAudio(simple_response, `simple_response.wav`);
Audio saved to ../assets/tts/simple_response.wav
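
If you want to inspect the payload yourself, here is a minimal sketch (assuming the response contains at least one candidate with an inline-data part) that prints the MIME type of the returned audio:

const inline_part = simple_response.candidates?.[0]?.content?.parts?.[0]?.inlineData;
console.log(inline_part?.mimeType);
// The data is raw 16-bit PCM at 24 kHz, which is why `saveAudioToFile` wraps it in a WAV container.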
Note

Note that the model can only do TTS, so you should always tell it to “say”, “read”, or “TTS” something; otherwise it won’t do anything.

Control how the model speaks

There are 30 different built-in voices and 24 supported languages, which gives you plenty of combinations to try.

Choose a voice

Choose a voice among the 30 different ones. You can find their characteristics in the documentation.

const VOICE_ID = "Leda";
const custom_voice_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `Say "I am a very knowledgeable model, especially when using grounding", wait 5 seconds then say "Don't you think?".`,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: {
          voiceName: VOICE_ID,
        },
      },
    },
  },
});
playAudio(custom_voice_response, `custom_voice_response.wav`);
Audio saved to ../assets/tts/custom_voice_response.wav

Change the language

Just tell the model to speak in a certain language and it will. The documentation lists all the supported ones.

const custom_language_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Read this in French:

    Les chaussettes de l'archiduchesse sont-elles sèches ? Archi-sèches ?
    Un chasseur sachant chasser doit savoir chasser sans son chien.
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});
playAudio(custom_language_response, `custom_language_response.wav`);
Audio saved to ../assets/tts/custom_language_response.wav

Prompt the model to speak in certain ways

You can control style, tone, accent, and pace using natural language prompts, for example:

const custom_style_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Say in a spooky whisper:
    "By the pricking of my thumbs...
    Something wicked this way comes!"
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});
playAudio(custom_style_response, `custom_style_response.wav`);
Audio saved to ../assets/tts/custom_style_response.wav
const custom_pace_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Read this disclaimer in as fast a voice as possible while remaining intelligible:

    [The author] assumes no responsibility or liability for any errors or omissions in the content of this site.
    The information contained in this site is provided on an 'as is' basis with no guarantees of completeness, accuracy, usefulness or timeliness
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});
playAudio(custom_pace_response, `custom_pace_response.wav`);
Audio saved to ../assets/tts/custom_pace_response.wav

Multi-speakers

The TTS model can also read discussions between two speakers (like the NotebookLM podcast feature). You just need to tell it that there are two speakers:

const multi_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `
    Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:

    Speaker1: So... what's on the agenda today?
    Speaker2: You're never going to guess!
    `,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
  },
});
playAudio(multi_response, `multi_response.wav`);
Audio saved to ../assets/tts/multi_response.wav

You can also select the voices for each participant and pass their names to the model.

But first let’s generate a discussion between two scientists:

const multi_speaker_transcript = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-05-20",
  contents: [
    `
    Hi, please generate a short (like 100 words) transcript that reads like
    it was clipped from a podcast by excited herpetologists, Dr. Claire and
    her assistant, the young Aurora.
    `,
  ],
});
tslab.display.markdown(multi_speaker_transcript.text ?? "");

(Sound of distant jungle chirps fading, replaced by a slightly crackly podcast mic)

Dr. Claire: (Practically bouncing) I’m still buzzing, Aurora! Genuinely buzzing!

Aurora: (Squealing) Dr. Claire, my hands are still shaking! The Emerald Mossback! We actually saw one!

Dr. Claire: Not just saw it, Aurora! We documented its full iridescent display! The way it melded with the bromeliads… it was indistinguishable! The camouflage… it’s beyond anything in the textbooks!

Aurora: It was like watching magic! No wonder they were thought extinct for decades! Who could ever spot that? My heart was pounding!

Dr. Claire: Precisely! This changes our entire understanding of its crypsis. This episode is going to be legendary for ‘Reptile Revelations’!

Aurora: Totally! Best day ever, Dr. Claire! Best. Day. Ever!

const text = multi_speaker_transcript.text ?? "";
const transcript_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    `TTS the following conversation between a very excited Dr. Claire and her assistant, the young Aurora: ${text}`,
  ],
  config: {
    responseModalities: [google.Modality.AUDIO],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: [
          {
            speaker: "Dr. Claire",
            voiceConfig: {
              prebuiltVoiceConfig: {
                voiceName: "Aoede",
              },
            },
          },
          {
            speaker: "Aurora",
            voiceConfig: {
              prebuiltVoiceConfig: {
                voiceName: "Leda",
              },
            },
          },
        ],
      },
    },
  },
});
playAudio(transcript_response, `transcript_response.wav`);
Audio saved to ../assets/tts/transcript_response.wav

What’s next?

Now that you know how to generate multi-speaker conversations, here are other cool things to try:

  • Instead of speech, learn how to generate music using Lyria RealTime.
  • Discover how to generate images or videos.
  • Instead of generating music or audio, find out how Gemini can understand audio files.
  • Have a real-time conversation with Gemini using the Live API.