Gemini API: Context Caching Quickstart

This notebook introduces context caching with the Gemini API and provides examples of interacting with the Apollo 11 transcript using the JS SDK. For a more comprehensive look, check out the caching guide.

Setup

Install the Google GenAI SDK

Install the Google GenAI SDK from npm.

$ npm install @google/genai

Setup your API key

You can create your API key using Google AI Studio with a single click.

Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.

Here’s how to set it up in a .env file:

$ touch .env
$ echo "GEMINI_API_KEY=<YOUR_API_KEY>" >> .env
Tip

Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:

$ export GEMINI_API_KEY="<YOUR_API_KEY>"

Load the API key

To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.

$ npm install dotenv

Then, we can load the API key in our code:

const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({
  path: "../.env",
});

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note

In our particular case the .env file is one directory up from the notebook, hence we need ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether.

│
├── .env
└── quickstarts
    └── Caching.ipynb

Initialize SDK Client

With the new SDK, you now only need to initialize a client with your API key (or OAuth if using Vertex AI). The model is then set in each call.

const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });

Select a model

Now select the model you want to use in this guide, either by picking one from the list or writing its name down. Keep in mind that some models, like the 2.5 ones, are thinking models and thus take slightly more time to respond (cf. the thinking notebook for more details, and in particular to learn how to switch thinking off).

const tslab = require("tslab") as typeof import("tslab");

const MODEL_ID = "gemini-2.5-flash-preview-05-20";

Upload a file

A common pattern with the Gemini API is to ask a number of questions of the same document. Context caching is designed to assist with this case, and can be more efficient by avoiding the need to pass the same tokens through the model for each new request.

This example will be based on the transcript from the Apollo 11 mission.

Start by downloading that transcript.

const fs = require("fs") as typeof import("fs");
const path = require("path") as typeof import("path");

const TEXT_FILE_URL = "https://storage.googleapis.com/generativeai-downloads/data/a11.txt";

const downloadFile = async (url: string, filePath: string) => {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to download file: ${response.statusText}`);
  }
  // Make sure the target directory exists before writing the file.
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const blob = await response.blob();
  fs.writeFileSync(filePath, Buffer.from(await blob.arrayBuffer()));
};

const textFilePath = path.join("../assets", "a11.txt");
await downloadFile(TEXT_FILE_URL, textFilePath);

Now upload the transcript using the File API.

const text_file = await ai.files.upload({
  file: textFilePath,
  config: {
    displayName: "a11.txt",
    mimeType: "text/plain",
  },
});
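
If you want to double-check the upload, you can fetch the file's metadata back from the File API. This is a minimal sketch, assuming ai.files.get and fields such as sizeBytes and state are available in your SDK version; inspect the returned object to see exactly what yours exposes.

// Optional sanity check: retrieve the uploaded file's metadata from the File API.
const uploaded_file = await ai.files.get({ name: text_file.name ?? "" });
console.log(uploaded_file.name, uploaded_file.mimeType, uploaded_file.sizeBytes, uploaded_file.state);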

Cache the prompt

Next create a CachedContent object specifying the prompt you want to use, including the file and other fields you wish to cache. In this example the systemInstruction has been set, and the document was provided in the prompt.

Note that caches are model-specific. You cannot use a cache created for one model when calling a different one, as their tokenization might differ slightly.

const apollo_cache = await ai.caches.create({
  model: "gemini-2.5-flash-preview-05-20",
  config: {
    contents: [google.createPartFromUri(text_file.uri ?? "", text_file.mimeType ?? "text/plain")],
    systemInstruction: "You are an expert at analyzing transcripts.",
  },
});
console.log(JSON.stringify(apollo_cache, null, 2));
{
  "name": "cachedContents/or2aw5wd2llyqw6dwjebcx3jugv0dl85rir0zfsh",
  "displayName": "",
  "model": "models/gemini-2.5-flash-preview-05-20",
  "createTime": "2025-06-12T16:52:30.246524Z",
  "updateTime": "2025-06-12T16:52:30.246524Z",
  "expireTime": "2025-06-12T17:52:29.753060927Z",
  "usageMetadata": {
    "totalTokenCount": 323384
  }
}
tslab.display.markdown(
  `As you can see in the output, you just cached **${apollo_cache.usageMetadata?.totalTokenCount}** tokens.`
);

As you can see in the output, you just cached 323384 tokens.
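
Caches live in your project until they expire or are deleted, so you can list them to see what is currently cached and which model each cache is tied to. A minimal sketch, assuming ai.caches.list returns an async-iterable pager as in recent SDK versions:

// List existing caches; each entry records the model it was created for (caches are model-specific).
const cache_pager = await ai.caches.list();
for await (const cache of cache_pager) {
  console.log(cache.name, cache.model, cache.expireTime);
}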

Manage the cache expiry

Once you have a CachedContent object, you can update the expiry time to keep it alive while you need it.

const updated_apollo_cache = await ai.caches.update({
  name: apollo_cache.name ?? "",
  config: {
    ttl: "7200s",
  },
});
console.log(JSON.stringify(updated_apollo_cache, null, 2));
{
  "name": "cachedContents/or2aw5wd2llyqw6dwjebcx3jugv0dl85rir0zfsh",
  "displayName": "",
  "model": "models/gemini-2.5-flash-preview-05-20",
  "createTime": "2025-06-12T16:52:30.246524Z",
  "updateTime": "2025-06-12T16:52:30.825743Z",
  "expireTime": "2025-06-12T18:52:30.785032103Z",
  "usageMetadata": {
    "totalTokenCount": 323384
  }
}
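
Instead of a relative ttl, you can also pin the cache to an absolute expiry time. The sketch below assumes the update config accepts an expireTime timestamp (an RFC 3339 string), as described in the caching documentation:

// Alternatively, set an absolute expiry time (here: 2 hours from now) instead of a ttl.
const expire_in_two_hours = new Date(Date.now() + 2 * 60 * 60 * 1000).toISOString();
const expiry_updated_cache = await ai.caches.update({
  name: updated_apollo_cache.name ?? "",
  config: {
    expireTime: expire_in_two_hours,
  },
});
console.log(expiry_updated_cache.expireTime);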

Use the cache for generation

To use the cache for generation, pass the CachedContent object's name to the cachedContent field of the generation config. This lets the model reuse the cached content, avoiding the need to pass the same tokens through the model again.

const transcript_response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: ["Find a lighthearted moment from this transcript"],
  config: {
    cachedContent: updated_apollo_cache.name ?? "",
  },
});
tslab.display.markdown(transcript_response.text ?? "");

One lighthearted moment occurs on page 20, around timestamp 00 03 48 45:

LMP: LM looks to be in pretty fine shape from about all we can see from here.

CC: Okay. In reference to your question on this step 13 on the decal, I understand that you have used up the contents of the REPRESS O2 package and at that time, instead of being up to 5 psi, you were reading 4.4. Is that correct?

CMP: Okay. 4.4. Yes sir.

CC: Okay. And you want to know if you can go ahead and use additional oxygen to bring the command module up to 5.0 and continue the equalization? Over.

CMP: Yes. We think it’s within normal tolerances, Bruce. We just wanted to get your concurrence before we press on with this procedure.

CC: Roger, Apollo 11. Go ahead.

CMP: Okay. We’re pressing on with the procedure.

CC: And 11, Houston. We have a request for you. On the service module secondary propellant fuel pressurization valve: As a precautionary measure, we’d like you to momentarily cycle the four switches to the CLOSE position and then release. As you know, we have no TM or talkback on these valve positions, and it’s conceivable that one of them might also have been moved into a different position by the shock of separation. Over.

CMP: Okay. Good idea. That’s being done.

CC: Houston. Roger. Out.

The humor comes from the casual, almost apologetic tone of the controller (“Good idea. That’s being done.”) in response to a seemingly trivial request to cycle some switches, creating a lighthearted contrast to the highly technical nature of the conversation. The astronauts’ matter-of-fact response further enhances this effect.

You can inspect token usage through usageMetadata. Note that the cached prompt tokens are included in promptTokenCount (and therefore in totalTokenCount), and are also reported separately as cachedContentTokenCount.

console.log(JSON.stringify(transcript_response.usageMetadata, null, 2));
{
  "promptTokenCount": 323392,
  "candidatesTokenCount": 423,
  "totalTokenCount": 323815,
  "cachedContentTokenCount": 323384,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 323392
    }
  ],
  "cacheTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 323384
    }
  ],
  "candidatesTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 423
    }
  ]
}
tslab.display.markdown(`
  As you can see in the \`usageMetadata\`, the token usage is split between:

  *  ${transcript_response.usageMetadata?.cachedContentTokenCount} tokens for the cache,
  *  ${transcript_response.usageMetadata?.promptTokenCount} tokens for the input (including the cache, so ${(transcript_response.usageMetadata?.promptTokenCount ?? 0) - (transcript_response.usageMetadata?.cachedContentTokenCount ?? 0)} for the actual prompt),
  *  ${transcript_response.usageMetadata?.candidatesTokenCount} tokens for the output,
  *  ${transcript_response.usageMetadata?.totalTokenCount} tokens in total.
`);

As you can see in the usageMetadata, the token usage is split between:

  • 323384 tokens for the cache,
  • 323392 tokens for the input (including the cache, so 8 for the actual prompt),
  • 423 tokens for the output,
  • 323815 tokens in total.

You can ask new questions of the model, and the cache is reused.

const chat = ai.chats.create({
  model: MODEL_ID,
  config: {
    cachedContent: updated_apollo_cache.name ?? "",
  },
});
const chat_response_1 = await chat.sendMessage({
  message: "Give me a quote from the most important part of the transcript.",
});
tslab.display.markdown(chat_response_1.text ?? "");

The most important part of the transcript is the moment of landing. The quote is:

“Houston, Tranquility Base here. The Eagle has landed.”

const chat_response_2 = await chat.sendMessage({
  message: "What was recounted after that?",
  // The cache is already set on the chat, so repeating it per message is optional.
  config: {
    cachedContent: updated_apollo_cache.name ?? "",
  },
});
tslab.display.markdown(chat_response_2.text ?? "");

Immediately following the announcement “Houston, Tranquility Base here. The Eagle has landed.”, the following events and communications are recounted in the transcript:

  • Confirmation from Houston: Mission Control’s response was, “Roger, Tranquility. We copy you on the ground. You got a bunch of guys about to turn blue. We’re breathing again. Thanks a lot.” This expresses the immense relief and joy at Mission Control.

  • Events in the Lunar Module: The transcript then details actions taken by the astronauts inside the Lunar Module, including:

    • Armstrong and Aldrin confirming the MASTER ARM was ON.
    • Aldrin noting a “very smooth touchdown.”
    • Aldrin reporting the venting of oxidizer.
    • Confirmation from Houston that Eagle was to STAY for T1 (the first planned post-landing activity).
    • Further actions taken to secure the Lunar Module after landing.
  • Communications with Columbia: Michael Collins in the Command Module Columbia is heard confirming that he heard the landing. There’s a brief exchange with Houston and Collins about Columbia’s status.

In short, the immediate aftermath of the landing announcement focuses on confirmation of the successful landing, actions to secure the Lunar Module, and initial communications with the Command Module and Mission Control, highlighting the relief and the transition to the next phase of the mission.

console.log(JSON.stringify(chat_response_2.usageMetadata, null, 2));
{
  "promptTokenCount": 323439,
  "candidatesTokenCount": 287,
  "totalTokenCount": 323726,
  "cachedContentTokenCount": 323384,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 323439
    }
  ],
  "cacheTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 323384
    }
  ],
  "candidatesTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 287
    }
  ]
}
tslab.display.markdown(`
  As you can see in the \`usageMetadata\`, the token usage is split between:

  *  ${chat_response_2.usageMetadata?.cachedContentTokenCount} tokens for the cache,
  *  ${chat_response_2.usageMetadata?.promptTokenCount} tokens for the input (including the cache, so ${(chat_response_2.usageMetadata?.promptTokenCount ?? 0) - (chat_response_2.usageMetadata?.cachedContentTokenCount ?? 0)} for the actual prompt),
  *  ${chat_response_2.usageMetadata?.candidatesTokenCount} tokens for the output,
  *  ${chat_response_2.usageMetadata?.totalTokenCount} tokens in total.
`);

As you can see in the usageMetadata, the token usage is split between:

  • 323384 tokens for the cache,
  • 323439 tokens for the input (including the cache, so 55 for the actual prompt),
  • 287 tokens for the output,
  • 323726 tokens in total.

Since cached tokens are cheaper than normal ones, this prompt was much cheaper than it would have been without caching. Check the pricing here for the up-to-date discount on cached tokens.
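
As a rough illustration, you can compute what share of the input tokens for a given response was served from the cache, using only the usageMetadata fields shown above (a small hypothetical helper, not part of the SDK):

// Hypothetical helper: what fraction of the input tokens came from the cache?
const cachedShare = (usage?: { promptTokenCount?: number; cachedContentTokenCount?: number }) => {
  const prompt = usage?.promptTokenCount ?? 0;
  const cached = usage?.cachedContentTokenCount ?? 0;
  return prompt > 0 ? cached / prompt : 0;
};
console.log(`${(cachedShare(chat_response_2.usageMetadata) * 100).toFixed(2)}% of input tokens were read from the cache.`);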

Delete the cache

The cache has a small recurring storage cost (cf. pricing), so by default it is only kept for an hour. In this case you even extended its lifetime to 2 hours (using "ttl").

Still, if you don't need your cache anymore, it is good practice to delete it proactively.

console.log(updated_apollo_cache.name ?? "");
await ai.caches.delete({
  name: updated_apollo_cache.name ?? "",
});
tslab.display.markdown("Cache deleted successfully.");
cachedContents/or2aw5wd2llyqw6dwjebcx3jugv0dl85rir0zfsh

Cache deleted successfully.

Next Steps

Useful API references:

If you want to know more about the caching API, you can check the full API specifications and the caching documentation.

Continue your discovery of the Gemini API

Check the File API notebook to learn more about that API. The vision capabilities of the Gemini API are a good reason to use the File API together with caching. The Gemini API also has configurable safety settings that you might need to customize when dealing with large files.