Video understanding with Gemini

Gemini has been a multimodal model from the beginning, capable of analyzing all sorts of media using its long context window.

Gemini models bring video analysis to a whole new level as illustrated in this video.

This notebook will show you how to easily use Gemini to perform the same kind of video analysis. You can also check the live demo and try it on your own videos on AI Studio.

Setup

Install the Google GenAI SDK

Install the Google GenAI SDK from npm.

$ npm install @google/genai

Setup your API key

You can create your API key using Google AI Studio with a single click.

Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.

Here’s how to set it up in a .env file:

$ touch .env
$ echo "GEMINI_API_KEY=<YOUR_API_KEY>" >> .env
Tip

Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:

$ export GEMINI_API_KEY="<YOUR_API_KEY>"

Load the API key

To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.

$ npm install dotenv

Then, we can load the API key in our code:

const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({
  path: "../.env",
});

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note

In our particular case the .env file is one directory up from the notebook, hence we need to use ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether.

│
├── .env
└── quickstarts
    └── Video_understanding.ipynb

Initialize SDK Client

With the new SDK, you only need to initialize a client with your API key (or OAuth if using Vertex AI). The model is now specified in each call.

const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });

Select a model

Now select the model you want to use in this guide, either by selecting one in the list or writing it down. Keep in mind that some models, like the 2.5 ones, are thinking models and thus take slightly more time to respond (cf. the thinking notebook for more details, and in particular to learn how to switch thinking off).

Video understanding works best with Gemini 2.5 models. You can also select older models to compare their behavior, but it is recommended to use at least the 2.0 ones.
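
If latency matters, the 2.5 models let you reduce or disable thinking via the request config. Here is a minimal sketch, assuming the thinkingConfig / thinkingBudget options covered in the thinking notebook:

const quick_response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-05-20",
  contents: "In one sentence, what is video understanding?",
  config: {
    // Illustrative: a thinking budget of 0 switches thinking off to favor latency.
    thinkingConfig: { thinkingBudget: 0 },
  },
});
console.log(quick_response.text);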

const tslab = require("tslab") as typeof import("tslab");

const MODEL_ID = "gemini-2.5-flash-preview-05-20";

Get sample videos

You will start with uploaded videos, as it’s the more common use case, but you will also see later that you can use YouTube videos as well.

const fs = require("fs") as typeof import("fs");
const path = require("path") as typeof import("path");

const downloadFile = async (url: string, filePath: string) => {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to download file: ${response.statusText}`);
  }
  const buffer = await response.blob();
  const bufferData = Buffer.from(await buffer.arrayBuffer());
  fs.writeFileSync(filePath, bufferData);
};

const videoDir = path.join("../assets", "video_understanding");
fs.mkdirSync(videoDir, { recursive: true });

const videoMapping = {
  "Pottery.mp4": "https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4",
  "Trailcam.mp4": "https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4",
  "Post_its.mp4": "https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4",
  "User_study.mp4": "https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4",
};
for (const [fileName, url] of Object.entries(videoMapping)) {
  const filePath = path.join(videoDir, fileName);
  if (!fs.existsSync(filePath)) {
    console.log(`Downloading ${fileName}...`);
    await downloadFile(url, filePath);
  } else {
    console.log(`${fileName} already exists, skipping download.`);
  }
}
Downloading Pottery.mp4...
Downloading Trailcam.mp4...
Downloading Post_its.mp4...
Downloading User_study.mp4...

Upload the videos

Upload all the videos using the File API. You can find more details about how to use it in the Get Started notebook.

This can take a couple of minutes as the videos will need to be processed and tokenized.

import { File, FileState } from "@google/genai";

// Upload a file, then poll the File API every 5 seconds until it leaves the
// PROCESSING state, so the video is ready before it is referenced in a prompt.
async function deferredFileUpload(filePath: string, config: { displayName: string }): Promise<File> {
  const file = await ai.files.upload({
    file: filePath,
    config,
  });
  let getFile = await ai.files.get({ name: file.name ?? "" });
  while (getFile.state === FileState.PROCESSING) {
    getFile = await ai.files.get({ name: file.name ?? "" });
    console.log(`current file status (${getFile.displayName}): ${getFile.state ?? "unknown"}`);
    console.log("File is still processing, retrying in 5 seconds");

    await new Promise((resolve) => {
      setTimeout(resolve, 5000);
    });
  }
  if (getFile.state === FileState.FAILED) {
    throw new Error("File processing failed.");
  }
  return file;
}
const pottery_video = await deferredFileUpload(path.join(videoDir, "Pottery.mp4"), { displayName: "Pottery Video" });
const trailcam_video = await deferredFileUpload(path.join(videoDir, "Trailcam.mp4"), { displayName: "Trailcam Video" });
const post_its_video = await deferredFileUpload(path.join(videoDir, "Post_its.mp4"), { displayName: "Post-its Video" });
const user_study_video = await deferredFileUpload(path.join(videoDir, "User_study.mp4"), {
  displayName: "User Study Video",
});
current file status (Pottery Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (Post-its Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (Pottery Video): ACTIVE
File is still processing, retrying in 5 seconds
current file status (Post-its Video): ACTIVE
File is still processing, retrying in 5 seconds
current file status (User Study Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (Trailcam Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (User Study Video): ACTIVE
File is still processing, retrying in 5 seconds
current file status (Trailcam Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (Trailcam Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (Trailcam Video): ACTIVE
File is still processing, retrying in 5 seconds
current file status (Post-its Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (Post-its Video): ACTIVE
File is still processing, retrying in 5 seconds
current file status (User Study Video): PROCESSING
File is still processing, retrying in 5 seconds
current file status (User Study Video): ACTIVE
File is still processing, retrying in 5 seconds

Search within videos

First, try using the model to search within your videos and describe all the animal sightings in the trailcam video.

const trailcam_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video.",
    google.createPartFromUri(trailcam_video.uri ?? "", trailcam_video.mimeType ?? ""),
  ],
});
tslab.display.markdown(trailcam_video_analysis.text ?? "");
[
 {
  "timecode": "00:00",
  "caption": "A fox's head comes into frame, looking around, then walks off-screen. Another fox quickly enters the frame from the right, walks to the left, and then disappears. A third fox enters from the right, walks to the left, and then disappears. The second fox comes into frame from the bottom right, walks forward, then stops to look at the other foxes. The first fox walks on the top of the rocks, then disappears off-screen. The second and third foxes walk into the middle of the frame and then disappear off-screen. (audio: sounds of birds, animals, and clicks)"
 },
 {
  "timecode": "00:17",
  "caption": "A mountain lion is walking slowly, sniffing the ground. The mountain lion then stops and looks around, then continues walking, sniffing the ground and then disappears off-screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "00:35",
  "caption": "Two foxes are walking in the forest at night, both looking at something on the ground. The fox on the left starts to roll around and appears to be playing with something on the ground, then stands up. The other fox starts to approach the fox on the left, then they both start chasing each other around. The fox on the right chases the fox on the left off-screen, then the fox on the left chases the fox on the right off-screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "00:50",
  "caption": "A camera flashes, revealing two foxes, one on the right and one on the left. The fox on the left is lying on the ground, then the fox on the right quickly runs towards it and off-screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "01:05",
  "caption": "A mountain lion's back is to the camera, looking around. It then jumps and disappears off-screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "01:18",
  "caption": "Two mountain lions are walking in the forest. The mountain lion in the foreground turns its head to the camera, then walks towards the camera and off-screen. The mountain lion in the background walks to the right and then disappears off-screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "01:29",
  "caption": "A bobcat is looking at something on the ground in the forest, then quickly looks up and around, then back to the ground. The bobcat then lies down, rolls around, and then stands back up and looks to the left. The bobcat then walks away. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "01:51",
  "caption": "A bear walks from the right side of the screen to the left. (audio: sounds of birds and clicks)"
 },
 {
  "timecode": "01:57",
  "caption": "A mountain lion runs from the left to the right side of the screen. (audio: sounds of birds and clicks)"
 },
 {
  "timecode": "02:05",
  "caption": "A black bear walks from the left to the right side of the screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "02:12",
  "caption": "A cub walks into the frame from the right side of the screen, followed by its mother. They both begin to sniff the ground and walk to the left. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "02:23",
  "caption": "A fox is walking along a rocky cliff at night. The fox stops to look down at the city lights, then starts to eat something off the ground. The fox then looks up at the city lights. (audio: sounds of birds and clicks)"
 },
 {
  "timecode": "02:35",
  "caption": "A bear is walking along a rocky cliff at night, then walks off-screen. (audio: sounds of birds and clicks)"
 },
 {
  "timecode": "02:42",
  "caption": "A mountain lion is walking along a rocky cliff at night, then walks off-screen. (audio: sounds of birds and clicks)"
 },
 {
  "timecode": "02:47",
  "caption": "A mountain lion is walking along a rocky cliff at night, then walks off-screen. (audio: sounds of birds and clicks)"
 },
 {
  "timecode": "02:52",
  "caption": "A mountain lion is walking in the forest at night, then stops to look at something under a tree. It then walks further under the tree and disappears off-screen. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "03:05",
  "caption": "A black bear is walking in the forest, then stops and starts to bite at its front leg. The bear walks forward. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "03:22",
  "caption": "A bear cub walks into the frame, starts to bite at its leg, then another bear cub walks in behind it. They both stop and look around, then start to walk. The second bear cub sits down, scratches its back, then the first bear cub walks towards it and then off-screen. The second bear cub then follows the first one. (audio: sounds of animals and clicks)"
 },
 {
  "timecode": "04:22",
  "caption": "A bobcat walks from the right side of the screen and into the center. It then looks at the camera, then away. It then walks off-screen. (audio: sounds of clicks)"
 },
 {
  "timecode": "04:30",
  "caption": "A bobcat walks from the left side of the screen and into the center. It then looks at the camera, then away. It then walks off-screen. (audio: sounds of clicks)"
 },
 {
  "timecode": "04:50",
  "caption": "A bobcat walks from the left side of the screen and into the center. It then looks away, then walks off-screen. (audio: sounds of clicks)"
 },
 {
  "timecode": "04:57",
  "caption": "A mountain lion is walking from the left and into the center. It then looks at the camera, then walks from the left to the right. (audio: sounds of animals and clicks)"
 }
]

The prompt used here is quite a generic one, but you can get even better results if you customize it to your needs (like asking specifically for foxes), as in the sketch below.
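
For instance, a more targeted prompt could look like the following sketch (it reuses the trailcam_video handle from above; the fox_sightings variable name is purely illustrative):

// Only report fox sightings, with a timecode and a short description for each.
const fox_sightings = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "List every scene in which a fox appears. For each sighting, return an object with the timecode and a one-sentence description of what the fox is doing.",
    google.createPartFromUri(trailcam_video.uri ?? "", trailcam_video.mimeType ?? ""),
  ],
});
tslab.display.markdown(fox_sightings.text ?? "");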

The live demo on AI Studio shows how you can postprocess this output to jump directly to the specific part of the video by clicking on the timecodes. If you are interested, you can check the code of that demo on GitHub.
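
If you want to build something similar, a small helper can turn the "MM:SS" timecodes from the output into seconds for seeking a video player. This is a hypothetical sketch, not code from the demo:

// Convert an "MM:SS" timecode into seconds, e.g. to set videoElement.currentTime
// when a caption is clicked.
function timecodeToSeconds(timecode: string): number {
  const [minutes, seconds] = timecode.split(":").map(Number);
  return minutes * 60 + seconds;
}
console.log(timecodeToSeconds("02:35")); // 155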

Extract and organize text

Gemini models can also read what’s in the video and extract it in an organized way. You can even use Gemini’s reasoning capabilities to generate new ideas for you.

const post_its_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?",
    google.createPartFromUri(post_its_video.uri ?? "", post_its_video.mimeType ?? ""),
  ],
});
tslab.display.markdown(post_its_video_analysis.text ?? "");

Okay, I’ve transcribed and organized the sticky notes into a table based on apparent themes.

Brainstorm: Project Name

| Astronomy/Space | Mythology/Legend | Science/Mathematics | Other |
| --- | --- | --- | --- |
| Andromeda’s Reach | Aether | Bayes Theorem | Convergence |
| Canis Major | Athena | Chaos Field | Orion’s Sword |
| Celestial Drift | Athena’s Eye | Chaos Theory | |
| Centaurus | Cerberus | Equilibrium | |
| Comets Tail | Chimera Dream | Euler’s Path | |
| Delphinus | Echo | Fractal | |
| Draco | Hera | Golden Ratio | |
| Galactic Core | Medusa | Infinity Loop | |
| Leo Minor | Odin | Riemann’s Hypothesis | |
| Lunar Eclipse | Pandora’s Box | Stokes Theorem | |
| Lyra | Perseus Shield | Symmetry | |
| Lynx | Phoenix | Taylor Series | |
| Orion’s Belt | Prometheus Rising | Vector | |
| Sagitta | Titan | | |
| Serpens | Zephyr | | |
| Stellar Nexus | | | |
| Supernova Echo | | | |

A Few More Ideas (following similar themes):

  1. Astronomy/Space:
    • Nebula Prime: Combines a celestial body with a sense of origin or importance.
    • Event Horizon: Suggests a point of no return, a boundary, or a pivotal moment.
    • Cosmic Anomaly: Implies uniqueness, discovery, or something out of the ordinary.
  2. Mythology/Legend:
    • Ares Ascent: Implies power, rising to a challenge, or a new beginning.
    • Midas Core: Suggests a central element that transforms or creates value.
    • Valhalla Gateway: Evokes a grand, important entry point or destination.
  3. Science/Mathematics Concepts:
    • Quantum Leap: Denotes a significant, sudden advancement or change.
    • Paradox Engine: Suggests a powerful system that resolves complex problems or defies expectations.
    • Fibonacci Sequence: Implies growth, natural patterns, and interconnectedness.
  4. General/Abstract:
    • Apex Protocol: Suggests a highest standard, a definitive procedure, or a peak achievement.
    • Vanguard Initiative: Implies leading the way, pioneering, or being at the forefront.
    • Syntropy Engine: (Syntropy is the opposite of entropy, tending towards order and organization) – Suggests bringing things together, creating order, or a powerful organizing force.

Structure information

Gemini models are not only able to read text but can also reason about and structure information on real-world objects, like in this video of a display of ceramics with handwritten prices and notes.

const pottery_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Give me a detailed table of my items and notes",
    google.createPartFromUri(pottery_video.uri ?? "", pottery_video.mimeType ?? ""),
  ],
  config: {
    systemInstruction: "Don't forget to escape the dollar signs",
  },
});
tslab.display.markdown(pottery_video_analysis.text ?? "");

Here is a detailed table of the items and notes from your image:

| Item Type | Quantity | Description / Notes | Dimensions | Price |
| --- | --- | --- | --- | --- |
| Tumblers | 7 | Glaze: #5 Artichoke double dip | Approximately 4”h x 3”d | $20 each |
| Small Bowls | 2 | Glaze: Brown speckled with dark interior accents | 3.5”h x 6.5”d | $35 each |
| Medium Bowls | 2 | Glaze: Brown speckled with dark interior accents | 4”h x 7”d | $40 each |
| Glaze Test Tile | 1 | Glaze: #5 Artichoke double dip | Small, flat, irregular shape | N/A (sample) |
| Glaze Test Tile | 1 | Glaze: #6 Gemini double dip (labeled 6b on tile) | Small, flat, rectangular shape | N/A (sample) |

Firing Note: SLOW COOL

Summary of Items:

  • There are 7 tumblers, priced at $20 each, featuring an “#5 Artichoke double dip” glaze and measuring approximately 4 inches high by 3 inches in diameter.
  • There are 2 small bowls, priced at $35 each, measuring 3.5 inches high by 6.5 inches in diameter, with a brown speckled glaze.
  • There are 2 medium bowls, priced at $40 each, measuring 4 inches high by 7 inches in diameter, also with a brown speckled glaze.
  • Two glaze test tiles are present: one for the “#5 Artichoke double dip” glaze and another for a “#6 Gemini double dip” glaze, noting “SLOW COOL” as a firing instruction for the latter.

Analyze screen recordings for key moments

You can also use the model to analyze screen recordings. Let’s say you’re doing user studies on how people use your product, so you end up with lots of screen recordings, like this one, that you have to manually comb through. With just one prompt, the model can describe all the actions in your video.

const user_study_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes.",
    google.createPartFromUri(user_study_video.uri ?? "", user_study_video.mimeType ?? ""),
  ],
});
tslab.display.markdown(user_study_video_analysis.text ?? "");

This video demonstrates the core functionalities of the “My Garden App,” showcasing a plant shopping interface where users can browse various plants with their prices and descriptions. [0:00-0:09] Users interact with the app by “Liking” plants, which changes the button color, and “Adding to Cart,” which provides a quick “Added!” confirmation. [0:09-0:25] The application effectively tracks these actions, as seen when navigating to the “Shopping Cart” tab which lists added items like Fern, Cactus, and Hibiscus with a running total. [0:30-0:33] Finally, the “Profile” section provides a summary of user activity, displaying the total number of liked plants and items currently in the cart. [0:33-0:35]

Analyze YouTube videos

On top of using your own videos you can also ask Gemini to fetch a video from YouTube and analyze it. Here’s an example using the keynote from Google I/O 2023. Guess what the main theme was?

const youtube_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    'Find all the instances where Sundar says "AI". Provide timestamps and broader context for each instance.',
    google.createPartFromUri("https://www.youtube.com/watch?v=ixRanV-rdAQ", "video/x-youtube"),
  ],
});
tslab.display.markdown(youtube_video_analysis.text ?? "");

Here are all the instances where Sundar Pichai says “AI”, along with their timestamps and broader context:

  1. 0:29 “As you may have heard, AI is having a very busy year.”
    • Context: Sundar is opening his keynote speech at Google I/O, remarking on the current surge in AI activity.
  2. 0:39 “Seven years into our journey as an AI first company, we are at an exciting inflection point.”
    • Context: He’s emphasizing Google’s long-term commitment to AI, stating that it’s been foundational to their operations for seven years and they’re now at a crucial stage.
  3. 0:45 “We have an opportunity to make AI even more helpful for people, for businesses, for communities, for everyone.”
    • Context: He’s highlighting the broad potential of AI to benefit various aspects of society and user groups.
  4. 0:54 “We’ve been applying AI to make our products radically more helpful for a while.”
    • Context: He mentions Google’s history of integrating AI into its products for enhanced utility.
  5. 0:59 “With generative AI, we are taking the next step.”
    • Context: He’s introducing the focus on generative AI and how it represents a significant leap forward for Google’s product development.
  6. 1:17 “Let me start with few examples of how generative AI is helping to evolve our products, starting with Gmail.”
    • Context: He transitions to showcasing specific product examples, beginning with Gmail, that demonstrate the evolution driven by generative AI.
  7. 1:41 “Smart Compose led to more advanced writing features powered by AI.”
    • Context: He’s explaining the progression of AI-powered features in Workspace, from Smart Reply to Smart Compose, and their widespread usage.
  8. 3:03 “Since the early days of Street View, AI has stitched together billions of panoramic images, so people can explore the world from their device.”
    • Context: He’s talking about Google Maps and how AI has been instrumental in creating immersive experiences like Street View.
  9. 3:14 “At last year’s I/O, we introduced Immersive View, which uses AI to create a high fidelity representation of a place, so you can experience it before you visit.”
    • Context: He describes how AI is used in Immersive View to provide realistic virtual tours of locations in Maps.
  10. 5:08 “Another product made better by AI is Google Photos.”
    • Context: He introduces Google Photos as another example of an AI-enhanced product.
  11. 5:16 “It was one of our first AI native products.”
    • Context: He specifies Google Photos’ early integration of AI from its inception.
  12. 5:40 “AI advancements give us more powerful ways to do this.”
    • Context: He’s talking about how new AI advancements are enabling more sophisticated photo editing capabilities in Google Photos.
  13. 5:48 “Magic Eraser, launched first on Pixel, uses AI-powered computational photography to remove unwanted distractions.”
    • Context: He explains the technology behind Magic Eraser in Google Photos, highlighting its AI foundation.
  14. 7:43 “These are just a few examples of how AI can help you in moments that matter.”
    • Context: He summarizes the product demonstrations, emphasizing the practical and meaningful applications of AI.
  15. 7:49 “And there is so much more we can do to deliver the full potential of AI across the products you know and love.”
    • Context: He reaffirms Google’s ongoing commitment to expanding AI capabilities across its entire product ecosystem.
  16. 8:24 “Making AI helpful for everyone is the most profound way we will advance our mission.”
    • Context: He states Google’s overarching goal for AI: to make it universally beneficial.
  17. 8:53 “And finally, by building and deploying AI responsibly, so that everyone can benefit equally.”
    • Context: He outlines the four key ways Google is making AI helpful for everyone, with responsible deployment being the last point.
  18. 9:03 “Our ability to make AI helpful for everyone relies on continuously advancing our foundation models.”
    • Context: He connects the goal of helpful AI to the foundational research and development of AI models.
  19. 10:25 “It’s also trained on multilingual text spanning over 100 languages, so it understands and generates nuanced results. Combined with powerful coding capabilities, PaLM 2 can also help developers collaborating around the world.”
    • Context: He’s explaining the advanced training and capabilities of PaLM 2, Google’s latest large language model, which is an AI.
  20. 11:27 “It uses AI to better detect malicious scripts and can help security experts understand and resolve threats.”
    • Context: He’s detailing how Sec-PaLM, an AI model, is being used for cybersecurity.
  21. 12:47 “PaLM 2 is the latest step in our decade-long journey to bring AI in responsible ways to billions of people.”
    • Context: He reiterates the long-term vision of making AI accessible and beneficial to a global audience.
  22. 12:53 “It builds on progress made by two world-class teams, the Brain Team and DeepMind. Looking back at the defining AI breakthroughs over the last decade, these teams have contributed to a significant number of them.”
    • Context: He acknowledges the history of Google’s AI research and the key teams that have driven breakthroughs.
  23. 13:10 “All this helps set the stage for the inflection point we are at today. We recently brought these two teams together into a single unit, Google DeepMind, using the computational resources of Google, they are focused on building more capable systems safely and responsibly. This includes our next generation foundation model, Gemini, which is still in training. Gemini was created from the ground up to be multimodal, highly efficient at tool and API integrations, and built to enable future innovations like memory and planning. While still early, we are already seeing impressive multimodal capabilities not seen in prior models. Once fine-tuned and rigorously tested for safety, Gemini will be available at various sizes and capabilities, just like PaLM 2. As we invest in more advanced models, we are also deeply investing in AI responsibility.”
    • Context: Sundar discusses the merger of Google Brain and DeepMind into Google DeepMind, their focus on building more capable AI systems, and the ongoing development of Gemini. He emphasizes the commitment to responsible AI development.
  24. 15:11 “James will talk about our responsible approach to AI later.”
    • Context: He refers to a segment by another speaker who will delve deeper into Google’s approach to AI ethics and responsibility.
  25. 15:18 “As models get better and more capable, one of the most exciting opportunities is making them available for people to engage with directly. That’s the opportunity we have with Bard, our experiment for conversational AI.”
    • Context: He speaks about the increasing capabilities of AI models and the opportunity to make them directly available to users through interfaces like Bard.

Customizing video preprocessing

The Gemini API allows you to define preprocessing steps that enhance the model’s ability to understand and extract information from your videos.

You can use clipping intervals (time offsets that focus the model on specific parts of the video) and a custom FPS (which defines how many frames per second are sampled when analyzing the video).

For more details about those features, you can take a look at the customizing video preprocessing section of the Gemini API documentation.

Analyze specific parts of videos using clipping intervals

Sometimes you want to look at specific parts of your videos. You can define time offsets in your request, pointing the model to the specific video interval you are most interested in.

Note

The VideoMetadata you provide must represent the time offsets in seconds.

In this example, you are using this video from the Google I/O 2024 keynote and asking the model to consider specifically the time offsets between 1min40s and 5min.

import { VideoMetadata } from "@google/genai";

const video_metadata: VideoMetadata = {
  startOffset: "100s",
  endOffset: "300s",
};

const clipping_intervals_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    {
      ...google.createPartFromUri("https://www.youtube.com/watch?v=WsEQjeZoEng", "video/x-youtube"),
      videoMetadata: video_metadata,
    },
    "Please summarize the video in 3 sentences",
  ],
});
tslab.display.markdown(clipping_intervals_video_analysis.text ?? "");

Google I/O 2024 showcased the deep integration of its Gemini AI across its vast product ecosystem. Major announcements included the expanded multimodal capabilities and significantly larger context windows of Gemini 1.5 Pro, the real-time AI agent “Project Astra,” and the efficient Gemini 1.5 Flash. These advancements aim to deliver more intuitive and powerful AI experiences across Google Workspace, Search, Android, and creative tools, alongside a commitment to responsible AI development.

import { VideoMetadata } from "@google/genai";

const video_metadata_2: VideoMetadata = {
  startOffset: "60s",
  endOffset: "120s",
};

const clipping_intervals_video_analysis_2 = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    {
      ...google.createPartFromUri(trailcam_video.uri ?? "", trailcam_video.mimeType ?? ""),
      videoMetadata: video_metadata_2,
    },
    "Summarize this video in few short bullets with timestamps.",
  ],
});
tslab.display.markdown(clipping_intervals_video_analysis_2.text ?? "");

Here’s a summary of the video:

  • 0:00 - 0:04: A bobcat quickly crosses the frame, its eyes reflecting the camera’s light.
  • 0:05 - 0:09: Another bobcat quickly crosses the frame.
  • 0:10 - 0:13: A third bobcat quickly crosses the frame.
  • 0:14 - 0:17: A fourth bobcat quickly crosses the frame.
  • 0:20 - 0:29: Two large mountain lions appear, walking past the camera.
  • 0:35 - 0:39: A bobcat quickly crosses the frame.
  • 0:40 - 0:44: Another bobcat quickly crosses the frame.
  • 0:45 - 0:49: Another bobcat quickly crosses the frame.
  • 0:50 - 0:54: A mountain lion quickly passes by.
  • 1:00 - 1:03: A bobcat approaches the camera, its eyes glowing, and then quickly retreats.
  • 1:04 - 1:16: A mountain lion walks into the frame, pauses, and then continues on its way.
  • 1:17 - 1:28: Two mountain lions appear, one walking past the camera closer to the foreground, followed by the other.
  • 1:29 - 1:51: A bobcat pauses in the frame, sniffing the ground and looking around before exiting.
  • 1:51 - 1:56: A large black bear walks slowly past the camera.
  • 1:57 - 2:02: A mountain lion walks quickly past the camera.

Customize the number of video frames per second (FPS) analyzed

By default, the Gemini API samples 1 (one) frame per second (FPS) to analyze your videos. This may be more than you need for videos with little activity (like a lecture), while fast-changing visuals call for a higher FPS to preserve more detail.
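
Conversely, for a slow-changing recording you could sample below the default to save tokens. A minimal sketch (the 0.5 value and the reuse of the user study video are purely illustrative):

const low_fps_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    {
      ...google.createPartFromUri(user_study_video.uri ?? "", user_study_video.mimeType ?? ""),
      // Illustrative: half a frame per second is plenty for a mostly static screen recording.
      videoMetadata: { fps: 0.5 },
    },
    "Describe the main steps the user takes in this recording.",
  ],
});
tslab.display.markdown(low_fps_analysis.text ?? "");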

In this scenario, you are analyzing a specific interval of a NASCAR pit stop while also capturing a higher number of frames per second (in this case, 24 FPS).

import { VideoMetadata } from "@google/genai";

const video_metadata_3: VideoMetadata = {
  startOffset: "15s",
  endOffset: "35s",
  fps: 24,
};

const fps_video_analysis = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    {
      ...google.createPartFromUri("https://www.youtube.com/watch?v=McN0-DpyHzE", "video/x-youtube"),
      videoMetadata: video_metadata_3,
    },
    "How many tires where changed? Front tires or rear tires?",
  ],
});
tslab.display.markdown(fps_video_analysis.text ?? "");

The pit crew changed all four tires on the car, which includes both the front and rear tires.

Once again, the live demo on AI Studio shows an example of how to postprocess this output. Check the code of that demo for more details.

Next Steps

Try with your own videos using AI Studio’s live demo or play with the examples from this notebook (in case you haven’t seen them, there are other prompts you can try in the dropdowns).

For more examples of the Gemini capabilities, check the other guides from the Cookbook. You’ll learn how to use the Live API, juggle with multiple tools, or use Gemini 2.0 spatial understanding abilities.

The examples folder from the cookbook is also full of nice code samples illustrating creative ways to use Gemini’s multimodal capabilities and long context.