This notebook introduces object detection and spatial understanding with the Gemini API, in the same spirit as the Spatial understanding example from AI Studio and the Building with Gemini 2.0: Spatial understanding video.
You’ll learn how to use Gemini the same way as in the demo and perform object detection like this:
[Image: cupcakes with detected bounding boxes overlaid]
There are many object detection examples, including:
- simply overlaying information
- searching within an image
- translating and understanding things in multiple languages
- using Gemini's thinking abilities
Note
There’s no “magical prompt”. Feel free to experiment with different ones. You can use the dropdown to see different samples, but you can also write your own prompts. Also, you can try uploading your own images.
You can create your API key using Google AI Studio with a single click.
Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.
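For example, a .env file (here assumed to sit one directory above the notebook, as used later in this notebook) contains a single line with your key:
GEMINI_API_KEY="<YOUR_API_KEY>"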
Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:
$ export GEMINI_API_KEY="<YOUR_API_KEY>"
Load the API key
To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.
$ npm install dotenv
Then, we can load the API key in our code:
const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({ path: "../.env" });

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note
In our particular case the .env file is one directory up from the notebook, hence we need to use ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether.
With the new SDK, you only need to initialize a client with your API key (or OAuth if using Vertex AI). The model is now set in each call.
const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });
Select a model
Spatial understanding works best with the Gemini 2.0 Flash model. It's even better with the 2.5 models, like gemini-2.5-pro-preview-05-20, although slightly slower since it's a thinking model.
Some features, like segmentation, only work with 2.5 models.
You can try the older models, but the results may be more inconsistent (gemini-1.5-flash-001 had the best results of the previous generation).
For extended information on each Gemini model, check the documentation.
With the new SDK, the systemInstruction and model parameters must be passed in every generateContent call, so let's save them as constants to avoid typing them every time.
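The MODEL_ID constant used in the calls below is not defined elsewhere in this section; as a sketch, set it to the 2.0 Flash model discussed above (swap in a 2.5 model ID if you want to try segmentation later):

```ts
// Model used in all generateContent calls below.
// Swap in a 2.5 model ID to compare results or to try segmentation.
const MODEL_ID = "gemini-2.0-flash";
```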
const BOUNDING_BOX_SYSTEM_INSTRUCTION = `
  Return bounding boxes as a JSON array with labels. Never return masks or code fencing.
  Limit to 25 objects. If an object is present multiple times, name them according to their
  unique characteristic (colors, size, position, unique characteristics, etc..).
`;
The system instructions are mainly used to keep the prompts short by not having to repeat the output format every time. They also tell the model how to deal with similar objects, which is a nice way to let it be creative.
The Spatial understanding example uses a different strategy: no system instructions, but a longer prompt. You can see its full prompts by clicking on the "show raw prompt" button on the right. There is no optimal solution; experiment with different strategies and find the one that suits your use-case best.
Utilities
Some helper scripts are needed to download the sample images and draw the bounding boxes. They are just examples (a minimal drawing sketch follows the download output below), and you are free to write your own.
Socks.jpg already exists at ../assets/spatial_understanding/Socks.jpg
Vegetables.jpg already exists at ../assets/spatial_understanding/Vegetables.jpg
Japanese_bento.png already exists at ../assets/spatial_understanding/Japanese_bento.png
Cupcakes.jpg already exists at ../assets/spatial_understanding/Cupcakes.jpg
Origamis.jpg already exists at ../assets/spatial_understanding/Origamis.jpg
Fruits.jpg already exists at ../assets/spatial_understanding/Fruits.jpg
Cat.jpg already exists at ../assets/spatial_understanding/Cat.jpg
Pumpkins.jpg already exists at ../assets/spatial_understanding/Pumpkins.jpg
Breakfast.jpg already exists at ../assets/spatial_understanding/Breakfast.jpg
Bookshelf.jpg already exists at ../assets/spatial_understanding/Bookshelf.jpg
Spill.jpg already exists at ../assets/spatial_understanding/Spill.jpg
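Here is a minimal sketch of what a bounding-box drawing helper could look like. It assumes the sharp image library is installed (sharp is not part of the original notebook) and that the model response has already been parsed into objects; the notebook's own utilities may be implemented differently.

```ts
const sharp = require("sharp") as typeof import("sharp");

interface BoundingBox {
  box_2d: [number, number, number, number]; // [y0, x0, y1, x1], normalized to 0-1000
  label: string;
}

async function drawBoundingBoxes(imagePath: string, boxes: BoundingBox[], outputPath: string) {
  const metadata = await sharp(imagePath).metadata();
  const width = metadata.width ?? 0;
  const height = metadata.height ?? 0;

  // Build an SVG overlay with one rectangle and one label per box.
  // Note: labels are not XML-escaped here; do that for real use.
  const shapes = boxes
    .map(({ box_2d: [y0, x0, y1, x1], label }) => {
      const top = (y0 / 1000) * height;
      const left = (x0 / 1000) * width;
      const boxHeight = ((y1 - y0) / 1000) * height;
      const boxWidth = ((x1 - x0) / 1000) * width;
      return (
        `<rect x="${left}" y="${top}" width="${boxWidth}" height="${boxHeight}" ` +
        `fill="none" stroke="red" stroke-width="4"/>` +
        `<text x="${left}" y="${Math.max(top - 8, 20)}" fill="red" font-size="24">${label}</text>`
      );
    })
    .join("");
  const svg = `<svg width="${width}" height="${height}" xmlns="http://www.w3.org/2000/svg">${shapes}</svg>`;

  // Composite the SVG overlay onto the original image and save the result.
  await sharp(imagePath).composite([{ input: Buffer.from(svg) }]).toFile(outputPath);
}
```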
Overlaying Information
Let’s start by loading an image, the cupcakes one for example:
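The following calls use fs, path, and the CUPCAKE_IMAGE_PATH constant; if they are not already defined by the utility cells above, a minimal setup could be:

```ts
// Node built-ins used to read the sample images from disk.
const fs = require("fs") as typeof import("fs");
const path = require("path") as typeof import("path");

// Path to the cupcakes sample image downloaded by the utilities above.
const CUPCAKE_IMAGE_PATH = path.join("../assets/spatial_understanding/Cupcakes.jpg");
```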
Let’s start with a simple prompt to find all items in the image.
To prevent the model from repeating itself, it is recommended to use a temperature over 0, in this case 0.5. Limiting the number of items (25 in the system instructions) is also a way to prevent the model from looping and to speed up the decoding of the bounding boxes. You can experiment with these parameters and find what works best for your use-case.
const boundingBoxesResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    'Detect the 2d bounding boxes of the cupcakes (with "label" as topping description)',
    google.createPartFromBase64(
      fs.readFileSync(CUPCAKE_IMAGE_PATH).toString("base64"),
      "image/jpeg",
    ),
  ],
  config: {
    systemInstruction: BOUNDING_BOX_SYSTEM_INSTRUCTION,
    temperature: 0.5,
  },
});
console.log(boundingBoxesResponse.text);
```json
[
{"box_2d": [389, 64, 570, 201], "label": "Dark chocolate cupcake with red and pink sprinkles"},
{"box_2d": [383, 250, 538, 367], "label": "Light pink frosting with small pink sprinkles"},
{"box_2d": [366, 397, 501, 506], "label": "Pink frosting with light blue candy balls"},
{"box_2d": [353, 529, 521, 650], "label": "Pink frosting with blue candy balls"},
{"box_2d": [384, 737, 535, 865], "label": "Plain chocolate frosting cupcake"},
{"box_2d": [476, 627, 636, 768], "label": "White frosting with red, yellow, blue, and pink sprinkles (top-right)"},
{"box_2d": [509, 799, 688, 959], "label": "White frosting with red, yellow, blue, and pink sprinkles (middle-right)"},
{"box_2d": [554, 40, 723, 199], "label": "White frosting with mixed colorful sprinkles (front-left)"},
{"box_2d": [546, 296, 700, 442], "label": "White frosting with candy eyes and yellow/red sprinkles (middle-left)"},
{"box_2d": [442, 433, 595, 563], "label": "Bright pink frosting with candy eyes"},
{"box_2d": [559, 514, 712, 663], "label": "White frosting with candy eyes and yellow/red sprinkles (middle-right)"},
{"box_2d": [743, 134, 919, 306], "label": "White frosting with two candy eyes (bottom-left)"},
{"box_2d": [655, 354, 815, 513], "label": "White frosting with two candy eyes (bottom-right)"}
]
```
As you can see, even without detailed instructions about the format, Gemini is trained to always use this structure: a label and the coordinates of the bounding box in a "box_2d" array.
Just be careful: the y coordinates come first and the x coordinates after, contrary to common usage.
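To work with the response programmatically, you first need to strip the Markdown fences the model sometimes adds (despite the system instruction) and parse the JSON. A minimal sketch, with a hypothetical helper name:

```ts
// Extract the JSON array from the response text, tolerating ```json fences.
function parseBoundingBoxes(
  responseText: string,
): { box_2d: [number, number, number, number]; label: string }[] {
  const cleaned = responseText.replace(/```json|```/g, "").trim();
  return JSON.parse(cleaned);
}

const cupcakeBoxes = parseBoundingBoxes(boundingBoxesResponse.text ?? "");
console.log(`Parsed ${cupcakeBoxes.length} bounding boxes`);
```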
Let’s complicate things and search within the image for specific objects.
const SOCKS_IMAGE_PATH = path.join("../assets/spatial_understanding/Socks.jpg");

const socksBoundingBoxesResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Show me the positions of the socks with the face",
    google.createPartFromBase64(
      fs.readFileSync(SOCKS_IMAGE_PATH).toString("base64"),
      "image/jpeg",
    ),
  ],
  config: {
    systemInstruction: BOUNDING_BOX_SYSTEM_INSTRUCTION,
    temperature: 0.5,
  },
});
console.log(socksBoundingBoxesResponse.text);
[
{"box_2d": [240, 650, 649, 860], "label": "socks with the face"},
{"box_2d": [53, 248, 384, 524], "label": "socks with the face"}
]
Try it with different images and prompts. Different samples are proposed but you can also write your own.
Multilinguality
As Gemini is able to understand multiple languages, you can combine spatial reasoning with multilingual capabilities.
You can give it an image like this and prompt it to label each item with Japanese characters and an English translation. The model reads the text, recognizes the items pictured in the image, and translates them.
const JAPANESE_BENTO_IMAGE_PATH = path.join("../assets/spatial_understanding/Japanese_bento.png");

const japaneseBentoBoundingBoxesResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Detect food, label them with Japanese characters + english translation.",
    google.createPartFromBase64(
      fs.readFileSync(JAPANESE_BENTO_IMAGE_PATH).toString("base64"),
      "image/png",
    ),
  ],
  config: {
    systemInstruction: BOUNDING_BOX_SYSTEM_INSTRUCTION,
    temperature: 0.5,
  },
});
console.log(japaneseBentoBoundingBoxesResponse.text);
The model can also reason based on the image: you can ask it about the positions of items, their utility, or, like in this example, to find the shadow of a specific item.
const ORIGAMI_IMAGE_PATH = path.join("../assets/spatial_understanding/Origamis.jpg");

const origamiBoundingBoxesResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Draw a square around the fox' shadow",
    google.createPartFromBase64(
      fs.readFileSync(ORIGAMI_IMAGE_PATH).toString("base64"),
      "image/jpeg",
    ),
  ],
  config: {
    systemInstruction: BOUNDING_BOX_SYSTEM_INSTRUCTION,
    temperature: 0.5,
  },
});
console.log(origamiBoundingBoxesResponse.text);
You can also use Gemini's knowledge to enhance the labels returned. In this example, Gemini will give you advice on how to fix your little mistake.
const SPILL_IMAGE_PATH = path.join("../assets/spatial_understanding/Spill.jpg");

const spillBoundingBoxesResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    "Tell me how to clean my table with an explanation as label. Do not just label the items",
    google.createPartFromBase64(
      fs.readFileSync(SPILL_IMAGE_PATH).toString("base64"),
      "image/jpeg",
    ),
  ],
  config: {
    systemInstruction: BOUNDING_BOX_SYSTEM_INSTRUCTION,
    temperature: 0.5,
  },
});
console.log(spillBoundingBoxesResponse.text);
```json
[
{"box_2d": [447, 0, 1000, 417], "label": "cleaning cloth: Use this cloth to gently blot and wipe the spilled liquid from the table surface."},
{"box_2d": [432, 606, 663, 874], "label": "liquid spill: This is the substance that needs to be cleaned from the table."},
{"box_2d": [200, 299, 508, 517], "label": "sugar bowl: Move this item aside to ensure thorough cleaning of the table surface underneath and around it."},
{"box_2d": [106, 556, 220, 699], "label": "sugar container: Move this item aside to ensure thorough cleaning of the table surface underneath and around it."},
{"box_2d": [287, 140, 467, 314], "label": "knives: Carefully move these sharp objects to a safe location before cleaning to prevent injury and allow full access to the table surface."},
{"box_2d": [108, 843, 360, 1000], "label": "wooden object: Move this item aside to ensure thorough cleaning of the table surface underneath and around it."},
{"box_2d": [113, 224, 280, 492], "label": "wooden stand: Move this item aside to ensure thorough cleaning of the table surface underneath and around it."},
{"box_2d": [121, 120, 361, 229], "label": "wooden dowel: This is part of the wooden stand; move it with the stand to clear the area for cleaning."}
]
```
If you check the previous examples, the Japanese food one in particular, multiple other prompt samples are provided to experiment with Gemini's reasoning capabilities.
Experimental: Segmentation
The 2.5 models are also able to segment the image: not only do they draw a bounding box, they also provide a mask of the contour of each item. This is especially useful if you are planning on editing images, like in the Virtual try-on example.
const SEGMENTATION_PROMPT =
  'Give the segmentation masks for the metal, wooden and glass small items (ignore the table). Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key "box_2d", the segmentation mask in key "mask", and the text label in the key "label". Use descriptive labels.';

const segmentationResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    SEGMENTATION_PROMPT,
    google.createPartFromBase64(
      fs.readFileSync(CUPCAKE_IMAGE_PATH).toString("base64"),
      "image/jpeg",
    ),
  ],
  config: {
    temperature: 0.5,
  },
});
console.log(segmentationResponse.text);
The model predicts a JSON list where each item represents a segmentation mask. Each item has a bounding box ("box_2d") in the format [y0, x0, y1, x1] with normalized coordinates between 0 and 1000, a label ("label") that identifies the object, and lastly the segmentation mask inside the bounding box, as a base64-encoded PNG.
To use the mask, first base64-decode it, then load the result as a PNG. This gives you a probability map with values between 0 and 255. The mask needs to be resized to match the bounding box dimensions; then you can apply your confidence threshold, e.g. binarizing at 127 for the midpoint. Finally, pad the mask into an array the size of the full image.
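As an illustration of those steps (not the notebook's actual helper), here is a minimal sketch. It assumes the sharp image library is installed and that the mask string may carry a data-URI prefix; adapt it to your own setup.

```ts
const sharp = require("sharp") as typeof import("sharp");

interface SegmentationItem {
  box_2d: [number, number, number, number]; // [y0, x0, y1, x1], normalized to 0-1000
  mask: string; // base64-encoded PNG, possibly prefixed with "data:image/png;base64,"
  label: string;
}

async function decodeMask(
  item: SegmentationItem,
  imageWidth: number,
  imageHeight: number,
  threshold = 127,
): Promise<Uint8Array> {
  const [y0, x0, y1, x1] = item.box_2d;
  // Convert the normalized box coordinates to pixel coordinates.
  const top = Math.round((y0 / 1000) * imageHeight);
  const left = Math.round((x0 / 1000) * imageWidth);
  const boxHeight = Math.round(((y1 - y0) / 1000) * imageHeight);
  const boxWidth = Math.round(((x1 - x0) / 1000) * imageWidth);

  // Base64-decode the PNG, then resize it to the bounding box dimensions
  // and read it back as raw single-channel pixel values (0-255).
  const pngBuffer = Buffer.from(item.mask.replace(/^data:image\/png;base64,/, ""), "base64");
  const resized = await sharp(pngBuffer).resize(boxWidth, boxHeight).greyscale().raw().toBuffer();

  // Binarize at the threshold and pad into a full-image-sized mask.
  const fullMask = new Uint8Array(imageWidth * imageHeight);
  for (let y = 0; y < boxHeight; y++) {
    for (let x = 0; x < boxWidth; x++) {
      if (resized[y * boxWidth + x] > threshold) {
        fullMask[(top + y) * imageWidth + (left + x)] = 255;
      }
    }
  }
  return fullMask;
}
```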
All these steps are done by the parseSegmentationMasks function provided earlier.
Finally, use the plotSegmentationMasks function to visualize the decoded masks by overlaying them on the image.
Related to image recognition and reasoning, the Market a jet backpack and Guess the shape examples are worth checking out to continue your Gemini API discovery (note: these examples still use the old SDK), and of course the pointing and 3D boxes example referenced earlier.