Gemini API: Using Gemini API to tag and caption images

You will use the Gemini model’s vision capabilities and the embedding model to add tags and captions to images of pieces of clothing.

These descriptions can be used alongside embeddings to allow you to search for specific pieces of clothing using natural language, or other images.

Setup

Install the Google GenAI SDK

Install the Google GenAI SDK from npm.

$ npm install @google/genai

Setup your API key

You can create your API key using Google AI Studio with a single click.

Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.

Here’s how to set it up in a .env file:

$ touch .env
$ echo "GEMINI_API_KEY=<YOUR_API_KEY>" >> .env
Tip

Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:

$ export GEMINI_API_KEY="<YOUR_API_KEY>"

Load the API key

To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.

$ npm install dotenv

Then, we can load the API key in our code:

const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({
  path: "../.env",
});

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note

In our particular case the .env file is one directory up from the notebook, hence the ../ to go up one directory. If the .env file is in the same directory as the notebook, you can omit the path option altogether.

│
├── .env
└── examples
    └── Tag_and_caption_images.ipynb

Initialize SDK Client

With the new SDK, you only need to initialize a client with your API key (or OAuth credentials if you are using Vertex AI). The model is now specified in each call.

const google = require("@google/genai") as typeof import("@google/genai");

const ai = new google.GoogleGenAI({ apiKey: GEMINI_API_KEY });

Select a model

Now select the model you want to use in this guide, either by choosing one from the list or entering its name. Keep in mind that some models, like the 2.5 ones, are thinking models and thus take slightly more time to respond (cf. the thinking notebook for more details, and in particular how to switch thinking off).

const tslab = require("tslab") as typeof import("tslab");

const MODEL_ID = "gemini-2.5-flash-preview-05-20";

Downloading dataset

First, you need to download a dataset with images. It contains images of various clothing that you can use to test the model.

const fs = require("fs") as typeof import("fs");
const path = require("path") as typeof import("path");
const AdmZip = require("adm-zip") as typeof import("adm-zip");

const DATA_URL = "https://storage.googleapis.com/generativeai-downloads/data/clothes-dataset.zip";
const EXTRACT_PATH = "../assets/tag_and_caption_images";

async function downloadAndExtractDataset(): Promise<void> {
  if (fs.existsSync(`${EXTRACT_PATH}/clothes-dataset`)) {
    console.log("Dataset already exists. Skipping download.");
    return;
  }

  const response = await fetch(DATA_URL);
  const buffer = await response.arrayBuffer();

  console.log("Extracting dataset...");
  await fs.promises.mkdir(EXTRACT_PATH, { recursive: true });

  const zipPath = path.join(EXTRACT_PATH, "clothes-dataset.zip");
  fs.writeFileSync(zipPath, Buffer.from(buffer));

  const zip = new AdmZip(zipPath);
  zip.extractAllTo(EXTRACT_PATH, true);

  console.log("Dataset extracted.");
}

await downloadAndExtractDataset();
Dataset already exists. Skipping download.
const images = fs
  .readdirSync(`${EXTRACT_PATH}/clothes-dataset`)
  .map((file) => path.join(EXTRACT_PATH, "clothes-dataset", file))
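  // Descending lexicographic sort, which is why 10.jpg appears between 2.jpg and 1.jpg in the output below.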
  .sort((a, b) => b.localeCompare(a));
console.log(`Found ${images.length} images in the dataset. ${images}`);
Found 10 images in the dataset. ../assets/tag_and_caption_images/clothes-dataset/9.jpg,../assets/tag_and_caption_images/clothes-dataset/8.jpg,../assets/tag_and_caption_images/clothes-dataset/7.jpg,../assets/tag_and_caption_images/clothes-dataset/6.jpg,../assets/tag_and_caption_images/clothes-dataset/5.jpg,../assets/tag_and_caption_images/clothes-dataset/4.jpg,../assets/tag_and_caption_images/clothes-dataset/3.jpg,../assets/tag_and_caption_images/clothes-dataset/2.jpg,../assets/tag_and_caption_images/clothes-dataset/10.jpg,../assets/tag_and_caption_images/clothes-dataset/1.jpg

Generating keywords

You can use the LLM to extract relevant keywords from the images.

Here is a helper function for calling the Gemini API with images. The sleep ensures that the rate-limit quota is not exceeded; refer to the pricing page for current quotas.

async function generateKeywords(prompt: string, imagePath: string, sleepTime = 4000): Promise<string> {
  const imageBuffer = fs.readFileSync(imagePath);
  const start = performance.now();
  const response = await ai.models.generateContent({
    model: MODEL_ID,
    contents: [google.createPartFromBase64(imageBuffer.toString("base64"), "image/jpeg")],
    config: {
      systemInstruction: prompt,
    },
  });
  const end = performance.now();
  const duration = end - start;
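  // Pad short requests with a sleep so back-to-back calls stay under the per-minute request quota.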
  if (duration < sleepTime) {
    await new Promise((resolve) => setTimeout(resolve, sleepTime - duration));
  }
  return response.text ?? "";
}
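The fixed sleep is enough for this notebook's small dataset. If you still run into rate-limit errors, one option is a generic retry wrapper with exponential backoff; here is a minimal sketch (not part of the original notebook), assuming quota errors surface as thrown exceptions:

async function withBackoff<T>(fn: () => Promise<T>, retries = 3, baseDelayMs = 2000): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error;
      // Wait 2s, 4s, 8s, ... before retrying.
      const delayMs = baseDelayMs * 2 ** attempt;
      console.warn(`Request failed, retrying in ${delayMs} ms...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example usage: await withBackoff(() => generateKeywords(KEYWORD_PROMPT, images[0]));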

First, define the list of possible keywords.

const keywords = [
  ["flannel", "shorts", "pants", "dress", "T-shirt", "shirt", "suit"],
  ["women", "men", "boys", "girls"],
  ["casual", "sport", "elegant"],
  ["fall", "winter", "spring", "summer"],
  ["red", "violet", "blue", "green", "yellow", "orange", "black", "white"],
  ["polyester", "cotton", "denim", "silk", "leather", "wool", "fur"],
].flat();

Next, define a prompt that guides the model toward keywords that describe clothing. The prompt below uses few-shot prompting to prime the LLM with examples of how the keywords should be formatted and which ones are valid.

const KEYWORD_PROMPT = `
    You are an expert in clothing that specializes in tagging images of clothes,
    shoes, and accessories.
    Your job is to extract all relevant keywords from
    a photo that will help describe an item.
    You are going to see an image,
    extract only the keywords for the clothing, and try to provide as many
    keywords as possible.

    Allowed keywords: ${keywords.join(", ")}.

    Extract tags only when it is obvious that it describes the main item in
    the image. Return the keywords as a list of strings:

    example1: ["blue", "shoes", "denim"]
    example2: ["sport", "skirt", "cotton", "blue", "red"]
`;

Generate keywords for each of the images.

for (const image of images.slice(0, 5)) {
  const keywords = await generateKeywords(KEYWORD_PROMPT, image);
  tslab.display.jpeg(fs.readFileSync(image));
  console.log(`Keywords for ${path.basename(image)}: ${keywords}`);
}

Keywords for 9.jpg: ["shorts", "denim", "blue", "casual", "summer", "spring", "cotton"]

Keywords for 8.jpg: ["blue", "suit", "shirt", "men", "elegant", "white", "black"]

Keywords for 7.jpg: ["suit", "men", "elegant", "blue", "black", "silk"]

Keywords for 6.jpg: ["T-shirt", "dress", "casual"]

Keywords for 5.jpg: ["dress", "women", "elegant", "red", "spring", "polyester"]

Keyword correction and deduplication

Unfortunately, despite being given a list of allowed keywords, the model can still, at least in theory, return an invalid one. It may be a near-duplicate of a valid keyword, e.g. “jeans” instead of “denim”, or be completely unrelated to any keyword from the list.

To address these issues, you can use embeddings to map the keywords to predefined ones and remove unrelated ones.

import * as danfo from "danfojs-node";

const EMBEDDING_MODEL_ID = "models/embedding-001";

async function embedKeywords(keywords: string[]): Promise<danfo.DataFrame> {
  const embeddings = await ai.models.embedContent({
    model: EMBEDDING_MODEL_ID,
    contents: keywords,
    config: {
      taskType: "semantic_similarity",
    },
  });

  const data = keywords.map((keyword, index) => ({
    keyword,
    embedding: embeddings.embeddings[index].values,
  }));

  return new danfo.DataFrame(data);
}

const keywordsDf = await embedKeywords(keywords);
keywordsDf.head().print();
╔════════════╤═══════════════════╤═══════════════════╗
║            │ keyword           │ embedding         ║
╟────────────┼───────────────────┼───────────────────╢
║ 0          │ flannel           │ -0.010662257,-0…  ║
╟────────────┼───────────────────┼───────────────────╢
║ 1          │ shorts            │ 0.033239827,0.0…  ║
╟────────────┼───────────────────┼───────────────────╢
║ 2          │ pants             │ 0.012273628,-0.…  ║
╟────────────┼───────────────────┼───────────────────╢
║ 3          │ dress             │ 0.010586792,-0.…  ║
╟────────────┼───────────────────┼───────────────────╢
║ 4          │ T-shirt           │ -0.015854606,-0…  ║
╚════════════╧═══════════════════╧═══════════════════╝

For demonstration purposes, define a function that assesses the similarity between two embedding vectors. In this case, you will use cosine similarity, but other measures such as dot product work too.

function cosineSimilarity(array1: number[], array2: number[]): number {
  const dotProduct = array1.reduce((sum, value, index) => sum + value * array2[index], 0);
  const norm1 = Math.sqrt(array1.reduce((sum, value) => sum + value * value, 0));
  const norm2 = Math.sqrt(array2.reduce((sum, value) => sum + value * value, 0));
  return dotProduct / (norm1 * norm2);
}
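As mentioned above, a dot product works as well; for embeddings that are already unit-normalized, it produces exactly the same ranking as cosine similarity. Here is a minimal sketch (not used elsewhere in this notebook):

// Plain dot product; for unit-length vectors this equals cosine similarity.
function dotProduct(array1: number[], array2: number[]): number {
  return array1.reduce((sum, value, index) => sum + value * array2[index], 0);
}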

Next, define a function that allows you to replace a keyword with the most similar word in the keyword dataframe that you have previously created.

Note that the threshold is chosen somewhat arbitrarily; it may require tweaking depending on your use case and dataset.

/* eslint-disable @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-call */

import * as danfo from "danfojs-node";

async function replaceWordWithMostSimilar(
  keyword: string,
  keywordsDf: danfo.DataFrame,
  threshold = 0.7
): Promise<string | null> {
  // No need for embeddings if the keyword is valid.
  if (keywordsDf.keyword.values.includes(keyword)) {
    return keyword;
  }

  const embedding = await ai.models.embedContent({
    model: EMBEDDING_MODEL_ID,
    contents: [keyword],
    config: {
      taskType: "semantic_similarity",
    },
  });

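  // danfo stores the embedding column as comma-separated strings, so parse each entry back into numbers.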
  const similarities = keywordsDf.embedding.values.map((rowEmbedding: string) =>
    cosineSimilarity(embedding.embeddings[0].values!, rowEmbedding.split(",").map(Number))
  ) as number[];

  const mostSimilarKeywordIndex = similarities.indexOf(Math.max(...similarities));
  if (similarities[mostSimilarKeywordIndex] < threshold) {
    return null;
  }
  return keywordsDf.loc({
    rows: [mostSimilarKeywordIndex],
    columns: ["keyword"],
  }).keyword.values[0] as string;
}

Here is an example of how out-of-list keywords are mapped to the allowed keyword with the closest meaning.

for (const word of ["purple", "tank top", "everyday"]) {
  const similarWord = await replaceWordWithMostSimilar(word, keywordsDf);
  console.log(`${word} -> ${similarWord}`);
}
purple -> violet
tank top -> T-shirt
everyday -> casual

You can now either keep the words that do not fit the predefined categories or drop them. In this scenario, all words without a suitable replacement are omitted.

async function mapGeneratedKeywordsToPredefined(
  generatedKeywords: string[],
  keywordsDataframe: danfo.DataFrame = keywordsDf
): Promise<Set<string>> {
  const outputKeywords = new Set<string>();
  for (const keyword of generatedKeywords) {
    const mappedKeyword = await replaceWordWithMostSimilar(keyword, keywordsDataframe);
    if (mappedKeyword) {
      outputKeywords.add(mappedKeyword);
    }
  }
  return outputKeywords;
}

console.log(await mapGeneratedKeywordsToPredefined(["white", "business", "sport", "women", "polyester"]));
console.log(await mapGeneratedKeywordsToPredefined(["blue", "jeans", "women", "denim", "casual"]));
Set(4) { 'white', 'sport', 'women', 'polyester' }
Set(4) { 'blue', 'denim', 'women', 'casual' }

Generating captions

You can also use the model to generate a short natural-language caption for each image; these captions will be embedded later to power the search.

const CAPTION_PROMPT = `
  You are an expert in clothing that specializes in describing images of
  clothes, shoes and accessories.
  Your job is to extract information from a photo that will help describe an item.
  You are going to see an image; focus only on the piece of clothing and
  ignore the surroundings.
  Be specific, but stay concise: the description should be only one sentence long.
  The most important aspects are color, type of clothing, material, style,
  and who it is meant for.
  If you are not sure about a part of the image, ignore it.
`;
for (const image of images.slice(0, 5)) {
  const caption = await generateKeywords(CAPTION_PROMPT, image);
  tslab.display.jpeg(fs.readFileSync(image));
  console.log(`Caption for ${path.basename(image)}: ${caption}`);
}

Caption for 9.jpg: A pair of distressed men's medium blue denim shorts featuring ripped details and a faded wash.

Caption for 8.jpg: A blue single-breasted men's suit jacket is styled with a white dress shirt, a black tie, and a white pocket square.

Caption for 7.jpg: A men's blue single-breasted tuxedo jacket features black satin shawl lapels, a one-button closure, and black-trimmed flap pockets.

Caption for 6.jpg: An oversized dusty pink cotton jersey t-shirt dress features a classic crew neck and wide, elbow-length sleeves.

Caption for 5.jpg: A women's deep magenta, flowy tunic dress with a V-neck, short sleeves, and elegant draped detailing.

Searching for specific clothes

Preparing the dataset

First, generate a caption and keywords for every image. Then embed the captions; these embeddings will later be used to compare the images in the search dataset with other descriptions and images.

async function generateKeywordAndCaption(imagePath: string): Promise<{
  imagePath: string;
  keywords: string;
  caption: string;
}> {
  const keywords = await generateKeywords(KEYWORD_PROMPT, imagePath, 10000);
  let parsedKeywords: string[] = [];
  try {
    const matches = /\[([\s\S]*?)\]/.exec(keywords);
    if (matches?.[1]) {
      parsedKeywords = matches[1].split(",").map((keyword) => keyword.trim().replace(/"/g, ""));
    }
  } catch (error) {
    console.error(`Failed to parse keywords for ${imagePath} - ${keywords}:`, error);
  }
  const mappedKeywords = await mapGeneratedKeywordsToPredefined(parsedKeywords, keywordsDf);
  const caption = await generateKeywords(CAPTION_PROMPT, imagePath, 10000);
  return {
    imagePath,
    keywords: Array.from(mappedKeywords).join(", "),
    caption,
  };
}
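The regex above pulls the first bracketed list out of the model's reply. As a quick illustration of what that parsing produces (using a made-up reply string):

// Hypothetical model reply, only to illustrate the parsing logic above.
const exampleReply = 'Here are the tags: ["blue", "denim", "casual"]';
const exampleMatch = /\[([\s\S]*?)\]/.exec(exampleReply);
const exampleParsed = exampleMatch ? exampleMatch[1].split(",").map((keyword) => keyword.trim().replace(/"/g, "")) : [];
console.log(exampleParsed);
[ 'blue', 'denim', 'casual' ]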

You will use only the first 8 images, so the rest can be used for testing.

const describedDf = new danfo.DataFrame(await Promise.all(images.slice(0, 8).map(generateKeywordAndCaption)), {
  columns: ["imagePath", "keywords", "caption"],
});
describedDf.print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ imagePath         │ keywords          │ caption           ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ ../assets/tag_a…  │ shorts, denim, …  │ These are distr…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ ../assets/tag_a…  │ suit, men, eleg…  │ This is a light…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ ../assets/tag_a…  │ suit, blue, ele…  │ This is a men's…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ ../assets/tag_a…  │ dress, cotton     │ This is a pink …  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ ../assets/tag_a…  │ dress, women, s…  │ This is a short…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 5          │ ../assets/tag_a…  │ dress, women, s…  │ This is a color…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 6          │ ../assets/tag_a…  │ blue, denim       │ These are light…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 7          │ ../assets/tag_a…  │ flannel, shirt,…  │ This is a dark …  ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝
async function embedRow(row: string): Promise<string> {
  const embedding = await ai.models.embedContent({
    model: EMBEDDING_MODEL_ID,
    contents: [row],
    config: {
      taskType: "semantic_similarity",
    },
  });
  return embedding.embeddings[0].values.join(",");
}

const rowEmbeddings = await Promise.all(describedDf.caption.values.map(embedRow));
describedDf.addColumn("embedding", new danfo.Series(rowEmbeddings), { inplace: true });
describedDf.head().print();
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ imagePath         │ keywords          │ caption           │ embedding         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ ../assets/tag_a…  │ shorts, denim, …  │ These are distr…  │ 0.04526093,-0.0…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ ../assets/tag_a…  │ suit, men, eleg…  │ This is a light…  │ 0.044858858,-0.…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ ../assets/tag_a…  │ suit, blue, ele…  │ This is a men's…  │ 0.058658198,-0.…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ ../assets/tag_a…  │ dress, cotton     │ This is a pink …  │ 0.046889026,-0.…  ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ ../assets/tag_a…  │ dress, women, s…  │ This is a short…  │ 0.04646374,-0.0…  ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Finding clothes using natural language

To search with a text query, embed the query and compare it against the stored caption embeddings, returning the image with the highest cosine similarity.

async function findImageFromText(
  text: string,
  describedDataFrame: danfo.DataFrame = describedDf
): Promise<string | null> {
  const textEmbedding = await ai.models.embedContent({
    model: EMBEDDING_MODEL_ID,
    contents: [text],
    config: {
      taskType: "semantic_similarity",
    },
  });

  const similarities = describedDataFrame.embedding.values.map((rowEmbedding: string) =>
    cosineSimilarity(textEmbedding.embeddings[0].values, rowEmbedding.split(",").map(Number))
  ) as number[];

  const mostFittingImageIndex = similarities.indexOf(Math.max(...similarities));
  if (similarities[mostFittingImageIndex] < 0.7) {
    return null;
  }
  return describedDataFrame.loc({ rows: [mostFittingImageIndex], columns: ["imagePath"] }).imagePath
    .values[0] as string;
}
const imPath = await findImageFromText("A suit for a wedding.");
console.log(`Found image for the text: ${imPath}`);
Found image for the text: ../assets/tag_and_caption_images/clothes-dataset/8.jpg
tslab.display.jpeg(fs.readFileSync(imPath));

tslab.display.jpeg(fs.readFileSync(await findImageFromText("A colorful dress.")));

Finding similar clothes using images

To search with an image instead of text, generate a caption for the query image and compare its embedding against the stored caption embeddings.

async function findImageFromImage(imagePath: string): Promise<string | null> {
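  // Caption the query image (via generateKeywordAndCaption), embed that caption, then pick the stored image whose caption embedding is most similar.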
  const textEmbedding = await embedRow(await generateKeywordAndCaption(imagePath).then((res) => res.caption));
  const similarities = describedDf.embedding.values.map((rowEmbedding: string) =>
    cosineSimilarity(textEmbedding.split(",").map(Number), rowEmbedding.split(",").map(Number))
  ) as number[];
  const mostFittingImageIndex = similarities.indexOf(Math.max(...similarities));
  if (similarities[mostFittingImageIndex] < 0.7) {
    return null;
  }
  return describedDf.loc({ rows: [mostFittingImageIndex], columns: ["imagePath"] }).imagePath.values[0] as string;
}
const img1 = images[8];
tslab.display.jpeg(fs.readFileSync(img1));
tslab.display.jpeg(fs.readFileSync(await findImageFromImage(img1)));

const img2 = images[9];
tslab.display.jpeg(fs.readFileSync(img2));
tslab.display.jpeg(fs.readFileSync(await findImageFromImage(img2)));

Summary

You have used the Gemini API's JS SDK to tag and caption images of clothing. Using the embedding model, you were able to search a database of images for clothing that matches a natural-language description or resembles a provided clothing item.