Gemini API: Question Answering with LlamaIndex and Chroma

Overview

Gemini is a family of generative AI models that lets developers generate content and solve problems. These models are designed and trained to handle both text and images as input.

LlamaIndex is a simple, flexible data framework that can be used by Large Language Model (LLM) applications to connect custom data sources to LLMs.

Chroma is an open-source embedding database focused on simplicity and developer productivity. Chroma allows users to store embeddings and their metadata, embed documents and queries, and search the embeddings quickly.

In this notebook, you’ll learn how to create an application that answers questions using data from a website with the help of Gemini, LlamaIndex, and Chroma.

Setup

Install the Google GenAI SDK

Install the Google GenAI SDK from npm.

$ npm install @google/genai

Setup your API key

You can create your API key using Google AI Studio with a single click.

Remember to treat your API key like a password. Don’t accidentally save it in a notebook or source file you later commit to GitHub. In this notebook we will be storing the API key in a .env file. You can also set it as an environment variable or use a secret manager.

Here’s how to set it up in a .env file:

$ touch .env
$ echo "GEMINI_API_KEY=<YOUR_API_KEY>" >> .env
Tip

Another option is to set the API key as an environment variable. You can do this in your terminal with the following command:

$ export GEMINI_API_KEY="<YOUR_API_KEY>"

Load the API key

To load the API key from the .env file, we will use the dotenv package. This package loads environment variables from a .env file into process.env.

$ npm install dotenv

Then, we can load the API key in our code:

const dotenv = require("dotenv") as typeof import("dotenv");

dotenv.config({
  path: "../../.env",
});

const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? "";
if (!GEMINI_API_KEY) {
  throw new Error("GEMINI_API_KEY is not set in the environment variables");
}
console.log("GEMINI_API_KEY is set in the environment variables");
GEMINI_API_KEY is set in the environment variables
Note

In our particular case, the .env file is two directories up from the notebook, hence we need to use ../../ to go up two directories. If the .env file is in the same directory as the notebook, you can omit the path option altogether.

│
├── .env
└── examples
    └── llamaindex
        └── Gemini_LlamaIndex_QA_Chroma_WebPageReader.ipynb

Select a model

Now select the model you want to use in this guide, either by selecting one in the list or writing it down. Keep in mind that some models, like the 2.5 ones, are thinking models and thus take slightly more time to respond (cf. the thinking notebook for more details, and in particular to learn how to switch thinking off).

const tslab = require("tslab") as typeof import("tslab");

const MODEL_ID = "gemini-2.5-flash-preview-05-20";
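
If you want faster responses for simple lookups, you can disable thinking on Flash-class models by setting a zero thinking budget. Below is a minimal, hedged sketch using the Google GenAI SDK installed earlier; the prompt is illustrative, and see the thinking notebook for model-specific limits.

import { GoogleGenAI } from "@google/genai";

// Minimal sketch: disable thinking for a one-off call by setting the thinking budget to 0.
// The prompt is illustrative; a budget of 0 applies to Flash-class models.
const ai = new GoogleGenAI({ apiKey: GEMINI_API_KEY });

const quickResponse = await ai.models.generateContent({
  model: MODEL_ID,
  contents: "In one sentence, what is Retrieval Augmented Generation?",
  config: {
    thinkingConfig: { thinkingBudget: 0 }, // 0 disables thinking
  },
});
console.log(quickResponse.text);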

Basic steps

LLMs are trained offline on a large corpus of public data. Hence they cannot answer questions based on custom or private data accurately without additional context.

If you want to make use of LLMs to answer questions based on private data, you have to provide the relevant documents as context alongside your prompt. This approach is called Retrieval Augmented Generation (RAG).

You will use this approach to create a question-answering assistant using the Gemini text model integrated through LlamaIndex. The assistant is expected to answer questions about Google’s Gemini model. To make this possible you will add more context to the assistant using data from a website.

In this tutorial, you’ll implement the two main components in a RAG-based architecture:

  1. Retriever: Based on the user’s query, the retriever retrieves relevant snippets that add context from the document. In this tutorial, the document is the website data. The relevant snippets are passed as context to the next stage, the “Generator”.

  2. Generator: The relevant snippets from the website data are passed to the LLM along with the user’s query to generate accurate answers.

You’ll learn more about these stages in the upcoming sections while implementing the application.

1. Retriever

In this stage, you will perform the following steps:

  1. Read and parse the website data using LlamaIndex.

  2. Create embeddings of the website data. Embeddings are numerical representations (vectors) of text. Hence, text with similar meaning will have similar embedding vectors. You’ll make use of Gemini’s embedding model to create the embedding vectors of the website data.

  3. Store the embeddings in Chroma’s vector store. Chroma is a vector database. The Chroma vector store helps in the efficient retrieval of similar vectors. Thus, for adding context to the prompt for the LLM, relevant embeddings of the text matching the user’s question can be retrieved easily using Chroma.

  4. Create a Retriever from the Chroma vector store. The retriever will be used to pass relevant website embeddings to the LLM along with user queries.

Read and parse the website data

LlamaIndex provides a wide variety of data loaders. To know more about how to read and parse input data from different sources using the data loaders of LlamaIndex, read LlamaIndex’s loading data guide.
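
For instance, if your source data were local files rather than a web page, a loader like SimpleDirectoryReader could be used instead. The sketch below is illustrative only: the ./data directory is a placeholder, and the import path may differ slightly across LlamaIndex.TS versions.

import { SimpleDirectoryReader } from "@llamaindex/readers/directory";

// Sketch: load local files instead of a web page; "./data" is a placeholder path.
const localReader = new SimpleDirectoryReader();
const localDocuments = await localReader.loadData({ directoryPath: "./data" });
console.log(`Loaded ${localDocuments.length} documents`);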

You can use a variety of HTML parsers to extract the required text from the HTML content.

In this example, you’ll use JavaScript’s Cheerio library to parse the website data. After processing, the extracted text should be converted into LlamaIndex’s Document format.

import * as cheerio from "cheerio";
import { Document } from "@llamaindex/core/schema";

async function loadWebPage(url: string): Promise<Document[]> {
  const response = await fetch(url);
  const html = await response.text();
  const $ = cheerio.load(html);
  const textContent = $("p")
    .map((_, el) => $(el).text())
    .get()
    .join("\n");
  return [new Document({ text: textContent })];
}

const documents = await loadWebPage(
  "https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/"
);
console.log("Loaded Documents:", JSON.stringify(documents, null, 2));
Loaded Documents: [
  {
    "id_": "6cb005e5-cc40-4c7b-80d2-81e4839f7496",
    "metadata": {},
    "excludedEmbedMetadataKeys": [],
    "excludedLlmMetadataKeys": [],
    "relationships": {},
    "text": "Mar 25, 2025\n\n          Gemini 2.5 is a thinking model, designed to tackle increasingly complex problems. Our first 2.5 model, Gemini 2.5 Pro Experimental, leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities.\n        \nLast updated March 26\nToday we’re introducing Gemini 2.5, our most intelligent AI model. Our first 2.5 release is an experimental version of 2.5 Pro, which is state-of-the-art on a wide range of benchmarks and debuts at #1 on LMArena by a significant margin.\nGemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy.\nIn the field of AI, a system’s capacity for “reasoning” refers to more than just classification and prediction. It refers to its ability to analyze information, draw logical conclusions, incorporate context and nuance, and make informed decisions.\nFor a long time, we’ve explored ways of making AI smarter and more capable of reasoning through techniques like reinforcement learning and chain-of-thought prompting. Building on this, we recently introduced our first thinking model, Gemini 2.0 Flash Thinking.\nNow, with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.\nGemini 2.5 Pro Experimental is our most advanced model for complex tasks. It tops the LMArena leaderboard — which measures human preferences — by a significant margin, indicating a highly capable model equipped with high-quality style. 2.5 Pro also shows strong reasoning and code capabilities, leading on common coding, math and science benchmarks.\nGemini 2.5 Pro is available now in Google AI Studio and in the Gemini app for Gemini Advanced users, and will be coming to Vertex AI soon. We’ll also introduce pricing in the coming weeks, enabling people to use 2.5 Pro with higher rate limits for scaled production use.\nUpdated March 26 with new MRCR (Multi Round Coreference Resolution) evaluations\nGemini 2.5 Pro is state-of-the-art across a range of benchmarks requiring advanced reasoning. Without test-time techniques that increase cost, like majority voting, 2.5 Pro leads in math and science benchmarks like GPQA and AIME 2025.\nIt also scores a state-of-the-art 18.8% across models without tool use on Humanity’s Last Exam, a dataset designed by hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.\nWe’ve been focused on coding performance, and with Gemini 2.5 we’ve achieved a big leap over 2.0 — with more improvements to come. 2.5 Pro excels at creating visually compelling web apps and agentic code applications, along with code transformation and editing. On SWE-Bench Verified, the industry standard for agentic code evals, Gemini 2.5 Pro scores 63.8% with a custom agent setup.\nHere’s an example of how 2.5 Pro can use its reasoning capabilities to create a video game by producing the executable code from a single line prompt.\nGemini 2.5 builds on what makes Gemini models great — native multimodality and a long context window. 2.5 Pro ships today with a 1 million token context window (2 million coming soon), with strong performance that improves over previous generations. 
It can comprehend vast datasets and handle complex problems from different information sources, including text, audio, images, video and even entire code repositories.\nDevelopers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.\nAs always, we welcome feedback so we can continue to improve Gemini’s impressive new abilities at a rapid pace, all with the goal of making our AI more helpful.\nLet’s stay in touch. Get the latest news from Google in your inbox.\n\n              Follow Us\n            ",
    "textTemplate": "",
    "metadataSeparator": "\n",
    "type": "DOCUMENT",
    "hash": "k7FXp5mXlcyCy1+xm3yxQ/Kk8xQ9/mgcnJQeSn1TDiU="
  }
]

Initialize Gemini’s embedding model

To create the embeddings from the website data, you’ll use Gemini’s embedding model, gemini-embedding-001, which supports creating text embeddings.

To use this embedding model, you have to import GeminiEmbedding from LlamaIndex. To know more about the embedding model, read Google AI’s language documentation.

import { GeminiEmbedding, GEMINI_EMBEDDING_MODEL } from "@llamaindex/google";

const embedding_model = new GeminiEmbedding({ model: GEMINI_EMBEDDING_MODEL.EMBEDDING_001 });
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
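
To sanity-check the embedding model, you can embed a short string directly with the base getTextEmbedding method. This is a quick sketch; the sample text is arbitrary.

// Quick sketch: embed a short string and inspect the resulting vector.
// The sample text is arbitrary; only the first few dimensions are printed.
const sampleEmbedding = await embedding_model.getTextEmbedding("Gemini is a family of multimodal models.");
console.log("Embedding length:", sampleEmbedding.length);
console.log("First values:", sampleEmbedding.slice(0, 5));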

Initialize Gemini

You must import the gemini factory from the @llamaindex/google package to initialize your model. In this example, you will use Gemini 2.5 Flash, as it supports text summarization. To know more about the text model, read Google AI’s model documentation.

You can configure the model parameters such as temperature or topP, using the generationConfig parameter when initializing the Gemini LLM. To learn more about the model parameters and their uses, read Google AI’s concepts guide.

import { gemini, GEMINI_MODEL } from "@llamaindex/google";

const llm = gemini({ model: GEMINI_MODEL.GEMINI_2_5_FLASH_PREVIEW });
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
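
For example, a lower temperature makes answers more deterministic, which suits factual question answering. The sketch below is hedged: depending on your @llamaindex/google version, these may be accepted as top-level options rather than nested under a generationConfig object, so check the package documentation for the exact option names.

// Sketch: configure sampling parameters when creating the LLM.
// Option names (temperature, topP) are assumptions; verify against your package version.
const deterministic_llm = gemini({
  model: GEMINI_MODEL.GEMINI_2_5_FLASH_PREVIEW,
  temperature: 0.1, // lower temperature -> more deterministic answers
  topP: 0.95,
});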

Store the data using Chroma

Next, you’ll store the embeddings of the website data in Chroma’s vector store using LlamaIndex.

First, you have to initialize a JavaScript client in chromadb. You will use the ChromaClient. You can read more about the different clients in Chroma in the client reference guide.

After initializing the client, you have to create a Chroma collection. You’ll then initialize the ChromaVectorStore class in LlamaIndex using the collection created in the previous step.

Next, you have to set Settings and create storage contexts for the vector store.

Settings is a collection of commonly used resources that are utilized during the indexing and querying phases of a LlamaIndex pipeline. You can specify the LLM, embedding model, and other resources that will be used to create the application in Settings. To know more about Settings, read the module guide for Settings.

StorageContext is an abstraction offered by LlamaIndex around different types of storage. To know more about storage context, read the storage context API guide.

The final step is to load the documents and build an index over them. LlamaIndex offers several indices that help in retrieving relevant context for a user query. Here you’ll use the VectorStoreIndex since the website embeddings have to be stored in a vector store.

To create the index you have to pass the storage context along with the documents to the fromDocuments function of VectorStoreIndex. The VectorStoreIndex uses the embedding model specified in the Settings to create embedding vectors from the documents and stores these vectors in the vector store specified in the storage context. To know more about the VectorStoreIndex you can read the Using VectorStoreIndex guide.

import { ChromaClient } from "chromadb";
import { ChromaVectorStore } from "@llamaindex/chroma";
import { Settings, VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

Settings.llm = llm;
Settings.embedModel = embedding_model;

const CHROMA_HOST = process.env.CHROMA_HOST ?? "localhost";
const CHROMA_PORT = parseInt(process.env.CHROMA_PORT ?? "8000");

const chroma_client = new ChromaClient({
  host: CHROMA_HOST,
  port: CHROMA_PORT,
});
const collection = await chroma_client.getOrCreateCollection({
  name: "quickstart",
});

const vector_store = new ChromaVectorStore({ collectionName: "quickstart" });
const storage_context = await storageContextFromDefaults({ vectorStore: vector_store });

const index = await VectorStoreIndex.fromDocuments(documents, {
  storageContext: storage_context,
});

Create a retriever using Chroma

You’ll now create a retriever that can retrieve data embeddings from the newly created Chroma vector store.

You can use the collection name to initialize the ChromaVectorStore in which you store the embeddings of the website data. You can then use the fromVectorStore function of VectorStoreIndex to load the index.

const quickstart_vector_store = new ChromaVectorStore({ collectionName: "quickstart" });

const quickstart_index = await VectorStoreIndex.fromVectorStore(quickstart_vector_store);
const test_query_engine = quickstart_index.asQueryEngine();
const response = await test_query_engine.query({ query: "AIME" });
tslab.display.markdown(response.toString());

AIME 2025 is a math and science benchmark in which Gemini 2.5 Pro leads.
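
If you want the raw retrieval step without generation, you can also create a retriever directly from the index. A minimal sketch follows; the similarityTopK value and query string are illustrative, and the retrieve signature may vary slightly between LlamaIndex.TS versions.

import { MetadataMode } from "llamaindex";

// Sketch: fetch the top-matching nodes without generating an answer.
// similarityTopK and the query string are illustrative values.
const retriever = quickstart_index.asRetriever({ similarityTopK: 3 });
const retrievedNodes = await retriever.retrieve({ query: "coding benchmarks" });
for (const { node, score } of retrievedNodes) {
  console.log(score, node.getContent(MetadataMode.NONE).slice(0, 120));
}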

2. Generator

The Generator prompts the LLM for an answer when the user asks a question. The retriever you created in the previous stage from the Chroma vector store will be used to pass relevant embeddings from the website data to the LLM to provide more context to the user’s query.

You’ll perform the following steps in this stage:

  1. Create a prompt for answering any question using LlamaIndex.
  2. Use a query engine to ask a question and prompt the model for an answer.

Create prompt templates

You’ll use LlamaIndex’s PromptTemplate to generate prompts to the LLM for answering questions.

In the textQaPrompt, the {query} variable will be replaced by the input question, and the {context} variable will be replaced by the relevant text retrieved from the Chroma vector store.

import { TextQAPrompt, PromptTemplate, getResponseSynthesizer } from "llamaindex";

const textQaPrompt = `
    You are an assistant for question-answering tasks.
    Use the following context to answer the question.
    If you don't know the answer, just say that you don't know.
    Use five sentences maximum and keep the answer concise.\n
    Question: {query} \nContext: {context} \nAnswer:
`;

const response_synthesizer = getResponseSynthesizer("compact", {
  textQATemplate: new PromptTemplate({
    template: textQaPrompt,
    templateVars: ["query", "context"],
  }) as TextQAPrompt,
});

Prompt the model using Query Engine

You will use the asQueryEngine function of the VectorStoreIndex to create a query engine from the index using the response_synthesizer passed as the value for the responseSynthesizer argument. You can then use the query function of the query engine to prompt the LLM. To know more about custom prompting in LlamaIndex, read LlamaIndex’s prompts usage pattern documentation.

const query_engine = index.asQueryEngine({ responseSynthesizer: response_synthesizer });
const answer = await query_engine.query({ query: "What is Gemini?" });
tslab.display.markdown(answer.toString());

Gemini 2.5 is Google’s most intelligent AI model, described as a “thinking model” designed to tackle increasingly complex problems. It is capable of reasoning through its thoughts before responding, which enhances performance and accuracy. This model combines an enhanced base with improved post-training, featuring native multimodality and a long context window. The current experimental version, Gemini 2.5 Pro, leads common benchmarks and demonstrates strong reasoning and code capabilities. It can comprehend vast datasets from various sources, including text, audio, images, and video.

What’s next?

This notebook showed only one possible use case for LlamaIndex with the Gemini API. You can find many more here.