7  Retrieval Augmented Generation

LLMs generally have limited knowledge; they only know what was ‘taught’ to them through their training texts. Any new information not included in these texts is beyond their generation capability. It is impractical to retrain LLMs every time we want them to generate text with new information. A simple solution is to provide the necessary information directly in our queries. However, if we do not know exactly what we are looking for, and the required information might come from a thick technical manual, a more practical approach is Retrieval-Augmented Generation (RAG).

In this chapter, we will explore how to perform RAG by pairing our knowledge in the form of a vector database with LLMs. So far, most resources I have found are in Python, and I aim to achieve this using R.

7.1 Background

[deepseek-r1:7b]

RAG is an approach that enhances LLMs by augmenting their capabilities with document retrieval systems. This integration allows for more accurate, relevant, and contextually informed responses compared to relying solely on the model’s internal knowledge.

How It Works:

  • Document Retrieval: When a query is received, RAG first retrieves relevant documents or passages from an external source (e.g., a vector database, described below) using techniques like embedding-based search.
  • Context Integration: The retrieved documents are then provided to the LLM as context (the ‘augmentation’ step), enabling the model to generate responses informed by both its internal knowledge and the specific retrieved information.

Key Components:

  • Embedding-Based Retrieval: Documents are converted into vector representations, allowing efficient matching with query vectors through similarity measures.
  • Efficient Search: Techniques like vector databases or similarity search quickly identify relevant documents, ensuring timely retrieval without significant computational overhead.
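To make similarity matching concrete, here is a small R illustration using cosine similarity, one common similarity measure (the vectors here are made-up toy examples, not real embeddings):

```r
# Cosine similarity between two numeric vectors: values near 1 mean "similar"
cosine_sim = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy "embeddings" (real embeddings have hundreds of dimensions)
query_vec = c(0.9, 0.1, 0.3)
doc_vec1  = c(0.8, 0.2, 0.4)  # points in a similar direction to the query
doc_vec2  = c(0.1, 0.9, 0.2)  # points in a different direction

cosine_sim(query_vec, doc_vec1)  # high similarity
cosine_sim(query_vec, doc_vec2)  # much lower similarity
```

In a real retrieval system, the documents with the highest similarity (or smallest distance) to the query vector are the ones returned as context.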

Why Use RAG?

  • Enhanced Accuracy: By incorporating up-to-date or specific data, RAG reduces reliance on the model’s potentially outdated or generic knowledge.
  • Handling Specialized Queries: RAG is particularly effective for domain-specific questions requiring detailed information beyond the model’s training data.
  • Efficiency: Focusing the model on a subset of relevant documents can optimize performance and reduce computational demands.

Use Cases:

  • Customer service chatbots needing access to product details or customer records.
  • Healthcare applications where accurate medical guidelines are crucial.

RAG differs from fine-tuning (which involves retraining the model) or prompt engineering (which relies on phrasing) by adding external context without altering the model itself. RAG effectively bridges the gap between LLM general knowledge and specific, detailed information, offering more accurate and relevant responses through augmented input.

7.2 Vector database

[gemma3:12b]

At its core, a vector database is a specialized type of database that is optimized for storing, searching, and managing vector embeddings. Let’s unpack each of those terms:

Vector Embeddings: These are numerical representations of data (text, images, audio, video, etc.). They’re generated using machine learning models (often transformer-based models like Sentence Transformers, OpenAI’s embeddings models, or others). The key is that similar data points (e.g., semantically similar sentences) will have vector embeddings that are close together in the vector space. Dissimilar data points will have embeddings that are farther apart. Think of it like mapping concepts to points in a multi-dimensional space. The dimensions represent various semantic features. A sentence like “The cat sat on the mat” will have a vector embedding that’s closer to “A feline rested upon the rug” than to “The stock market crashed yesterday.”

Database: Like a traditional database (e.g., MySQL, PostgreSQL), a vector database stores data. However, instead of storing data in rows and columns, it stores the vector embeddings. The real power comes from its search capabilities.

Optimized for Similarity Search: This is the critical difference. Traditional databases are great for exact matches (e.g., find all users with the name “Alice”). Vector databases are optimized for finding the most similar vectors. This is often done using algorithms like Approximate Nearest Neighbor (ANN) search, which trades off a small amount of accuracy for massive speed improvements when dealing with large datasets. ANN techniques allow for very fast retrieval of the nearest neighbors in the vector space.
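For intuition, exact (brute-force) nearest-neighbor search simply computes the distance from the query to every stored vector and sorts; ANN methods avoid this full scan. A toy sketch in R with random data:

```r
set.seed(1)
# A toy "database": 1000 stored vectors (rows) in 8 dimensions
db = matrix(rnorm(1000 * 8), nrow = 1000)
query = rnorm(8)

# Squared Euclidean ("l2") distance from the query to every stored vector
dists = rowSums(sweep(db, 2, query)^2)

# Indices of the 3 exact nearest neighbors
order(dists)[1:3]
```

With a few thousand vectors this is instant, but at millions of vectors the full scan becomes the bottleneck, which is where ANN indexes (e.g. HNSW, which we will see in Chroma's collection configuration below) come in.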

Commonly used vector databases include Chroma, FAISS, Pinecone, and others (https://www.datacamp.com/blog/the-top-5-vector-databases, https://lakefs.io/blog/12-vector-databases-2023/).

7.3 Basic RAG

In Chapter 5, we learned about text embeddings and how to utilize them to find texts related to our query. Now, in the context of RAG, let’s go back to the text in Section 5.2.1 (https://ollama.com/blog/embedding-models). Suppose that we now want to ask a query, find the related text, and then let an LLM answer our query based on the information (i.e. the retrieved text) given to it.

As described before, there are several vector databases available for RAG with LLMs. In this book, we will use Chroma (https://www.trychroma.com/) and rchroma (https://cynkra.github.io/rchroma/) to set up our RAG. To my knowledge, rchroma is the only Chroma client available for R at the time of writing. Make sure you have a working Chroma installation (https://docs.trychroma.com/docs/overview/getting-started#install-manually) in client-server mode (https://docs.trychroma.com/docs/run-chroma/client-server) accessible at localhost:8000 (e.g. http://localhost:8000/api/v2). We will also use our beloved rollama for generating embeddings and for text generation using LLMs of our choice.

Load these packages,
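Based on the tools described above, the two packages we need are rchroma and rollama:

```r
library(rchroma)  # ChromaDB client for R
library(rollama)  # interface to Ollama for embeddings and text generation
```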

Make sure that your Ollama is running,

ping_ollama()  # ensure it's running
▶ Ollama (v0.15.1) is running at <http://localhost:11434>!

Make sure the Chroma server is running. In my case, I am using the client-server mode, and I check its status at http://localhost:8000/api/v2. If you are running Chroma via Docker, you may check with chroma_docker_running().

We create a Chroma client connection,

client = chroma_connect()

and check its status using heartbeat(),

heartbeat(client)
[1] 1.772429e+18

7.3.1 Setting up ChromaDB collection

Next, we create a ChromaDB collection named “my_collection”,

create_collection(client, "my_collection")
$id
[1] "0a908e88-ba2f-458e-8ef3-03913e70b355"

$name
[1] "my_collection"

$configuration_json
$configuration_json$hnsw
$configuration_json$hnsw$space
[1] "l2"

$configuration_json$hnsw$ef_construction
[1] 100

$configuration_json$hnsw$ef_search
[1] 100

$configuration_json$hnsw$max_neighbors
[1] 16

$configuration_json$hnsw$resize_factor
[1] 1.2

$configuration_json$hnsw$sync_threshold
[1] 1000


$configuration_json$spann
NULL

$configuration_json$embedding_function
NULL


$schema
$schema$defaults
$schema$defaults$string
$schema$defaults$string$fts_index
$schema$defaults$string$fts_index$enabled
[1] FALSE

$schema$defaults$string$fts_index$config
named list()


$schema$defaults$string$string_inverted_index
$schema$defaults$string$string_inverted_index$enabled
[1] TRUE

$schema$defaults$string$string_inverted_index$config
named list()



$schema$defaults$float_list
$schema$defaults$float_list$vector_index
$schema$defaults$float_list$vector_index$enabled
[1] FALSE

$schema$defaults$float_list$vector_index$config
$schema$defaults$float_list$vector_index$config$space
[1] "l2"

$schema$defaults$float_list$vector_index$config$hnsw
$schema$defaults$float_list$vector_index$config$hnsw$ef_construction
[1] 100

$schema$defaults$float_list$vector_index$config$hnsw$max_neighbors
[1] 16

$schema$defaults$float_list$vector_index$config$hnsw$ef_search
[1] 100

$schema$defaults$float_list$vector_index$config$hnsw$num_threads
[1] 28

$schema$defaults$float_list$vector_index$config$hnsw$batch_size
[1] 100

$schema$defaults$float_list$vector_index$config$hnsw$sync_threshold
[1] 1000

$schema$defaults$float_list$vector_index$config$hnsw$resize_factor
[1] 1.2





$schema$defaults$sparse_vector
$schema$defaults$sparse_vector$sparse_vector_index
$schema$defaults$sparse_vector$sparse_vector_index$enabled
[1] FALSE

$schema$defaults$sparse_vector$sparse_vector_index$config
$schema$defaults$sparse_vector$sparse_vector_index$config$embedding_function
$schema$defaults$sparse_vector$sparse_vector_index$config$embedding_function$type
[1] "unknown"


$schema$defaults$sparse_vector$sparse_vector_index$config$bm25
[1] FALSE




$schema$defaults$int
$schema$defaults$int$int_inverted_index
$schema$defaults$int$int_inverted_index$enabled
[1] TRUE

$schema$defaults$int$int_inverted_index$config
named list()



$schema$defaults$float
$schema$defaults$float$float_inverted_index
$schema$defaults$float$float_inverted_index$enabled
[1] TRUE

$schema$defaults$float$float_inverted_index$config
named list()



$schema$defaults$bool
$schema$defaults$bool$bool_inverted_index
$schema$defaults$bool$bool_inverted_index$enabled
[1] TRUE

$schema$defaults$bool$bool_inverted_index$config
named list()




$schema$keys
$schema$keys$`#document`
$schema$keys$`#document`$string
$schema$keys$`#document`$string$fts_index
$schema$keys$`#document`$string$fts_index$enabled
[1] TRUE

$schema$keys$`#document`$string$fts_index$config
named list()


$schema$keys$`#document`$string$string_inverted_index
$schema$keys$`#document`$string$string_inverted_index$enabled
[1] FALSE

$schema$keys$`#document`$string$string_inverted_index$config
named list()




$schema$keys$`#embedding`
$schema$keys$`#embedding`$float_list
$schema$keys$`#embedding`$float_list$vector_index
$schema$keys$`#embedding`$float_list$vector_index$enabled
[1] TRUE

$schema$keys$`#embedding`$float_list$vector_index$config
$schema$keys$`#embedding`$float_list$vector_index$config$space
[1] "l2"

$schema$keys$`#embedding`$float_list$vector_index$config$source_key
[1] "#document"

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw
$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$ef_construction
[1] 100

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$max_neighbors
[1] 16

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$ef_search
[1] 100

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$num_threads
[1] 28

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$batch_size
[1] 100

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$sync_threshold
[1] 1000

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$resize_factor
[1] 1.2








$metadata
NULL

$dimension
NULL

$tenant
[1] "default_tenant"

$database
[1] "default_database"

$log_position
[1] 0

$version
[1] 0

Chroma and rchroma currently use the “l2” (squared Euclidean) distance as the default distance function.

To add text documents to the collection “my_collection”, we use the add_documents() function, which involves several steps. First, we prepare the text; as an example, we use text strings from https://ollama.com/blog/embedding-models,

documents = c(
  "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old"
)

and convert the text into text embeddings via the nomic-embed-text embedding model,

embeddings = embed_text(documents, "nomic-embed-text")

The function expects an unnamed list of numeric vectors for the embeddings argument, therefore we have to convert the embeddings from a tibble into an unnamed list,

unnamed_embeddings = apply(embeddings, 2, as.numeric) |>  # tibble -> numeric matrix
  split(f = 1:length(documents)) |>  # split row-wise: one numeric vector per document
  unname()                           # drop the names that split() assigns
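As a quick sanity check of this conversion (recalling that nomic-embed-text produces 768-dimensional embeddings), each element of the resulting list should be a plain numeric vector:

```r
length(unnamed_embeddings)           # one element per document, i.e. 6
is.numeric(unnamed_embeddings[[1]])  # TRUE: a plain numeric vector
length(unnamed_embeddings[[1]])      # the embedding dimension, i.e. 768
```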

Lastly, we add these objects to the function,

add_documents(
  client,
  "my_collection",
  documents = documents,
  ids = paste0("doc", 1:length(documents)),
  embeddings = unnamed_embeddings
)
named list()

Note that the ids argument is compulsory, so we generate the IDs by pasting “doc” together with the document index, e.g. “doc1”, “doc2”, and so on.

We can view configuration details of “my_collection”,

get_collection(client, "my_collection")
$id
[1] "0a908e88-ba2f-458e-8ef3-03913e70b355"

$name
[1] "my_collection"

$configuration_json
$configuration_json$hnsw
$configuration_json$hnsw$space
[1] "l2"

$configuration_json$hnsw$ef_construction
[1] 100

$configuration_json$hnsw$ef_search
[1] 100

$configuration_json$hnsw$max_neighbors
[1] 16

$configuration_json$hnsw$resize_factor
[1] 1.2

$configuration_json$hnsw$sync_threshold
[1] 1000


$configuration_json$spann
NULL

$configuration_json$embedding_function
NULL


$schema
$schema$defaults
$schema$defaults$string
$schema$defaults$string$fts_index
$schema$defaults$string$fts_index$enabled
[1] FALSE

$schema$defaults$string$fts_index$config
named list()


$schema$defaults$string$string_inverted_index
$schema$defaults$string$string_inverted_index$enabled
[1] TRUE

$schema$defaults$string$string_inverted_index$config
named list()



$schema$defaults$float_list
$schema$defaults$float_list$vector_index
$schema$defaults$float_list$vector_index$enabled
[1] FALSE

$schema$defaults$float_list$vector_index$config
$schema$defaults$float_list$vector_index$config$space
[1] "l2"

$schema$defaults$float_list$vector_index$config$hnsw
$schema$defaults$float_list$vector_index$config$hnsw$ef_construction
[1] 100

$schema$defaults$float_list$vector_index$config$hnsw$max_neighbors
[1] 16

$schema$defaults$float_list$vector_index$config$hnsw$ef_search
[1] 100

$schema$defaults$float_list$vector_index$config$hnsw$num_threads
[1] 28

$schema$defaults$float_list$vector_index$config$hnsw$batch_size
[1] 100

$schema$defaults$float_list$vector_index$config$hnsw$sync_threshold
[1] 1000

$schema$defaults$float_list$vector_index$config$hnsw$resize_factor
[1] 1.2





$schema$defaults$sparse_vector
$schema$defaults$sparse_vector$sparse_vector_index
$schema$defaults$sparse_vector$sparse_vector_index$enabled
[1] FALSE

$schema$defaults$sparse_vector$sparse_vector_index$config
$schema$defaults$sparse_vector$sparse_vector_index$config$embedding_function
$schema$defaults$sparse_vector$sparse_vector_index$config$embedding_function$type
[1] "unknown"


$schema$defaults$sparse_vector$sparse_vector_index$config$bm25
[1] FALSE




$schema$defaults$int
$schema$defaults$int$int_inverted_index
$schema$defaults$int$int_inverted_index$enabled
[1] TRUE

$schema$defaults$int$int_inverted_index$config
named list()



$schema$defaults$float
$schema$defaults$float$float_inverted_index
$schema$defaults$float$float_inverted_index$enabled
[1] TRUE

$schema$defaults$float$float_inverted_index$config
named list()



$schema$defaults$bool
$schema$defaults$bool$bool_inverted_index
$schema$defaults$bool$bool_inverted_index$enabled
[1] TRUE

$schema$defaults$bool$bool_inverted_index$config
named list()




$schema$keys
$schema$keys$`#embedding`
$schema$keys$`#embedding`$float_list
$schema$keys$`#embedding`$float_list$vector_index
$schema$keys$`#embedding`$float_list$vector_index$enabled
[1] TRUE

$schema$keys$`#embedding`$float_list$vector_index$config
$schema$keys$`#embedding`$float_list$vector_index$config$space
[1] "l2"

$schema$keys$`#embedding`$float_list$vector_index$config$source_key
[1] "#document"

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw
$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$ef_construction
[1] 100

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$max_neighbors
[1] 16

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$ef_search
[1] 100

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$num_threads
[1] 28

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$batch_size
[1] 100

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$sync_threshold
[1] 1000

$schema$keys$`#embedding`$float_list$vector_index$config$hnsw$resize_factor
[1] 1.2






$schema$keys$`#document`
$schema$keys$`#document`$string
$schema$keys$`#document`$string$fts_index
$schema$keys$`#document`$string$fts_index$enabled
[1] TRUE

$schema$keys$`#document`$string$fts_index$config
named list()


$schema$keys$`#document`$string$string_inverted_index
$schema$keys$`#document`$string$string_inverted_index$enabled
[1] FALSE

$schema$keys$`#document`$string$string_inverted_index$config
named list()






$metadata
NULL

$dimension
[1] 768

$tenant
[1] "default_tenant"

$database
[1] "default_database"

$log_position
[1] 0

$version
[1] 0

7.3.2 Retrieving information from ChromaDB collection

Before we try out basic RAG, let’s look at how to retrieve/query the collection. We set up the query text and its embedding,

query_text = "What animals are llamas related to?"
query_embedding = embed_text(query_text, "nomic-embed-text:latest") |> as.numeric()

Then, we query “my_collection”. Note that we call rchroma::query() with its namespace because its name clashes with rollama’s query() function,

result = rchroma::query(
  client,
  "my_collection",
  query_embeddings = list(query_embedding),
  n_results = 3
)

where n_results is the number of results to return per query; here, we set it to 3.

We can view these three results and their distances,

result$documents[[1]]
[[1]]
[1] "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels"

[[2]]
[1] "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall"

[[3]]
[1] "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands"

result$distances[[1]]
[[1]]
[1] 226.268

[[2]]
[1] 291.6361

[[3]]
[1] 396.8216

Specifically, we are interested in the first result,

output = result$documents[[1]][[1]]
output
[1] "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels"

7.3.3 Basic RAG

Now, we integrate the returned query output in our prompt. We can write our basic prompt like this,

q_text = paste("Based on the following information:",
               output,
               "\nAnswer the query:",
               query_text,
               "\nUse only the information provided. Do not reference external knowledge, assumptions, or sources not explicitly included in the given text. If the information is insufficient to answer the query, state that clearly.")
# Prompt writing with the help of qwen3:14b
cat(q_text)
Based on the following information: Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels 
Answer the query: What animals are llamas related to? 
Use only the information provided. Do not reference external knowledge, assumptions, or sources not explicitly included in the given text. If the information is insufficient to answer the query, state that clearly.

Then, we use the prompt to query the LLM, e.g. llama3.2:3b in our case,

rollama::query(q_text, model = "llama3.2:3b",
      screen = FALSE, output = "text") |> cat()
According to the given information, llamas are members of the camelid family and are therefore related to:

1. Vicunas
2. Camels

So, instead of spitting out the original text “Llamas are members of the camelid family meaning they’re pretty closely related to vicunas and camels”, the LLM synthesizes the information and answers the query.

7.4 RAG from documents

The following are key RAG stages according to https://en.wikipedia.org/wiki/Retrieval-augmented_generation#RAG_key_stages:

  1. Indexing
  2. Retrieval
  3. Augmentation
  4. Generation

These are the ingredients for a RAG pipeline. We are going to use the pipeline outlined in this IBM webpage: https://www.ibm.com/architectures/patterns/genai-rag. The conceptual architecture provided on the webpage is simplified below:

[figure here – to be added]

7.4.1 Indexing

Prepare text data

  • PDF, HTML etc into text
  • Chunk/split the text (using an appropriate chunking method) into smaller texts of a specific length
  • This process allows retrieval of relevant text chunks instead of whole text

Generate embeddings

  • Chunked texts into embeddings

Store in vector DB

  • Store these embeddings in the database
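The indexing steps above can be sketched in R. The chunk_text() helper below is my own illustrative function (not from any package), using a simple fixed-size word window with overlap, one of many possible chunking methods:

```r
# Split a long text into overlapping chunks of roughly `chunk_size` words.
# Overlapping by `overlap` words keeps sentences that are cut at a chunk
# boundary intact in at least one chunk.
chunk_text = function(text, chunk_size = 100, overlap = 20) {
  words = unlist(strsplit(text, "\\s+"))
  if (length(words) <= chunk_size) return(text)
  starts = seq(1, length(words), by = chunk_size - overlap)
  sapply(starts, function(s) {
    paste(words[s:min(s + chunk_size - 1, length(words))], collapse = " ")
  })
}

chunks = chunk_text(paste(rep("lorem ipsum", 200), collapse = " "),
                    chunk_size = 50, overlap = 10)
length(chunks)  # several short chunks instead of one long text
```

Each chunk would then be embedded with embed_text() and stored with add_documents(), exactly as in Section 7.3.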

7.4.2 Retrieval

  • query into embedding
  • use embedding to search relevant records/vectors
  • retrieve relevant records

7.4.3 Augmentation

  • along with query, feed the records as context to LLM
  • write suitable prompt for query + relevant texts

7.4.4 Generation

  • LLM generates a response based on the provided context
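Putting the four stages together, a minimal end-to-end helper might look like the sketch below. It assumes the setup from Section 7.3 (a running Ollama, a connected Chroma client, and a populated collection); the function name rag_query() and the prompt wording are illustrative choices, not part of rchroma or rollama:

```r
# Retrieval + augmentation + generation in one helper (illustrative sketch).
# Assumes library(rchroma), library(rollama), a connected `client`, and a
# collection already populated with document embeddings.
rag_query = function(client, collection, query_text,
                     embed_model = "nomic-embed-text",
                     llm = "llama3.2:3b", n_results = 3) {
  # Retrieval: embed the query and fetch the closest documents
  query_embedding = embed_text(query_text, embed_model) |> as.numeric()
  result = rchroma::query(client, collection,
                          query_embeddings = list(query_embedding),
                          n_results = n_results)
  context = paste(unlist(result$documents[[1]]), collapse = "\n")

  # Augmentation: combine the retrieved context with the query in a prompt
  prompt = paste("Based on the following information:", context,
                 "\nAnswer the query:", query_text,
                 "\nUse only the information provided.")

  # Generation: let the LLM answer from the augmented prompt
  rollama::query(prompt, model = llm, screen = FALSE, output = "text")
}

# Example usage:
# rag_query(client, "my_collection", "What animals are llamas related to?")
```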

7.5 RAG with citations

7.6 Deep-dive: How it works

In progress …