7  Retrieval Augmented Generation

LLMs generally have limited knowledge; they only know what was ‘taught’ to them through their training texts. Any new information not included in those texts is beyond their generation capability, and it is impractical to retrain an LLM every time we want it to generate text that uses new information. A simple workaround is to provide the necessary information directly in our queries. However, if we do not know exactly what we are looking for, or the required information is buried somewhere in a thick technical manual, a more practical approach is Retrieval-Augmented Generation (RAG).

In this chapter, we will explore how to perform RAG by pairing our own knowledge, stored in a vector database, with LLMs. Most resources I have found so far are in Python; here, I aim to achieve the same using R.

7.1 About

[deepseek-r1:7b]

RAG is an approach that enhances LLMs by augmenting their capabilities with document retrieval systems. This integration allows for more accurate, relevant, and contextually informed responses compared to relying solely on the model’s internal knowledge.

How It Works:

  1. Document Retrieval: When a query is received, RAG first retrieves relevant documents or passages from an external source (e.g., a database) using techniques like embedding-based search.
  2. Context Integration: The retrieved documents are then provided to the LLM as context, enabling the model to generate responses informed by both its internal knowledge and the specific retrieved information.

Key Components:

  • Embedding-Based Retrieval: Documents are converted into vector representations, allowing efficient matching with query vectors through similarity measures.
  • Efficient Search: Techniques like vector databases or similarity search quickly identify relevant documents, ensuring timely retrieval without significant computational overhead.

Why Use RAG?

  1. Enhanced Accuracy: By incorporating up-to-date or specific data, RAG reduces reliance on the model’s potentially outdated or generic knowledge.
  2. Handling Specialized Queries: RAG is particularly effective for domain-specific questions requiring detailed information beyond the model’s training data.
  3. Efficiency: Focusing the model on a subset of relevant documents can optimize performance and reduce computational demands.

Use Cases:

  • Customer service chatbots needing access to product details or customer records.
  • Healthcare applications where accurate medical guidelines are crucial.

RAG differs from fine-tuning (which involves retraining the model) or prompt engineering (which relies on phrasing) by adding external context without altering the model itself. RAG effectively bridges the gap between LLM general knowledge and specific, detailed information, offering more accurate and relevant responses through augmented input.

7.2 Basic RAG with Chroma

We learned about text embedding and how to use it to find texts related to our query in Chapter 5. Now, in the context of RAG, let’s go back to the text in Section 5.2.1 (https://ollama.com/blog/embedding-models). Suppose we now want to ask a query, find the related text, and then let an LLM answer the query based on the information (the retrieved text) given to it. We will use Chroma (https://www.trychroma.com/) and rchroma (https://cynkra.github.io/rchroma/) for that. Make sure you have a working Chroma installation (client-server mode https://docs.trychroma.com/docs/run-chroma/client-server, Docker https://docs.trychroma.com/production/containers/docker) accessible at localhost:8000 (e.g. http://localhost:8000/api/v2). We will also use our beloved rollama for generating embeddings and for text generation with the LLMs of our choice.

Load these packages; we need rchroma to talk to the Chroma server and rollama for embeddings and text generation,
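
library(rchroma)  # client for the Chroma vector database
library(rollama)  # embeddings and text generation via Ollama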

Make sure that your Ollama is running,

ping_ollama()  # ensure it's running
▶ Ollama (v0.6.7) is running at <http://localhost:11434>!

Make sure the Chroma server is running. In my case, I am using the client-server mode, and I check its status at http://localhost:8000/api/v2. If you are running Chroma via Docker, you can check with chroma_docker_running().
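
For the Docker route, that check is a single call (not run here, since I am on client-server mode),

chroma_docker_running()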

We create a Chroma client connection,

client = chroma_connect()

and check its status using heartbeat(), which returns a server heartbeat timestamp in nanoseconds; any numeric value means the connection is alive,

heartbeat(client)
[1] 1.747906e+18

7.2.1 Setting up ChromaDB collection

Next, we create a ChromaDB collection named “my_collection”,

create_collection(client, "my_collection")
$id
[1] "4d2a6726-2c57-4b20-bb26-f1a52dc2079f"

$name
[1] "my_collection"

$configuration_json
$configuration_json$hnsw
$configuration_json$hnsw$space
[1] "l2"

$configuration_json$hnsw$ef_construction
[1] 100

$configuration_json$hnsw$ef_search
[1] 100

$configuration_json$hnsw$max_neighbors
[1] 16

$configuration_json$hnsw$resize_factor
[1] 1.2

$configuration_json$hnsw$sync_threshold
[1] 1000


$configuration_json$spann
NULL

$configuration_json$embedding_function
NULL


$metadata
NULL

$dimension
NULL

$tenant
[1] "default_tenant"

$database
[1] "default_database"

$log_position
[1] 0

$version
[1] 0
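
If your rchroma version provides a list_collections() wrapper for the corresponding Chroma endpoint (an assumption on my part), we can also list the collections known to the server to confirm that it was created,

list_collections(client)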

Chroma, and therefore rchroma, currently uses “l2” (squared Euclidean distance) as the default distance function for a collection’s HNSW index.
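
If you prefer cosine distance, Chroma lets you pick the distance function per collection through the hnsw:space metadata key (valid values are “l2”, “ip” and “cosine”). Assuming create_collection() passes a metadata list through to the Chroma API, a sketch would look like this,

create_collection(
  client,
  "my_cosine_collection",
  metadata = list("hnsw:space" = "cosine")  # assumes metadata is forwarded to Chroma
)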

To add text documents to the collection “my_collection”, we will use the add_documents() function. Using it involves several steps. First, we prepare the texts; here we use the example strings from https://ollama.com/blog/embedding-models,

documents = c(
  "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old"
)

and convert the texts into embeddings with the nomic-embed-text embedding model,

embeddings = embed_text(documents, "nomic-embed-text")

add_documents() expects an unnamed list of numeric vectors for its embeddings argument, so we have to convert the embeddings from a tibble into an unnamed list,

# embed_text() returns a tibble (one row per document); coerce it to a numeric
# matrix and split it row-wise into an unnamed list of plain numeric vectors
unnamed_embeddings = apply(embeddings, 2, as.numeric) |> 
  split(f = 1:length(documents)) |> unname()

Lastly, we pass these objects to add_documents(),

add_documents(
  client,
  "my_collection",
  documents = documents,
  ids = paste0("doc", 1:length(documents)),
  embeddings = unnamed_embeddings
)
named list()

Note that the ids argument is compulsory, so we generate the ids by pasting “doc” together with 1:length(documents), i.e. “doc1”, “doc2”, and so on.

We can view configuration details of “my_collection”,

get_collection(client, "my_collection")
$id
[1] "4d2a6726-2c57-4b20-bb26-f1a52dc2079f"

$name
[1] "my_collection"

$configuration_json
$configuration_json$hnsw
$configuration_json$hnsw$space
[1] "l2"

$configuration_json$hnsw$ef_construction
[1] 100

$configuration_json$hnsw$ef_search
[1] 100

$configuration_json$hnsw$max_neighbors
[1] 16

$configuration_json$hnsw$resize_factor
[1] 1.2

$configuration_json$hnsw$sync_threshold
[1] 1000


$configuration_json$spann
NULL

$configuration_json$embedding_function
NULL


$metadata
NULL

$dimension
[1] 768

$tenant
[1] "default_tenant"

$database
[1] "default_database"

$log_position
[1] 0

$version
[1] 0

7.2.2 Retrieving information from ChromaDB collection

Before we try out basic RAG, let’s look at how to retrieve from (query) the collection. We set up the query text and its embedding, using the same embedding model (nomic-embed-text) that we used for the documents so that the vectors are comparable,

query_text = "What animals are llamas related to?"
query_embedding = embed_text(query_text, "nomic-embed-text:latest") |> as.numeric()

Then, we query “my_collection”. Note that we call rchroma::query() with its explicit namespace because both rchroma and rollama export a query() function,

result = rchroma::query(
  client,
  "my_collection",
  query_embeddings = list(query_embedding),
  n_results = 3
)

where n_results is the number of results to return per query; we set it to 3.
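
If you want to see everything that rchroma::query() returns (ids, documents, distances, and so on), a quick way is to inspect the structure of the result,

str(result, max.level = 2)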

We can view these three results and their distances,

result$documents[[1]]
[[1]]
[1] "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels"

[[2]]
[1] "Llamas are vegetarians and have very efficient digestive systems"

[[3]]
[1] "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight"
result$distances[[1]]
[[1]]
[1] 98.88392

[[2]]
[1] 194.9566

[[3]]
[1] 214.2481

Specifically, we are interested in the first result, which has the smallest distance and is therefore the closest match to our query,

output = result$documents[[1]][[1]]
output
[1] "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels"

7.2.3 Basic RAG

Now, we integrate the returned query output in our prompt. We can write our basic prompt like this,

q_text = paste("Based on the following information:",
               output,
               "\nAnswer the query:",
               query_text,
               "\nUse only the information provided. Do not reference external knowledge, assumptions, or sources not explicitly included in the given text. If the information is insufficient to answer the query, state that clearly.")
# Prompt writing with the help of qwen3:14b
cat(q_text)
Based on the following information: Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels 
Answer the query: What animals are llamas related to? 
Use only the information provided. Do not reference external knowledge, assumptions, or sources not explicitly included in the given text. If the information is insufficient to answer the query, state that clearly.

Then, we use the prompt to query the LLM, in our case llama3.2:3b,

rollama::query(q_text, model = "llama3.2:3b",
      screen = F, output = "text") |> cat()
Based on the provided information, it can be concluded that:

Llamas are related to vicunas and camels.

However, the information does not provide a comprehensive list of all animals that llamas are related to. It only mentions that they are part of the camelid family, but does not specify other relatives.

So, instead of spitting out the original text “Llamas are members of the camelid family meaning they’re pretty closely related to vicunas and camels”, the LLM synthesizes the information and answers the query.
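
Here we passed only the closest passage to the model. A natural refinement, reusing what we have already retrieved, is to hand the model all of the returned passages as context; a minimal sketch,

# combine the three retrieved passages into a single context block
context = paste(unlist(result$documents[[1]]), collapse = "\n")
q_text_all = paste("Based on the following information:",
                   context,
                   "\nAnswer the query:",
                   query_text,
                   "\nUse only the information provided. Do not reference external knowledge, assumptions, or sources not explicitly included in the given text. If the information is insufficient to answer the query, state that clearly.")
rollama::query(q_text_all, model = "llama3.2:3b",
               screen = FALSE, output = "text") |> cat()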

7.3 Deep-dive: How it works

In progress …