5 Embedding Generation
Embedding is a crucial component of LLMs: it converts words into numerical vectors that preserve semantic meaning, capture relationships, and enable contextual understanding for machines. Embeddings are used in retrieval-augmented generation (RAG, see: https://ollama.com/blog/embedding-models) and supervised learning (see: https://jbgruber.github.io/rollama/articles/text-embedding.html).
In this chapter, we will learn how to transform raw text data into numerical representations, and we will investigate the role of similarity indices in the context of embeddings.
5.1 Preliminaries
Load these packages,
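The package list itself is not shown here; judging from the functions used in this chapter (ping_ollama(), embed_text(), cosine(), arrange()), the following should cover it:
library(rollama) # ping_ollama(), embed_text()
library(lsa)     # cosine(); also attached again in Section 5.2.2
library(dplyr)   # arrange(); called below via dplyr::, so optional here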
Make sure that your Ollama is running,
ping_ollama() # ensure it's running
▶ Ollama (v0.5.1) is running at <http://localhost:11434>!
5.2 Understanding numerical embedding
5.2.1 Text to embedding
The example and texts were taken from https://ollama.com/blog/embedding-models. To build intuition, let’s retrieve the relevant text (i.e. the text closest in semantic meaning) for a given query.
We convert the strings into vectors via the nomic-embed-text model.
docs = c(
"Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels",
"Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
"Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
"Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
"Llamas are vegetarians and have very efficient digestive systems",
"Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old"
)
nums = embed_text(docs, "nomic-embed-text")
nomic-embed-text generates a 768-dimensional vector for each of the six text strings.
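As a sanity check (assuming, as the printed output below suggests, that embed_text() returns one row per input text and one column per dimension), we can inspect the shape of the result:
dim(nums) # expect 6 rows (texts) and 768 columns (dimensions)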
For example “Llamas are members of the camelid family meaning they’re pretty closely related to vicunas and camels” becomes
docs[1]
[1] "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels"
nums[1,] |> round(2) |> as.numeric() |> head(24) # show only the first 24 of 768 values
[1] 0.55 0.35 -3.61 -1.24 0.44 -0.51 -1.70 -1.54 -0.69 -0.37 -0.55 1.28
[13] 1.49 0.42 0.38 -0.48 -1.03 -0.55 1.03 0.51 -1.85 0.79 0.66 -0.23
and “Llamas are vegetarians and have very efficient digestive systems” becomes
docs[5]
[1] "Llamas are vegetarians and have very efficient digestive systems"
nums[5,] |> round(2) |> as.numeric() |> head(24) # again, only the first 24 of 768 values
[1] 0.03 1.86 -3.79 -0.37 0.44 -0.65 -1.73 -0.90 -1.30 0.02 0.49 1.28
[13] 1.11 0.85 0.59 -0.22 -0.32 -1.05 0.78 0.74 -1.58 1.00 0.29 -0.65
We will look at the use of numerical embeddings again in Chapter 6, in the context of text classification.
5.2.2 Similarity index
Now, given a query, how can we find the text strings related to it, based only on the numerical values?
q_text1 = "What animals are llamas related to?"
q_text2 = "How long llamas can live?"
For the first query, “What animals are llamas related to?”, we transform it into an embedding and combine it with the embeddings that we obtained earlier,
q_num = embed_text(q_text1, "nomic-embed-text") # turn the query into a numerical vector
mat = rbind(as.numeric(q_num), as.matrix(nums)) # bind the query embedding with the document embeddings
mat = t(mat) # transpose: cosine() from lsa expects one vector per column
mat |> round(2) |> head()
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
dim_1 0.33 0.55 0.65 0.97 0.05 0.03 0.55
dim_2 0.25 0.35 -0.02 1.28 0.73 1.86 -0.13
dim_3 -3.44 -3.61 -4.03 -3.66 -3.90 -3.79 -3.93
dim_4 -0.93 -1.24 -0.80 -0.77 -1.20 -0.37 -1.14
dim_5 0.68 0.44 0.33 -1.24 -0.30 0.44 -0.65
dim_6 0.04 -0.51 0.17 0.68 0.72 -0.65 1.10
Then we calculate the cosine similarity between our query and the other text strings, using cosine() from the lsa package. In mat, the first column is our query and columns 2 to 7 are the six documents. We combine the similarity scores with the texts to make the link between text strings and their embeddings easy to read,
library(lsa) # for cosine similarity
cos_sim = cosine(mat) # cosine similarities between the vectors
df = data.frame(Text = c(q_text1, docs), Similarity = cos_sim) # combine texts with similarity scores
df = df |> dplyr::arrange(desc(Similarity.1)) # Similarity.1 holds similarities to our query
df
# A tibble: 7 × 8
Text Similarity.1 Similarity.2 Similarity.3 Similarity.4 Similarity.5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 What animals… 1 0.899 0.759 0.725 0.777
2 Llamas are m… 0.899 1 0.754 0.715 0.749
3 Llamas are v… 0.800 0.816 0.753 0.718 0.802
4 Llamas weigh… 0.777 0.749 0.763 0.811 1
5 Llamas were … 0.759 0.754 1 0.721 0.763
6 Llamas can g… 0.725 0.715 0.721 1 0.811
7 Llamas live … 0.724 0.723 0.786 0.793 0.798
# ℹ 2 more variables: Similarity.6 <dbl>, Similarity.7 <dbl>
which shows that “Llamas are members of the camelid family meaning they’re pretty closely related to vicunas and camels” is the most similar to our query (similarity = 0.899).
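Under the hood, the cosine similarity of two vectors x and y is sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))). As a quick sketch (cos_manual is our own throwaway helper, not part of lsa), we can reproduce the 0.899 directly:
cos_manual = function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))) # cosine of the angle between x and y
cos_manual(mat[, 1], mat[, 2]) # query vs. docs[1]; should match cos_sim[1, 2] = 0.899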
For the second query, “How long llamas can live?”, we repeat the same steps,
q_num = embed_text(q_text2, "nomic-embed-text")
mat = rbind(as.numeric(q_num), as.matrix(nums))
mat = t(mat)
mat |> round(2) |> head()
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
dim_1 0.88 0.55 0.65 0.97 0.05 0.03 0.55
dim_2 -0.11 0.35 -0.02 1.28 0.73 1.86 -0.13
dim_3 -3.62 -3.61 -4.03 -3.66 -3.90 -3.79 -3.93
dim_4 -0.94 -1.24 -0.80 -0.77 -1.20 -0.37 -1.14
dim_5 -0.50 0.44 0.33 -1.24 -0.30 0.44 -0.65
dim_6 0.42 -0.51 0.17 0.68 0.72 -0.65 1.10
cos_sim = cosine(mat)
df = data.frame(Text = c(q_text2, docs), Similarity = cos_sim)
df = df |> dplyr::arrange(desc(Similarity.1))
df
# A tibble: 7 × 8
Text Similarity.1 Similarity.2 Similarity.3 Similarity.4 Similarity.5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 How long lla… 1 0.696 0.742 0.782 0.768
2 Llamas live … 0.915 0.723 0.786 0.793 0.798
3 Llamas can g… 0.782 0.715 0.721 1 0.811
4 Llamas weigh… 0.768 0.749 0.763 0.811 1
5 Llamas were … 0.742 0.754 1 0.721 0.763
6 Llamas are v… 0.726 0.816 0.753 0.718 0.802
7 Llamas are m… 0.696 1 0.754 0.715 0.749
# ℹ 2 more variables: Similarity.6 <dbl>, Similarity.7 <dbl>
The ranking correctly identifies “Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old” as the most similar text (similarity = 0.915).
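Since this embed-then-rank pattern recurs (and is the core of retrieval in RAG), it can be wrapped into a small function. Below is a minimal sketch (retrieve_docs() is a hypothetical helper of our own, not part of rollama):
# hypothetical helper: rank documents by cosine similarity to a query
retrieve_docs = function(query, docs, doc_embeds, model = "nomic-embed-text") {
  q = as.numeric(embed_text(query, model)) # embed the query
  sims = apply(as.matrix(doc_embeds), 1,  # one similarity per document row
               function(d) sum(q * d) / (sqrt(sum(q^2)) * sqrt(sum(d^2))))
  docs[order(sims, decreasing = TRUE)] # most similar first
}
retrieve_docs("How long llamas can live?", docs, nums)[1] # top match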
What you have learned in this subsection will help you understand RAG in Chapter 7.
5.3 Deep-dive: How it works
In progress …