1 Preliminaries

The following examples are ones that I think are relevant to academicians and researchers in the medical and health sciences. You may try other use cases once you grasp the basics of using rollama (or LLMs in general).

1.1 Libraries

library(rollama)
library(purrr)
library(tibble)

1.2 Presets

ping_ollama()  # ensure it's running
list_models()$name
[1] "qwen2.5-coder:latest"                                                          
[2] "hf.co/RichardErkhov/mesolitica_-_malaysian-llama-3-8b-instruct-16k-gguf:Q4_K_M"
[3] "moondream:latest"                                                              
[4] "moondream:latest"                                                              
[5] "llama3.2:latest"                                                               
[6] "nomic-embed-text:latest"                                                       
[7] "llama3.2:3b"                                                                   
model_text = "llama3.2"
model_image = "moondream"
model_embed = "nomic-embed-text"
model_malaysia = "hf.co/RichardErkhov/mesolitica_-_malaysian-llama-3-8b-instruct-16k-gguf:Q4_K_M"
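
If one of these models is not yet available locally, it can be downloaded first with rollama's `pull_model()`. A minimal sketch, assuming a running Ollama server (stripping tags with `sub()` is just one way to compare names; downloads can be large):

```r
needed <- c(model_text, model_image, model_embed)
installed <- sub(":.*", "", list_models()$name)  # drop tags like ":latest"
for (m in setdiff(needed, installed)) pull_model(m)
```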

2 Interpret statistical results

Run linear regression

lm_model = lm(mtcars$mpg ~ mtcars$wt)
summary(lm_model)

Call:
lm(formula = mtcars$mpg ~ mtcars$wt)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Save the output into a suitable text format

text_data = summary(lm_model) |> capture.output() |> paste(collapse = "\n")
text_data |> cat()

Call:
lm(formula = mtcars$mpg ~ mtcars$wt)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Put everything together nicely in a tibble

q_text = tribble(
  ~role,    ~content,
  "system", "Interpret the results from the given text:",
  "user",   text_data
)
q_text
# A tibble: 2 × 2
  role   content                                                                
  <chr>  <chr>                                                                  
1 system "Interpret the results from the given text:"                           
2 user   "\nCall:\nlm(formula = mtcars$mpg ~ mtcars$wt)\n\nResiduals:\n    Min …

Query to interpret the statistical result

query(q_text, model_text, output = "text") |> cat()
**Regression Analysis Results**

The given text presents the output from a linear regression analysis using the `lm()` function in R, where the dependent variable is `mpg` (miles per gallon) and the independent variable is `wt` (weight of the car).

**Key Takeaways:**

1. **Model Summary**: The model has a strong positive relationship between `mpg` and `wt`, with an adjusted R-squared value of 0.7446, indicating that about 74.46% of the variation in `mpg` is explained by `wt`.
2. **Coefficients**: The estimated coefficient for `wt` is -5.3445, which means that for every additional unit increase in weight (e.g., from 2000 to 3000 pounds), `mpg` decreases by approximately 5.35 miles per gallon.
3. **Standard Error and t-value**: The standard error of the estimate is 0.5591, and the t-value is -9.559, indicating that the coefficient for `wt` is statistically significant (p < 0.001).
4. **Residual Analysis**: The residual standard error is 3.046, which measures the spread of the residuals around the fitted line.
5. **Multiple R-squared and F-statistic**: The multiple R-squared value indicates that about 75.28% of the variation in `mpg` can be explained by a linear combination of `wt`. The F-statistic (91.38) is also significant, indicating that the model provides a good fit to the data.

**Interpretation and Suggestions:**

* As expected, there is a strong negative relationship between `wt` and `mpg`, suggesting that heavier cars tend to have lower fuel efficiency.
* However, the adjusted R-squared value is slightly less than 1, indicating that other variables may be contributing to the variation in `mpg`.
* It would be interesting to explore additional predictors, such as engine size or transmission type, to see if they can explain more of the variation in `mpg`.

Overall, this regression analysis provides strong evidence for a linear relationship between weight and miles per gallon, but suggests that there may be other factors at play that could improve the model's explanatory power.

3 Interpret plot

Plot MPG from the dataset

hist(mtcars$mpg, main = "Histogram of MPG", xlab = "MPG")

Save as “img.png”

image_name = "img.png"
png(image_name)
hist(mtcars$mpg, main = "Histogram of MPG", xlab = "MPG")
dev.off()
png 
  2 

Query the image

query("Describe the given image:", model_image, images = image_name,
      output = "text") |> cat()

The image features a graph with a bar chart, displaying the frequency of different MPGs. The x-axis represents various MPG values, while the y-axis shows their corresponding frequencies. There are multiple bars on the graph, each representing an MPG value and its frequency. The bars are arranged in ascending order from left to right, allowing for easy comparison between the different MPG values.
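
Since purrr is already loaded, several images can be described in one go. A sketch; `"img2.png"` is a hypothetical second file for illustration:

```r
image_files <- c("img.png", "img2.png")  # "img2.png" is hypothetical
descriptions <- map_chr(
  image_files,
  \(f) query("Describe the given image:", model_image,
             images = f, output = "text")
)
```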

4 Write knowledge questions

Write the LLM role and user query

system_instruction = "
You are an expert in public health. Currently you are tasked to write questions 
according to given query. Give answers to each question. Take into account 
target population of the questions.
"
user_query = "
Write 10 knowledge questions about malaria for general public with low education level.
"

Combine in a tibble

q_text = tribble(
  ~role,    ~content,
  "system", system_instruction,
  "user",   user_query
)
q_text
# A tibble: 2 × 2
  role   content                                                                
  <chr>  <chr>                                                                  
1 system "\nYou are an expert in public health. Currently you are tasked to wri…
2 user   "\nWrite 10 knowledge questions about malaria for general public with …

Ask the LLM. Look at Question 2: what's wrong with it? This is why we must carefully evaluate text generated by an LLM.

query(q_text, model_text, output = "text") |> cat()
Here are 10 knowledge questions about malaria suitable for a general public with low education level:

1. What is malaria?
a) A type of food poisoning
b) A disease that affects the eyes
c) A sickness caused by a tiny parasite that is spread by mosquitoes
d) A type of skin cancer

Answer: c) A sickness caused by a tiny parasite that is spread by mosquitoes

2. Which mosquito is most likely to spread malaria?
a) Housefly
b) Mosquito
c) Sand fly
d) Tsetse fly

Answer: b) Mosquito

3. What are the symptoms of malaria?
a) Fever, chills, and headache
b) Diarrhea, vomiting, and stomach pain
c) Cough, runny nose, and sore throat
d) Muscle weakness, joint pain, and fatigue

Answer: a) Fever, chills, and headache

4. How is malaria typically spread?
a) Through touch or contact with an infected person
b) Through eating contaminated food or water
c) Through the bite of an infected mosquito
d) Through breathing in infected air

Answer: c) Through the bite of an infected mosquito

5. Can anyone get malaria?
a) No, it only affects children and pregnant women
b) Yes, but mostly people who live in tropical areas
c) Only people with weak immune systems
d) Everyone can get malaria if bitten by an infected mosquito

Answer: d) Everyone can get malaria if bitten by an infected mosquito

6. How is malaria diagnosed?
a) By looking at the patient's symptoms and medical history
b) By performing a blood test to check for the parasite
c) By examining the patient's stool or urine
d) By doing a physical examination of the patient's body

Answer: b) By performing a blood test to check for the parasite

7. What is the treatment for malaria?
a) Taking antibiotics
b) Drinking plenty of water and resting
c) Taking antimalarial medication such as chloroquine or artemisinin
d) Getting a vaccine to prevent it from happening again

Answer: c) Taking antimalarial medication such as chloroquine or artemisinin

8. Can malaria be prevented?
a) Yes, by wearing long-sleeved clothes and applying insecticide
b) No, it's only a matter of luck if you get bitten by an infected mosquito
c) Only pregnant women can prevent malaria with special medication
d) By eating more fruits and vegetables

Answer: a) Yes, by wearing long-sleeved clothes, applying insecticide, and taking preventive measures such as using bed nets

9. How can I protect myself from mosquitoes that may carry malaria?
a) Wear dark colors to blend in with the surroundings
b) Avoid going outside during peak mosquito hours
c) Use a net around your bed to keep mosquitoes away
d) Use insecticide on your skin and clothes

Answer: c) Use a net around your bed to keep mosquitoes away and b) Avoid going outside during peak mosquito hours

10. Who is most at risk of getting malaria?
a) People who live in urban areas
b) Children under 5 years old
c) Pregnant women
d) All of the above

Answer: d) All of the above, especially those living in tropical and subtropical regions
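
Careful evaluation is easier when the output does not change between runs. One option is to lower the sampling temperature and fix the seed via `model_params`; a sketch, assuming your Ollama version honours the `temperature` and `seed` options:

```r
query(q_text, model_text, output = "text",
      model_params = list(temperature = 0, seed = 123)) |> cat()
```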

5 Write knowledge questions in Malay

Write the LLM role and user query

system_instruction = "
Anda pakar dalam kesihatan awam. Anda ditugaskan untuk menulis soalan. 
Beri jawapan kepada setiap soalan yang anda tulis. Ambil kira populasi sasaran soalan.
"
user_query = "
Tulis 10 soalan pengetahuan tentang demam denggi untuk orang awam.
"

At the moment, you have to simplify the instructions for this Malaysian LLM, as it is still in its infancy. You will also notice that it understands English, as it was based on Llama 3 (note the model's name). You can also explore the MaLLaM model (https://mesolitica.com/mallam).

Combine in a tibble

q_text = tribble(
  ~role,    ~content,
  "system", system_instruction,
  "user",   user_query
)
q_text
# A tibble: 2 × 2
  role   content                                                                
  <chr>  <chr>                                                                  
1 system "\nAnda pakar dalam kesihatan awam. Anda ditugaskan untuk menulis soal…
2 user   "\nTulis 10 soalan pengetahuan tentang demam denggi untuk orang awam.\…

Ask the Malaysian LLM

query(q_text, model_malaysia, output = "text") |> cat()

Sudah tentu, saya boleh membantu. Berikut adalah beberapa soalan dan jawapan mengenai demam denggi:

1. Apakah demam denggi?
Demam denggi ialah penyakit yang disebabkan oleh jangkitan virus Denggi.

2. Bagaimana demam denggi merebak?
Demam denggi merebak melalui gigitan nyamuk Aedes aegypti yang membawa virus Denggi.

3. Apakah gejala demam denggi?
Gejala awal demam denggi termasuk demam tinggi, sakit kepala, ruam kulit dan loya-loya. Gejala lain termasuk keletihan, muntah-muntah, dan gangguan pernafasan.

4. Apakah faktor risiko jangkitan demam denggi?
Faktor risiko jangkitan demam denggi ialah tinggal di kawasan yang terdapat banyak nyamuk Aedes aegypti, seperti kawasan bandar atau pinggir bandar.

5. Bagaimana cara mencegah penularan demam denggi?
Cara mencegah penularan demam denggi termasuk mengelakkan gigitan nyamuk dengan memakai pakaian yang menutupi kulit, menggunakan repelan serangga, dan membersihkan tempat pembiakan nyamuk seperti kolam atau bekas air yang bertakung.

6. Bagaimana cara mengenal pasti demam denggi?
Demam tinggi adalah salah satu gejala utama demam denggi. Gejala lain termasuk ruam kulit, sakit kepala, dan gangguan pernafasan.

7. Adakah rawatan untuk demam denggi wujud?
Ya, terdapat beberapa jenis ubat yang boleh digunakan untuk merawat demam denggi, seperti parasetamol untuk mengurangkan demam dan ubat anti-radang untuk meredakan gejala.

8. Apakah risiko jangkitan semula demam denggi?
Jika anda pernah menghidap demam denggi sebelum ini, risiko jangkitan semula boleh meningkat jika tidak berhati-hati dalam menjaga kebersihan dan mencegah gigitan nyamuk.

9. Bagaimana cara menguruskan gejala demam denggi?
Adalah penting untuk memantau gejala anda dan segera mencari bantuan perubatan jika mengalami sakit kepala yang teruk, gangguan pernafasan, atau tanda-tanda komplikasi lain.

10. Adakah vaksin untuk demam denggi?
Pada masa ini, tiada vaksin berkesan tersedia untuk melindungi daripada jangkitan demam denggi. Walau bagaimanapun, penyelidikan dan pembangunan terus dijalankan untuk mencari penyelesaian yang lebih baik dalam mencegah dan mengawal demam denggi.

Saya harap jawapan ini membantu! Jika anda memerlukan maklumat lanjut, sila beritahu saya.

6 R coding

Specify your coding problem

q_text = "
Show how to perform logistic regression analysis in R programming language.
"

Ask for help. Here we try a larger model, qwen2.5-coder, a 7B model.

model_coder = "qwen2.5-coder"
query(q_text, model_coder, output = "text") |> cat()
To perform logistic regression analysis in R, you can follow these steps:

1. Load the necessary libraries:
```R
install.packages("ISLR")
library(ISLR)
```

2. Load your dataset:
```R
data <- read.csv("your_dataset.csv")
```
Replace "your_dataset.csv" with the name of your CSV file.

3. Prepare your data (if needed):
```R
# Assuming that the first column is the target variable and the rest are features.
X <- as.matrix(data[, -1])
y <- as.factor(data[, 1])

# Splitting data into training set and test set
set.seed(123) # for reproducibility
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_index, ]
y_train <- y[train_index]
X_test <- X[-train_index, ]
y_test <- y[-train_index]
```

4. Fit the logistic regression model:
```R
# Creating a glm object with family as binomial
model <- glm(y_train ~ ., data = data.frame(X_train, y_train), family = binomial)

summary(model) # This will provide a summary of the model.
```
In this code, 'glm' stands for Generalized Linear Model. The formula 'y_train ~ .' means that we are predicting the variable 'y_train' based on all other variables in our dataset.

5. Making predictions:
```R
# Predicting probabilities
predictions_probs <- predict(model, newdata = data.frame(X_test), type = "response")

# Predicting classes
predictions_classes <- ifelse(predictions_probs > 0.5, 1, 0)
```
Here, 'predict' function is used to make predictions from the model. The argument 'newdata' is a dataframe containing the features for which we want to predict the target variable.

6. Evaluating the model:
```R
# Confusion Matrix
conf_matrix <- table(Predicted = predictions_classes, Actual = y_test)

# Accuracy
accuracy <- sum(diag(conf_matrix)) / length(y_test)
```
Here, 'table' function is used to create a confusion matrix and 'diag' function extracts the diagonal of this matrix which represents the number of correct predictions. The accuracy is calculated as the sum of correct predictions divided by the total number of test samples.

Please note that this is a basic example and logistic regression might need more preprocessing steps depending on your dataset, such as handling missing values, scaling features, or encoding categorical variables. Also, you can use other model evaluation metrics like Precision, Recall, F1 Score, etc., based on your problem requirements.
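
Note that the generated code is not self-contained: `createDataPartition()` comes from the caret package, which is never loaded, and the snippet assumes an external CSV file. As a quick check on the generated advice, here is a minimal logistic regression that runs out of the box on the built-in `mtcars` data, predicting transmission type (`am`: 0 = automatic, 1 = manual) from weight:

```r
# Fit a logistic regression with the built-in mtcars data
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Predicted probability of a manual transmission for a 3,000 lb car (wt = 3)
predict(fit, newdata = data.frame(wt = 3), type = "response")
```

Heavier cars in `mtcars` tend to have automatic transmissions, so the coefficient for `wt` is negative.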

7 References formatter

References to be formatted (taken from https://doi.org/10.1016/j.gore.2022.101024)

refs = "
Bekkers, S., Bot, A.G., Makarawung, D., et al., 2014. The National Hospital Discharge
Survey and Nationwide Inpatient Sample: the databases used affect results in THA
research. Clin. Orthop. Relat. Res. 472, 3441–3449.

Chubak, J., Ziebell, R., Greenlee, R.T., Honda, S., et al., 2016. The Cancer Research
Network: a platform for epidemiologic and health services research on cancer
prevention, care, and outcomes in large, stable populations. Cancer Causes Control.
27 (11), 1315–1323.

Dreyer, N.A., Tunis, S.R., Berger, M., Ollendorf, D., Mattox, P., Gliklich, R., 2010. Why
observational studies should be among the tools used in comparative effectiveness
research. Health Aff. (Millwood). 29 (10), 1818–1825.

Husereau, D., Drummond, M., Augustovski, F., et al., 2022. Consolidated health
economic evaluation reporting standards 2022 (CHEERS 2022) statement: updated
reporting guidance for health economic evaluations. Int. J. Technol. Assess. Health
Care. 38 (1).
"

Put everything together nicely in a tibble

q_text = tribble(
  ~role,    ~content,
  "system", "Convert the given references in APA 7 format. Do not comment or elaborate.",
  "user",   refs
)
q_text
# A tibble: 2 × 2
  role   content                                                                
  <chr>  <chr>                                                                  
1 system "Convert the given references in APA 7 format. Do not comment or elabo…
2 user   "\nBekkers, S., Bot, A.G., Makarawung, D., et al., 2014. The National …

Ask it to do the job

query(q_text, model_text, output = "text", 
      model_params = list(num_ctx = 2000)) |> cat()
Bekkers, S., Bot, A. G., Makarawung, D., et al., (2014). The National Hospital Discharge Survey and Nationwide Inpatient Sample: The databases used affect results in THA research. Clinical Orthopaedics and Related Research, 472, 3441–3449.

Chubak, J., Ziebell, R., Greenlee, R. T., Honda, S., et al., (2016). The Cancer Research Network: A Platform for Epidemiologic and Health Services Research on Cancer Prevention, Care, and Outcomes in Large, Stable Populations. Cancer Causes Control, 27(11), 1315–1323.

Dreyer, N. A., Tunis, S. R., Berger, M., Ollendorf, D., Mattox, P., Gliklich, R., (2010). Why Observational Studies Should Be Among the Tools Used in Comparative Effectiveness Research. Health Affairs, 29(10), 1818–1825.

Husereau, D., Drummond, M., Augustovski, F., et al., (2022). Consolidated health economic evaluation reporting standards 2022 (CHEERS 2022) statement: Updated Reporting Guidance for Health Economic Evaluations. International Journal of Technology Assessment in Health Care, 38(1).

Here we increase the context window to 2000. The context window (or context size) is the number of tokens that the LLM can receive or produce as input or output. It is roughly 3/2 times the number of words in a given text. Please ask your LLM companion for more information about it :-)
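
As a rough rule of thumb only (exact counts depend on the model's tokenizer), the words-to-tokens heuristic can be sketched as:

```r
# Estimate a context size from a rough tokens-per-word ratio of 3/2
estimate_ctx <- function(text, tokens_per_word = 3 / 2) {
  n_words <- length(strsplit(trimws(text), "\\s+")[[1]])
  ceiling(n_words * tokens_per_word)
}
estimate_ctx("The quick brown fox jumps over the lazy dog")  # 9 words -> 14
```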

8 Summarize abstracts from PubMed

Get abstracts from PubMed on partial verification bias, from 2020 to 2025

library(pubmedR)
library(bibliometrix)
library(stringr)
api_key = NULL
query_pvb = "partial verification bias*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2020:2025[DP]"
res = pmQueryTotalCount(query = query_pvb, api_key = api_key)
D = pmApiRequest(query = query_pvb, limit = res$total_count, api_key = api_key)
Documents  11  of  11 
M = convert2df(D, dbsource = "pubmed", format = "api")

Converting your pubmed collection into a bibliographic dataframe

================================================================================
Done!

Flatten the abstracts into a single text string

key = grep("PARTIAL VERIFICATION BIAS", M$TI)  # Select records whose title contains "partial verification bias"
text_abs = str_flatten(M$AB[key])

Estimate the context size required by the LLM

ctx = str_length(text_abs) * 3 / 2
ctx = round(ctx, -3)  # with large ctx, be careful with VRAM use
ctx
[1] 5000

Setup the query and get the result

q_text = tribble(
  ~role, ~content,
  "system", "Summarize the content of the given text.",
  "user", text_abs,
)
query(q_text, model_text, output = "text", 
      model_params = list(num_ctx = ctx)) |> cat()
The text discusses the importance of evaluating new diagnostic tests in medical care, particularly when it comes to sensitivity and specificity measures. However, these measures are often biased due to partial verification bias (PVB), where only patients who test positive for a disease receive further testing or verification, while those who test negative may not receive the same level of attention.

The article presents an investigation into using Inverse Probability Bootstrapping (IPB) sampling as a method to correct PVB in diagnostic accuracy studies. The results show that IPB is accurate for estimating sensitivity and specificity but less precise than existing methods, with a higher standard error.

Despite this limitation, the authors recommend using IPB when subsequent analysis with full data analytic methods is expected. They also suggest other methods for correcting PVB, such as applying the reference standard to all individuals who test positive or negative on an index test, and adjusting for sampling fractions in test-negative groups.

The article aims to provide a practical tutorial on how to implement these methods using R programming language, which can help researchers correct partial verification bias in diagnostic accuracy studies.

You will notice that it refers to “an article” because we combined the abstracts into a single text.

9 Understanding numerical embedding

Embedding is a crucial component of LLMs. It converts words into numerical vectors that preserve semantic meaning, capture relationships, and enable contextual understanding for machines. It is used in retrieval augmented generation (RAG, see: https://ollama.com/blog/embedding-models) and supervised learning (see: https://jbgruber.github.io/rollama/articles/text-embedding.html).
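
To make “related in semantic meaning” concrete: retrieval typically ranks documents by cosine similarity between the query vector and each document vector. A toy sketch in base R, using made-up 3-dimensional vectors in place of real 768-dimensional embeddings:

```r
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

query_vec <- c(0.9, 0.1, 0.0)        # made-up "query" embedding
doc_vecs <- rbind(
  doc1 = c(0.8, 0.2, 0.1),           # points in a similar direction
  doc2 = c(0.0, 0.1, 0.9)            # points in a very different direction
)
sims <- apply(doc_vecs, 1, cosine_sim, b = query_vec)
names(which.max(sims))  # "doc1" is retrieved as the most similar
```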

Let’s revisit our examples from the earlier presentation; the texts were taken from https://ollama.com/blog/embedding-models. To build our understanding, we try to retrieve the relevant text (i.e. the one related in semantic meaning) without using ChromaDB and Python.

We convert the strings into vectors via the embedding model.

docs = c(
  "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old"
)
nums = embed_text(docs, "nomic-embed-text")

This generates a 768-dimensional vector for each of the six text strings.

dim(nums)
[1]   6 768
head(nums)
# A tibble: 6 × 768
   dim_1   dim_2 dim_3  dim_4  dim_5  dim_6  dim_7   dim_8   dim_9  dim_10
   <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
1 0.551   0.344  -3.61 -1.23   0.433 -0.503 -1.70  -1.54   -0.690  -0.364 
2 0.647  -0.0234 -4.03 -0.805  0.335  0.171 -1.19  -0.178  -0.570   0.240 
3 0.961   1.28   -3.66 -0.773 -1.24   0.678 -0.610 -0.242  -0.874   0.138 
4 0.0560  0.729  -3.90 -1.20  -0.296  0.714 -1.18  -0.220  -1.30    0.173 
5 0.0300  1.87   -3.79 -0.370  0.441 -0.649 -1.73  -0.894  -1.30    0.0216
6 0.542  -0.127  -3.93 -1.13  -0.647  1.11  -1.14   0.0426 -0.0564  0.190 
# ℹ 758 more variables: dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#   dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>,
#   dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#   dim_25 <dbl>, dim_26 <dbl>, dim_27 <dbl>, dim_28 <dbl>, dim_29 <dbl>,
#   dim_30 <dbl>, dim_31 <dbl>, dim_32 <dbl>, dim_33 <dbl>, dim_34 <dbl>,
#   dim_35 <dbl>, dim_36 <dbl>, dim_37 <dbl>, dim_38 <dbl>, dim_39 <dbl>,
#   dim_40 <dbl>, dim_41 <dbl>, dim_42 <dbl>, dim_43 <dbl>, dim_44 <dbl>, …

For example “Llamas are members of the camelid family meaning they’re pretty closely related to vicunas and camels” becomes

docs[1]
[1] "Llamas are members of the camelid family meaning they're pretty closely related to vicunas and camels"
nums[1,] |> round(2) |> as.numeric()
  [1]  0.55  0.34 -3.61 -1.23  0.43 -0.50 -1.70 -1.54 -0.69 -0.36 -0.55  1.29
 [13]  1.49  0.42  0.38 -0.48 -1.03 -0.55  1.03  0.51 -1.85  0.78  0.66 -0.23
 [25]  1.22 -0.16 -0.60  0.78  0.24 -0.29  2.44 -0.14  0.81 -0.51  0.01  0.68
 [37]  1.52  0.39  1.84  0.24  1.31 -0.22  0.21  0.54  0.74 -0.52  0.68  0.90
 [49]  2.03 -0.09  1.20 -0.43 -0.87  0.15  0.82  0.08  0.80  0.08  0.28 -0.71
 [61]  1.68 -0.08 -0.38  1.99  0.61 -0.68 -1.26  0.32 -0.53  0.12  0.54 -0.06
 [73]  1.10 -0.13 -0.18  0.94  1.12 -0.21 -0.38 -0.74  1.82  0.36  0.58  0.13
 [85]  0.92  0.27 -1.76 -0.21 -0.19  0.54  0.13  0.54  1.44  0.67 -1.01  1.22
 [97] -0.89  0.13 -0.51 -0.81 -0.86 -0.46  1.22  0.19  0.78 -0.03  0.52 -0.62
[109] -0.27 -0.41 -1.51  0.06 -0.12 -0.45 -1.11 -0.97  0.78 -0.90  0.83  0.72
[121] -0.60 -0.92 -0.26  0.80  0.43  0.09 -0.25  0.30  0.97 -0.95 -0.45 -0.53
[133]  0.10 -0.48  0.33 -0.03  0.00 -0.28  0.63  0.67  0.18  1.06  0.47 -0.71
[145]  0.36 -0.83  0.79 -0.94 -1.01 -0.55  0.26  0.96 -0.48  0.33 -0.28 -0.67
[157] -0.43 -0.36 -0.29  0.04  0.45 -0.21  0.21 -0.11 -0.42 -0.55  1.76  0.64
[169]  1.62  1.61 -1.39 -0.39 -0.04 -0.07  1.08 -0.37  0.26 -0.79 -0.18 -0.67
[181] -0.31 -1.42 -0.11  1.34 -0.86  0.09 -0.09 -0.22 -1.07 -1.20 -0.72  1.76
[193]  0.73 -1.24 -0.28 -0.20  0.08 -0.37  0.21 -0.91 -1.09  0.18 -1.85  0.49
[205] -1.07  0.63 -0.15 -1.21 -0.49  0.81  0.48 -0.31 -0.40  0.02 -0.49 -0.82
[217] -0.12 -0.36 -0.39  1.24  0.58  0.79  1.28 -0.79  1.58  0.19 -1.26  0.44
[229] -0.61  0.23  0.20 -0.06  1.42 -0.30  0.66  0.14  0.48  0.45 -0.16 -0.17
[241] -0.43 -0.61 -0.14 -0.31 -1.36 -0.11  0.35 -0.96 -0.37  1.70  0.30 -0.95
[253]  0.62  0.16  0.41 -0.09 -0.22 -0.59  0.04 -0.21  0.37 -0.81  0.77 -0.59
[265] -0.55 -0.65  0.15 -0.07 -0.24 -0.82 -0.51  0.82  1.12 -0.43  0.11 -0.09
[277] -0.31 -0.49 -0.62 -0.20  1.49 -1.27 -0.64 -0.51 -0.04  0.72  0.98  0.30
[289]  0.28 -0.09  0.47  0.78  0.18  1.53  0.73 -0.21  2.04  0.64 -0.26  0.49
[301] -0.10  0.00  1.19  0.26  0.10  0.07  0.22 -0.68  0.99 -0.70 -0.27 -0.10
[313] -0.33  0.13 -0.19  1.00 -0.06  0.20  1.08  1.08 -0.10 -1.10 -0.42 -1.16
[325]  0.88  0.27 -0.69  0.43  0.06 -1.00 -0.22  1.15  1.22  0.12 -0.23  1.14
[337] -1.09 -0.27 -0.31  0.45  0.73 -0.49  1.64 -0.83 -1.28  0.59 -0.43 -0.11
[349]  0.43 -0.79 -0.63  0.13  0.41  0.34  0.60 -0.93  0.31 -0.15  0.47 -0.15
[361]  0.98 -0.15 -0.19 -0.51  0.54 -0.17  0.37  0.29  0.94 -0.32 -0.63  0.61
[373] -0.73  0.03  0.75  0.58 -1.01 -0.94 -1.89 -0.68  0.45 -0.13  0.39  1.17
[385]  0.54 -0.04  0.58 -0.66  0.05  0.45 -0.06 -0.18 -1.43 -0.69  0.06 -0.55
[397]  1.61  0.05 -0.82  1.60 -0.89 -0.79  0.06  0.11  0.37  0.94 -0.51 -0.93
[409]  1.22  0.13  1.02  0.18  0.94 -0.19 -0.28  2.08 -0.06  0.58 -0.46 -0.11
[421] -0.05  0.96 -0.13  0.76  0.09  0.27 -0.12  0.50 -0.05 -0.55  0.22  1.17
[433]  1.48  0.03  0.50 -0.58 -0.08  0.93  0.35 -0.08  1.15 -0.12 -0.66 -0.60
[445]  0.79  1.91  0.71 -0.42  0.07  0.98 -0.08  0.72  0.57 -0.11  1.30 -0.57
[457] -0.23 -0.51  1.61  0.37  0.84  0.16 -0.46  0.95  0.32 -0.71  0.50  0.38
[469] -0.35  0.82 -0.60  0.69 -0.37 -0.42 -1.42 -0.62 -0.31  0.32  0.87  0.71
[481]  0.78  0.30 -1.00 -2.85  0.26  0.94  0.61 -0.40 -0.32  0.42  0.37 -0.58
[493]  0.68  0.35  0.02 -1.25 -0.15  0.23  2.00  0.59  0.42  1.34 -0.25 -0.63
[505]  1.01 -0.19  0.10 -1.10 -0.72  0.41 -1.30  0.22 -0.99  0.60  1.20 -1.17
[517]  0.92  0.29 -0.30  0.08 -0.34 -0.63  0.03 -0.46 -0.76 -0.25  0.00 -0.91
[529] -0.04 -0.18 -0.63 -0.25 -0.45 -0.07 -0.54 -0.50  0.27  1.49  0.83 -0.20
[541]  0.76 -0.91 -0.36  0.55  0.07  0.27 -0.48 -0.37  0.16  0.25 -0.37 -1.46
[553] -0.10 -1.16 -0.06 -0.94 -0.13  0.12 -0.73 -0.16  0.19  0.35  0.43 -0.54
[565] -0.79 -0.88  0.00  0.05  0.23  0.30  0.41 -1.35 -0.17  0.14  0.88 -0.58
[577]  0.10 -0.17 -0.68 -0.73  0.01 -1.89 -1.06  0.26 -0.38  0.53 -0.70 -0.30
[589] -0.21 -0.63  0.51  0.92 -1.04 -0.87  0.08 -0.58  0.26 -0.76 -0.50 -1.14
[601]  0.85 -1.10  1.52  0.07 -1.04  0.53  0.09  0.95  0.22 -0.36  0.75  0.47
[613] -0.38 -0.37  0.46 -0.34  0.23 -1.42 -1.14  0.16 -0.74 -0.69 -0.23  0.30
[625] -0.37  0.70 -2.32 -0.24  0.48  0.41 -1.28 -0.18 -0.72 -1.15 -0.28 -0.04
[637] -1.45 -0.37 -0.20  0.78  0.52 -0.54 -0.57  1.03  1.04 -0.67  0.08  0.21
[649]  0.47  0.14  1.91  1.43  0.86  0.73  0.18 -0.54  0.59 -1.43 -1.07 -0.21
[661] -0.33  1.20 -0.63  0.59 -0.25  0.15  0.02 -1.13 -1.60  0.87  0.38 -0.63
[673] -0.10 -0.79 -0.13 -0.39  0.34  0.33  1.75 -0.16  0.25 -0.32 -0.72 -0.23
[685] -0.60 -0.72  0.94 -1.13  0.09 -1.17 -0.48 -1.45  0.00 -0.46  0.17 -0.05
[697]  0.24  0.00 -0.32  1.58 -0.70  0.98  0.21  1.02  0.46 -0.67  0.06  1.69
[709]  0.19 -0.28 -1.10  1.07 -0.46  1.68  1.07 -0.61 -0.71 -0.06  0.01  0.22
[721]  0.76 -0.66  0.68 -0.29 -0.45 -1.14  0.95 -2.15 -0.04 -0.69  0.60 -0.21
[733] -0.30  0.50 -0.30  0.10 -0.79  0.28  0.67 -0.06 -1.20 -0.14 -0.19  0.41
[745]  1.09 -1.69 -0.52  1.23  0.21  0.07  0.49 -0.82  0.07 -0.24  0.31  0.29
[757] -0.92  0.67  0.94 -0.24  1.26 -0.13  0.99  0.31 -1.40 -1.10 -2.36 -0.93

and “Llamas are vegetarians and have very efficient digestive systems” becomes

docs[5]
[1] "Llamas are vegetarians and have very efficient digestive systems"
nums[5,] |> round(2) |> as.numeric()
  [1]  0.03  1.87 -3.79 -0.37  0.44 -0.65 -1.73 -0.89 -1.30  0.02  0.49  1.28
 [13]  1.11  0.85  0.59 -0.22 -0.32 -1.05  0.78  0.74 -1.58  1.00  0.29 -0.65
 [25]  1.49  0.19 -0.96  2.20  0.39 -0.14  1.04 -1.08 -0.27 -0.06  0.43  0.42
 [37]  2.19  0.17  1.60 -0.63  1.06  0.15  0.78 -0.11  0.67  0.18  0.96  0.29
 [49]  0.89 -1.16  0.03 -0.94 -1.12  0.32  1.27 -0.63  0.05 -0.24 -0.02 -1.02
 [61]  1.16 -0.16 -0.85  2.17  0.04 -1.57 -0.72  0.10 -0.44 -0.05  0.96 -0.93
 ... (remaining elements omitted; the full embedding vector has 768 elements)

Now, given a query, how can we find the most related text strings based on these numerical vectors?
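The standard answer is cosine similarity: the cosine of the angle between two vectors, i.e. their dot product divided by the product of their norms. A minimal base-R version (`cosine_sim` is just an illustrative name here; below we use `cosine()` from the lsa package instead) makes the computation explicit:

```r
# Cosine similarity between two numeric vectors:
# dot product divided by the product of the vector norms.
cosine_sim = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Vectors pointing in the same direction give 1; orthogonal vectors give 0.
cosine_sim(c(1, 2, 3), c(2, 4, 6))  # 1
cosine_sim(c(1, 0), c(0, 1))        # 0
```

Texts with similar meanings get embedding vectors pointing in similar directions, so a higher cosine similarity indicates more closely related texts.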

library(lsa)  # for cosine similarity
q_text1 = "What animals are llamas related to?"
q_text2 = "How long llamas can live?"

For the first query, “What animals are llamas related to?”

q_num = embed_text(q_text1, model_embed) # convert the query into a numerical vector
mat = rbind(as.numeric(q_num), as.matrix(nums)) # bind the query vector with the document embeddings
mat = t(mat) # transpose: cosine() in lsa expects vectors as columns
cos_sim = cosine(mat)[, 1] # cosine similarities between the query (column 1) and all vectors
tib = tibble(Text = c(q_text1, docs), Similarity = cos_sim)  # bind each text with its similarity value
tib = tib |> dplyr::arrange(dplyr::desc(Similarity)) # sort by cosine similarity, descending
tib
# A tibble: 7 × 2
  Text                                                            Similarity
  <chr>                                                                <dbl>
1 What animals are llamas related to?                                  1    
2 Llamas are members of the camelid family meaning they're prett…      0.899
3 Llamas are vegetarians and have very efficient digestive syste…      0.800
4 Llamas weigh between 280 and 450 pounds and can carry 25 to 30…      0.777
5 Llamas were first domesticated and used as pack animals 4,000 …      0.759
6 Llamas can grow as much as 6 feet tall though the average llam…      0.725
7 Llamas live to be about 20 years old, though some only live fo…      0.724

This shows that “Llamas are members of the camelid family meaning they’re pretty closely related to vicunas and camels” is the most similar text to the query (similarity = 0.899).

For the second query, “How long llamas can live?”

q_num = embed_text(q_text2, model_embed)
mat = rbind(as.numeric(q_num), as.matrix(nums))
mat = t(mat)
cos_sim = cosine(mat)[, 1]
tib = tibble(Text = c(q_text2, docs), Similarity = cos_sim)
tib = tib |> dplyr::arrange(dplyr::desc(Similarity))
tib
# A tibble: 7 × 2
  Text                                                            Similarity
  <chr>                                                                <dbl>
1 How long llamas can live?                                            1    
2 Llamas live to be about 20 years old, though some only live fo…      0.915
3 Llamas can grow as much as 6 feet tall though the average llam…      0.783
4 Llamas weigh between 280 and 450 pounds and can carry 25 to 30…      0.768
5 Llamas were first domesticated and used as pack animals 4,000 …      0.742
6 Llamas are vegetarians and have very efficient digestive syste…      0.725
7 Llamas are members of the camelid family meaning they're prett…      0.696

This correctly identifies “Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old” as the most similar text (similarity = 0.915).
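Since the same embed-compare-rank steps are repeated for every query, they can be wrapped in a small convenience function. The sketch below is only illustrative: `find_similar` is a name I made up, and it assumes the `docs`, `nums`, and `model_embed` objects from the earlier steps, plus a running Ollama server.

```r
# Hypothetical helper: embed a query, compute cosine similarities against
# the stored document embeddings, and return the texts ranked by similarity.
# Assumes `docs` (character vector of texts) and `nums` (their embeddings,
# one row per text) as prepared earlier, and a running Ollama server.
find_similar = function(query, docs, nums, model = model_embed) {
  q_num = embed_text(query, model)                   # query as a numerical vector
  mat = t(rbind(as.numeric(q_num), as.matrix(nums))) # columns = vectors, as lsa::cosine() expects
  sims = lsa::cosine(mat)[, 1]                       # similarity of each text to the query
  tibble(Text = c(query, docs), Similarity = sims) |>
    dplyr::arrange(dplyr::desc(Similarity))
}

find_similar("What animals are llamas related to?", docs, nums)
```

This keeps each new query to a single call instead of repeating the five lines above.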

References

Aria, M. (2020). pubmedR: Gathering metadata about publications, grants, clinical trials from PubMed database. Retrieved from https://github.com/massimoaria/pubmedR
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics. https://doi.org/10.1016/j.joi.2017.08.007
Aria, M., & Cuccurullo, C. (2024). Bibliometrix: Comprehensive science mapping analysis. Retrieved from https://www.bibliometrix.org
Gruber, J. B., & Weber, M. (2024). Rollama: Communicate with ollama to run large language models locally. Retrieved from https://jbgruber.github.io/rollama/
R Core Team. (2024). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/