Building a RAG chain from Wikipedia

This notebook shows how to use ApertureDB as part of a Retrieval-Augmented Generation (RAG) pipeline built with LangChain. We use ApertureDB as a vector-based search engine to find documents that match the query, and then use a large language model (LLM) to generate an answer based on those documents.

We already have a corpus of >600k paragraphs from the Simple English Wikipedia with associated embeddings provided by Cohere. (If you have not loaded this corpus yet, see Ingesting Wikipedia into ApertureDB.) We'll use that to answer natural-language questions.

RAG workflow

Install Dependencies

%pip install --quiet aperturedb langchain langchain-core langchain-community langchainhub langchain-cohere
Note: you may need to restart the kernel to use updated packages.

Choose a prompt

The prompt ties together the source documents and the user's query, and also sets some basic parameters for the chat engine.

from langchain_core.prompts import PromptTemplate
prompt = PromptTemplate.from_template("""You are an assistant for question-answering tasks. Use the following documents to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. Additionally, you should always indicate which documents support each part of your answer.
Question: {question}
{context}
Answer:""")
print(prompt.template)
You are an assistant for question-answering tasks. Use the following documents to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.  Additionally, you should always indicate which documents support each part of your answer.
Question: {question}
{context}
Answer:
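
To make the role of the two placeholders concrete, here is a minimal sketch that fills a template with plain `str.format`, which is roughly what happens for a simple template like this one. The template is a shortened stand-in for the prompt above, and the context string is made-up toy text, not retrieved output.

```python
# Shortened stand-in for the RAG prompt above (same placeholders).
template = (
    "Use the following documents to answer the question.\n"
    "Question: {question}\n"
    "{context}\n"
    "Answer:"
)

# Toy values: in the real chain, "context" comes from the retriever.
filled = template.format(
    question="What animals did the Inuit domesticate?",
    context="Document 1: They did not domesticate any animals except for dogs.",
)
print(filled)
```

The filled text is what the language model ultimately receives: the instructions, the user's question, the numbered documents, and a trailing "Answer:" cue.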

For comparison, we're also going to ask the same questions of the language model without using documents. This prompt is for a non-RAG chain.

from langchain_core.prompts import PromptTemplate
prompt2 = PromptTemplate.from_template("""You are an assistant for question-answering tasks. Answer the question from your general knowledge. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Answer:""")
print(prompt2.template)
You are an assistant for question-answering tasks. Answer the question from your general knowledge.  If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Answer:

Cohere API Key

In order to continue with this demo, you will need to enter an API key for Cohere. An evaluation API key can be obtained for free from dashboard.cohere.com/api-keys.

import os
from getpass import getpass

os.environ['COHERE_API_KEY'] = getpass()
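
If you re-run the notebook often, you may prefer to be prompted only when the key is missing. The following is a small sketch of that pattern; `ensure_env` is a hypothetical helper introduced here for illustration, not part of any library.

```python
import os
from getpass import getpass

def ensure_env(name, prompt_fn=getpass):
    """Prompt for a secret only if the environment variable is not already set.

    Hypothetical helper: prompt_fn is injectable so the function can be
    exercised without an interactive terminal.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"{name}: ")
    return os.environ[name]

# Usage in the notebook (interactive, so left commented out here):
# ensure_env("COHERE_API_KEY")
```

The same helper could be used for any other secret the notebook reads from the environment.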

Select an embedding scheme

Here we select the embedding scheme that matches the embeddings we have preloaded.

from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-multilingual-v3.0")

emb = embeddings.embed_query("Hello, world!")
print(emb[:10], len(emb))
[0.0030612946, 0.046173096, 0.024490356, 0.032440186, -0.028900146, -0.026855469, -0.02810669, -0.03074646, -0.068481445, 0.033966064] 1024
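
The reason the embedding scheme has to match is that vector search compares the query vector with the stored vectors, and that comparison is only meaningful when both come from the same model and have the same length. The sketch below illustrates this with toy three-dimensional vectors instead of real 1024-dimensional embeddings. Cosine similarity is used here as an assumption for illustration; the metric configured for the descriptor set is not stated in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.3, 0.5]
same_direction = [0.2, 0.6, 1.0]   # query_vec scaled by 2
orthogonal = [0.5, 0.0, -0.1]      # dot product with query_vec is zero

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
```

A vector produced by a model with a different output dimension would raise the `ValueError` above; one with the same dimension but from a different model would compare without error and simply give meaningless scores.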

Select a vectorstore

Here we're using an instance of ApertureDB that has already been pre-loaded with a selection of paragraphs from Wikipedia.

First activate the connection to ApertureDB.

import os
from getpass import getpass
os.environ['APERTUREDB_JSON'] = getpass()

Create vectorstore

Now we create a LangChain vectorstore object, backed by the ApertureDB instance we have already uploaded documents to.

from langchain_community.vectorstores import ApertureDB
import logging
import sys
# date_strftime_format = "%Y-%m-%d %H:%M:%S"
# logging.basicConfig(stream=sys.stdout, level=logging.WARN,
# format="%(asctime)s %(levelname)s %(funcName)s %(message)s", datefmt=date_strftime_format)

DESCRIPTOR_SET = "cohere_wikipedia_2023_11_embed_multilingual_v3"

vectorstore = ApertureDB(embeddings=embeddings,
                         descriptor_set=DESCRIPTOR_SET)

Create a retriever

The retriever is responsible for finding the most relevant documents in the vectorstore for a given query. Here we're using the "max marginal relevance" (MMR) retriever, which is a simple but effective way to find a diverse set of documents that are relevant to a query. For each query, we pass 10 documents to the LLM, but we do so by fetching the 100 nearest candidates and then selecting 10 of them using the MMR algorithm.

search_type = "mmr" # "similarity" or "mmr"
k = 10 # number of results used by LLM
fetch_k = 100 # number of results fetched for MMR
retriever = vectorstore.as_retriever(search_type=search_type,
                                     search_kwargs=dict(k=k, fetch_k=fetch_k))
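
To show what the MMR selection step does with the fetched candidates, here is a minimal pure-Python sketch of the general technique. It is an illustration under stated assumptions, not the code that LangChain or ApertureDB runs: the vectors are toy two-dimensional values, cosine similarity is assumed as the metric, and `mmr_select` is a hypothetical function name. The `lambda_mult` parameter mirrors the name of LangChain's MMR diversity setting (values near 1 favor relevance, values near 0 favor diversity).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 1.0]
candidates = [
    [1.0, 0.90],  # 0: most relevant
    [1.0, 0.89],  # 1: nearly a duplicate of candidate 0
    [0.6, 1.00],  # 2: slightly less relevant, but different
]

by_similarity = sorted(
    range(len(candidates)),
    key=lambda i: cosine(query, candidates[i]),
    reverse=True,
)[:2]
by_mmr = mmr_select(query, candidates, k=2)
print("similarity:", by_similarity)
print("mmr:", by_mmr)
```

Plain similarity ranking keeps the two near-duplicates, while MMR keeps the most relevant candidate and then prefers the one that adds new information. Fetching many more candidates than are finally used (`fetch_k` much larger than `k`) gives the selection step more room to find diverse results.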

Select an LLM engine

Here we're again using Cohere, but there's no need to use the same provider as we used for embeddings.

from langchain_cohere import ChatCohere

llm = ChatCohere(model="command-r")

Build the chain

Now we put it all together. The chain is responsible for taking a user query and returning a response. It does this by first retrieving the most relevant documents using vector search, then using the LLM to generate a response.

For demonstration purposes, we're printing the documents that were retrieved, but in a real application you would probably want to hide this information from the user.

from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n".join(f"Document {i}: " + doc.page_content for i, doc in enumerate(docs, start=1))


rag_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain)
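
The data flow through this chain can be easier to follow when written as ordinary function calls. The sketch below mimics the same steps with stand-ins: `FakeDoc`, `fake_retriever`, `fake_llm`, and `run_rag` are all hypothetical names invented for illustration, the documents are toy text, and the prompt is a shortened version of the one defined earlier. It does not call LangChain, ApertureDB, or Cohere.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str

def fake_retriever(question):
    # A real retriever would run a vector search; this returns fixed toy text.
    return [
        FakeDoc("Inuit used dogs to pull their sleds."),
        FakeDoc("Inuit hunted seals and whales."),
    ]

def format_docs(docs):
    return "\n\n".join(
        f"Document {i}: " + doc.page_content for i, doc in enumerate(docs, start=1)
    )

def fake_llm(prompt_text):
    # A real LLM would generate an answer; this only reports the prompt size.
    return f"(answer based on a prompt of {len(prompt_text)} characters)"

def run_rag(question):
    # Step 1: like RunnableParallel, build a dict holding the retrieved
    # documents and the untouched question.
    state = {"context": fake_retriever(question), "question": question}
    # Step 2: like .assign(answer=rag_chain), format the documents, fill the
    # prompt, call the model, and store the result under "answer".
    prompt_text = "Question: {question}\n{context}\nAnswer:".format(
        question=state["question"], context=format_docs(state["context"])
    )
    state["answer"] = fake_llm(prompt_text)
    return state

result = run_rag("What animals did the Inuit domesticate?")
print(sorted(result))
print(format_docs(result["context"]))
```

The point of the two-stage design is that the result keeps the original documents under `"context"` alongside the `"answer"`, so the sources can be displayed next to the generated text.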

For comparison, this chain does not use RAG: it sends the question straight to the LLM with no retrieved documents.

plain_chain = (
    {"question": RunnablePassthrough()}
    | prompt2
    | llm
    | StrOutputParser()
)

Look at some documents

In order to come up with questions that match the corpus, it might be a good idea to look at some random documents.

from aperturedb.CommonLibrary import create_connector
offset = 0
query = [{
    "FindDescriptor": {
        "set": DESCRIPTOR_SET,
        "results": {"list": ["text", "lc_title"], "limit": 10},
        "offset": offset,
        "sort": {"key": "uniqueid"},
    }
}]
client = create_connector()
response, _ = client.query(query)
print(response)
for i, result in enumerate(list(response[0].values())[0]["entities"], start=1):
    print(f"{i}. {result['lc_title']}: {result['text']}")
[{'FindDescriptor': {'entities': [{'lc_title': 'Mathematics', 'text': '|| Ordinal numbers || Cardinal numbers || Arithmetic operations || Arithmetic relations || Functions, see also special functions'}, {'lc_title': 'Vector', 'text': 'To multiply a vector by a scalar (a normal number), you multiply the number by each component of the vector:'}, {'lc_title': 'Natural resource', 'text': 'A renewable resource is one that can be used again and again. For example, soil, sunlight and water are renewable resources. However, in some circumstances, even water is not renewable easily. Wood is a renewable resource, but it takes time to renew, and in some places, people use the land for something else. Soil, if it blows away, is not easy to renew.'}, {'lc_title': 'Finland', 'text': 'Finland (Finnish: Suomi) is a country in Northern Europe and is a member state of the European Union. Finland is one of the Nordic countries and is also part of Fennoscandia. Finland is located between the 60th and 70th latitudes North. Its neighbours are Sweden to the west, Norway to the north, Russia to the east and Estonia to the south, beyond the sea called Gulf of Finland. Most of western and southern coast is on the shore of the Baltic Sea.'}, {'lc_title': 'Russian language', 'text': 'In Russian, an adjective must agree with the word that it describes in gender, case and number. In the nominative case, adjectives that describe feminine words usually end in -ая or -яя. Those that describe masculine words usually end in -ый, -ий or -ой. Those that describe neuter words usually end in -ое or -ее. Those that describe plural words usually end in -ые or -ие. 
The endings change depending on case.'}, {'lc_title': '2004', 'text': 'March 1 - Prime Minister Ahmed Qurie blasted ongoing Israeli extrajudicial executions of Palestinian activists, which claimed two more lives on Sunday, and blamed Israel for the weekend of violence, whilst accusing his Israeli counterpart’s government of trying "to kill any possibility for (achieving a) mutual cease-fire".'}, {'lc_title': 'Medicine', 'text': "Doctors in this field, abbreviated OBGYN or Obs/Gyn, specialize in women's health covering conditions of the female reproductive organs, and pregnancy care and delivery. Some examples of gynecological issues they deal with include contraceptive medicine, fertility workup and treatments, prolapse and incontinence, sexual health, ovarian tumors/ cysts, gynecological oncology. They are also surgeons in their fields, capable of performing numerous gynecological surgeries. Doctors in this field also practice obstetrical medicine, specializing in maternal fetal care and deliveries, complications related to deliveries, assisted deliveries (such as vacuum and forceps deliveries) and Caesarian sections."}, {'lc_title': 'Acceleration', 'text': 'Acceleration has its own units of measurement. For example, if velocity is measured in meters per second, and if time is measured in seconds, then acceleration is measured in meters per second squared  (m/s2).'}, {'lc_title': 'Like', 'text': 'This cheese sandwich feels like rubber = the sandwich is difficult to eat, nearly the same as rubber.'}, {'lc_title': 'Inuit', 'text': 'Inuit were also Nomadic people, but they did not domesticate any animals except for dogs, which they used to pull their sleds and help with the hunting. They were hunter-gatherers, living off the land. They were very careful to make good use of every part of the animals they killed. Respect for the land and the animals they harvested was and is a focal part of their culture.'}], 'returned': 10, 'status': 0}}]
1. Mathematics: || Ordinal numbers || Cardinal numbers || Arithmetic operations || Arithmetic relations || Functions, see also special functions
2. Vector: To multiply a vector by a scalar (a normal number), you multiply the number by each component of the vector:
3. Natural resource: A renewable resource is one that can be used again and again. For example, soil, sunlight and water are renewable resources. However, in some circumstances, even water is not renewable easily. Wood is a renewable resource, but it takes time to renew, and in some places, people use the land for something else. Soil, if it blows away, is not easy to renew.
4. Finland: Finland (Finnish: Suomi) is a country in Northern Europe and is a member state of the European Union. Finland is one of the Nordic countries and is also part of Fennoscandia. Finland is located between the 60th and 70th latitudes North. Its neighbours are Sweden to the west, Norway to the north, Russia to the east and Estonia to the south, beyond the sea called Gulf of Finland. Most of western and southern coast is on the shore of the Baltic Sea.
5. Russian language: In Russian, an adjective must agree with the word that it describes in gender, case and number. In the nominative case, adjectives that describe feminine words usually end in -ая or -яя. Those that describe masculine words usually end in -ый, -ий or -ой. Those that describe neuter words usually end in -ое or -ее. Those that describe plural words usually end in -ые or -ие. The endings change depending on case.
6. 2004: March 1 - Prime Minister Ahmed Qurie blasted ongoing Israeli extrajudicial executions of Palestinian activists, which claimed two more lives on Sunday, and blamed Israel for the weekend of violence, whilst accusing his Israeli counterpart’s government of trying "to kill any possibility for (achieving a) mutual cease-fire".
7. Medicine: Doctors in this field, abbreviated OBGYN or Obs/Gyn, specialize in women's health covering conditions of the female reproductive organs, and pregnancy care and delivery. Some examples of gynecological issues they deal with include contraceptive medicine, fertility workup and treatments, prolapse and incontinence, sexual health, ovarian tumors/ cysts, gynecological oncology. They are also surgeons in their fields, capable of performing numerous gynecological surgeries. Doctors in this field also practice obstetrical medicine, specializing in maternal fetal care and deliveries, complications related to deliveries, assisted deliveries (such as vacuum and forceps deliveries) and Caesarian sections.
8. Acceleration: Acceleration has its own units of measurement. For example, if velocity is measured in meters per second, and if time is measured in seconds, then acceleration is measured in meters per second squared (m/s2).
9. Like: This cheese sandwich feels like rubber = the sandwich is difficult to eat, nearly the same as rubber.
10. Inuit: Inuit were also Nomadic people, but they did not domesticate any animals except for dogs, which they used to pull their sleds and help with the hunting. They were hunter-gatherers, living off the land. They were very careful to make good use of every part of the animals they killed. Respect for the land and the animals they harvested was and is a focal part of their culture.
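
To browse beyond the first ten documents, the `offset` value in the query can be advanced one page at a time. The sketch below only builds the query dictionaries and does not contact the database; `build_page_query` is a hypothetical helper that mirrors the structure of the query used above.

```python
DESCRIPTOR_SET = "cohere_wikipedia_2023_11_embed_multilingual_v3"

def build_page_query(page, page_size=10, descriptor_set=DESCRIPTOR_SET):
    """Build the FindDescriptor query for one page of results (hypothetical helper)."""
    return [{
        "FindDescriptor": {
            "set": descriptor_set,
            "results": {"list": ["text", "lc_title"], "limit": page_size},
            "offset": page * page_size,
            "sort": {"key": "uniqueid"},
        }
    }]

# Show the offsets that the first three pages would request.
for page in range(3):
    query = build_page_query(page)
    print(page, query[0]["FindDescriptor"]["offset"])
```

Each generated query can be passed to `client.query(query)` in the same way as the query above.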

Run the chain

Now we can enter a query and see the response.

from IPython.display import display, Markdown

def run_query(user_query):
    display(Markdown(f"### User Query\n{user_query}"))

    nonrag_answer = plain_chain.invoke(user_query)
    display(Markdown(f"### Non-RAG Answer\n{nonrag_answer}"))

    rag_answer = rag_chain_with_source.invoke(user_query)
    display(Markdown("\n".join([
        f"### RAG Answer\n{rag_answer['answer']}",
        "### Documents",
        *(f"{i}. **[{doc.metadata['title']}]({doc.metadata['url']})**: {doc.page_content}" for i, doc in enumerate(rag_answer["context"], 1))
    ])))


user_query = input("Enter a question:")
assert user_query, "Please enter a question."
run_query(user_query)

User Query

What animals did the Inuit domesticate?

Non-RAG Answer

The Inuit domesticated the caribou, which they hunted for food, clothing, and equipment. They also kept dogs as pets and for pulling sleds. These animals were integral to Inuit culture and society.

RAG Answer

The Inuit domesticated dogs, which they used for hunting and pulling sleds. They also hunted a variety of animals, including seals, polar bears, caribou, and whales. Some sources suggest that the Inuit did not domesticate any animals other than dogs, while others imply that they hunted any animals they found.

Documents

  1. Inuit: Inuit were also Nomadic people, but they did not domesticate any animals except for dogs, which they used to pull their sleds and help with the hunting. They were hunter-gatherers, living off the land. They were very careful to make good use of every part of the animals they killed. Respect for the land and the animals they harvested was and is a focal part of their culture.
  2. Inuit: Inuit had to be good hunters to survive. When an animal was killed in a hunt, it was thanked respectfully for offering itself to the hunter. They believed it intended to provide itself as a gift towards the survival of the hunter and his children. Their gratitude was deeply sincere and is an important aspect of their belief system. In the winter, seals did not come out onto the ice. They only came up for air at holes they chewed in the ice. Inuit would use their dogs to find the air holes, then wait patiently until the seal came back to breathe and kill it with a harpoon. In the summer, the seals would lie out on the ice enjoying the sun. The hunter would have to slowly creep up on a seal to kill it. The Inuit would use their dogs and spears to hunt polar bears, musk ox, and caribou. Sometimes they would kill caribou from their boats as the animals crossed the rivers on their migration.
  3. Inuit: The Inuit even hunted whales. From their boat, they would throw harpoons that were attached to floats made of inflated seal skins. The whale would grow tired from dragging the floats under the water. When it slowed down and came up to the surface, the Inuit could keep hitting it with more harpoons or spears until it died. Whale blubber provide Vitamin D and Omegas to their cultural diet, and prevented rickets. The whaling industry around the world has depleted the whale population, and now traditional whale hunting for subsistence purposes is rare around the world. Inuits have added to their modern northern diet with grocery foods, which are normally very expensive in the north.
  4. Inuit: Inuit ate both raw and cooked meat and fish, as well as the fetus's of pregnant animals. Whale blubber was burned as fuel for cooking and lamps.
  5. Inuit: Inuit lived in tents made of animal skins during the summer. In the winter they lived in sod houses and igloos. They could build an igloo out of snow bricks in just a couple of hours. Snow is full of air spaces, which helps it hold in warmth. With just a blubber lamp for heat, an igloo could be warmer than the air outside. The Inuit made very clever things from the bones, antlers, and wood they had. They invented the harpoon, which was used to hunt seals and whales. They built boats from wood or bone covered with animal skins. They invented the kayak for one man to use for hunting the ocean and among the pack ice.
  6. Inuit: During the summer months, the Inuit were able to gather berries and roots to eat. They also collected grass to line their boots or make baskets. Often the food they found or killed during the summer was put into a cache for use during the long winter. A cache was created by digging down to the permafrost and building a rock lined pit there. The top would be covered with a pile of rocks to keep out the animals. It was as good as a freezer, because the food would stay frozen there until the family needed it. Inuit cultural traditions and traditional stories provided each new generation with the lifeskills and knowledge to survive their environment and work together. They usually moved around in small groups looking for food, and sometimes they would get together with other groups to hunt for larger animals such as whales. The men did the hunting and home building, and also made weapons, sleds, and boats. The women cooked, made the clothes, and took care of the children. Children and infants under the ages of 5 became easy victims of hypothermia, and if they were to die, their mothers would weight the children's corpses with stones and wrap them in fishnets before placing the bodies through holes in the ice. The mothers believed the children's souls were being offered to the god Phallus, who would reincarnate them as whales.
  7. Inuit: Today, most Inuit live in modern houses. Many still hunt or fish for a major part of their food supply or for income. Seal pelts are used to protect from the extreme Arctic cold. The technology has worked well for many thousands of years. Besides, commercial winter clothes are expensive. Today, they use rifles and snowmobiles when hunting, however traditional values respecting the animals hunted still very much applies. In Alaska, many of the people have received money from the oil discovered in that state on their traditional lands.
  8. Inuit: Inuit sleds could be built from wood, bone, or even animal skins wrapped around frozen fish. Dishes were made from carving soapstone, bones, or musk ox horns. They wore two layers of skins, one fur side in, the other facing out, to stay warm.
  9. Arctic: Eskimos are Arctic people, too. They sometimes ate raw meat. Eskimos were also nomads, but they did not have any animals except for dogs, which they used for pulling their sleds and helping them hunt. They were hunters and gatherers, and they lived off on whatever they found or killed. Like the Lapps, though, they were very careful to make good use of every part of the animals they killed. Eskimos lived in tents during the summer, and sod houses or igloos in the winter. The Eskimos made very clever things from the bones, antlers, and wood they had. They built different kinds of boats.
  10. Inuit: Inuits in Alaska have various concerns, such as protecting the caribou from American oil pipelines. Anti-seal hunt campaigns work to eliminate this aspect of northern culture, which most Inuits regard as vital to their lives.

What's Next?

If you'd like to try assembling your own RAG corpus by crawling a website, see Ingesting a Website into ApertureDB.