Ingesting Wikipedia into ApertureDB

This notebook shows how to take an existing set of documents with embeddings and load them into ApertureDB so that it can be used in a RAG chain to answer questions.

First we need to install a few libraries.

%pip install --quiet aperturedb langchain langchain-community langchainhub datasets
Note: you may need to restart the kernel to use updated packages.

Load dataset

We use the Hugging Face Datasets library to load a dataset provided by Cohere. This contains the content of Wikipedia (from November 2023), already cleaned up, chunked, and with pre-generated embeddings.

We've included a restriction on the number of documents in order to speed you through the notebook and make sure that you don't run out of RAM. Feel free to comment out that line and take coffee breaks instead.

This may take a minute to run. You might see a warning about HF_TOKEN when you run this code. This is harmless.

from datasets import load_dataset
lang = "simple" # Smaller than the "en" dataset
full_dataset = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3", lang)
dataset = full_dataset["train"]
print(len(dataset))
N_DOCS = 10000
dataset = dataset.select(range(N_DOCS)) # Comment this line out to use the full dataset
print(len(dataset))
646424
10000
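
Each record carries the fields we rely on below: _id, url, title, text, and the precomputed embedding in emb. A quick look at the first record confirms the shape of the data:

sample = dataset[0]
print(sample.keys())                              # expect _id, url, title, text, emb
print(sample["title"], "-", sample["text"][:80])
print(len(sample["emb"]))                         # 1024-dimensional Cohere embedding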

Wrap these embeddings for LangChain

LangChain expects a class that will create embeddings on-the-fly, but we have a set of pre-computed embeddings. This is a wrapper class that bridges the gap.

import langchain_core.embeddings

try:
    from typing import override
except ImportError:
    def override(func):
        return func


class PrecomputedEmbeddings(langchain_core.embeddings.embeddings.Embeddings):
    @classmethod
    def from_dataset(class_, dataset):
        result = class_()
        result.index = {doc['text']: doc['emb'] for doc in dataset}
        return result

    @override
    def embed_documents(self, texts):
        # Will throw if text is not in index
        return [self.index[text] for text in texts]

    @override
    def embed_query(self, query):
        # Will throw if text is not in index
        return self.index[query]

Now we can create our LangChain embeddings object that will work on the Wikipedia corpus.

If you elected not to use a subset of documents, this will take a little longer.

embeddings = PrecomputedEmbeddings.from_dataset(dataset)
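
As a quick sanity check, we can ask the wrapper for the embedding of a chunk we know is in the index; any text it hasn't seen would raise a KeyError, because the lookup is a plain dictionary access.

vector = embeddings.embed_query(dataset[0]["text"])
print(len(vector))  # 1024, matching the Cohere embedding dimension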

Connect to ApertureDB

For the next part, we need access to a specific ApertureDB instance. There are several ways to set this up. The code provided here will accept ApertureDB connection information as a JSON string. See our Configuration help page for more options.

! adb config create --from-json --active 
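
If you'd like to verify the connection before going further, a minimal sketch like the following should work. It assumes the create_connector helper in the aperturedb Python package, which builds a client from the active configuration created above, and uses the GetStatus query to ping the instance; adjust to your client version if needed.

from aperturedb.CommonLibrary import create_connector

client = create_connector()                      # uses the active configuration
response, _ = client.query([{"GetStatus": {}}])
print(response)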

Here we create a LangChain vectorstore using ApertureDB. We use the default client configuration that we have already set up.

If you want to create more than one version of the embeddings, then change the DESCRIPTOR_SET name.

We know that the Cohere embeddings are 1024-dimensional. See AddDescriptorSet for more information about selecting an engine and metric.

We use the embeddings object we created above, which will be used when we add documents to the vectorstore.

from langchain_community.vectorstores import ApertureDB

DESCRIPTOR_SET = 'cohere_wikipedia_2023_11_embed_multilingual_v3'

vectorstore = ApertureDB(
    embeddings=embeddings,
    descriptor_set=DESCRIPTOR_SET,
    dimensions=1024,
    engine="HNSW",
    metric="CS",
    log_level="INFO"
)

Convert from Hugging Face to LangChain

Hugging Face documents are not exactly the same as LangChain documents, so we have to convert them. This will take a few minutes.

from langchain.docstore.document import Document

def hugging_face_document_to_langchain(doc):
    return Document(page_content=doc["text"], metadata={"url": doc["url"], "title": doc["title"], "id": doc["_id"]})


docs = [hugging_face_document_to_langchain(doc) for doc in dataset]
print(len(docs))
10000
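
Each LangChain Document now holds the chunk text as page_content, with the Wikipedia url, title, and id carried as metadata that the vectorstore will keep alongside the embedding:

print(docs[0].page_content[:80])
print(docs[0].metadata)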

Load the documents into the vectorstore

Finally, we come to the part where we load the documents into the vectorstore. Again, this will take a little while to run.

The full process takes a while, so we've restricted it here to a few thousand documents so you can progress through the notebook. You can remove this limit and go for lunch instead.

Once you add the documents, your ApertureDB instance will be hard at work building a high-performance index for them.

ids = vectorstore.add_documents(docs)
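
If you are loading the full dataset rather than a small subset, you may prefer to add the documents in batches so that you can watch progress and keep memory use steady. A rough sketch, with an arbitrary batch size:

BATCH_SIZE = 2000  # arbitrary; tune to taste
ids = []
for start in range(0, len(docs), BATCH_SIZE):
    ids.extend(vectorstore.add_documents(docs[start:start + BATCH_SIZE]))
    print(f"Added {len(ids)} of {len(docs)} documents")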

Let's check out how many documents are in our vectorstore.

import json
print(json.dumps([ d for d in ApertureDB.list_vectorstores() if d['_name'] == DESCRIPTOR_SET ], indent=2))
[
  {
    "_count": 10000,
    "_dimensions": 1024,
    "_engines": [
      "HNSW"
    ],
    "_metrics": [
      "CS"
    ],
    "_name": "cohere_wikipedia_2023_11_embed_multilingual_v3",
    "_uniqueid": "2.1.220"
  }
]
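
We can also run a quick similarity search against the vectorstore. Because our embeddings are precomputed, the query text must be a chunk that is already in the index; here we simply reuse the first chunk of the dataset:

query = dataset[0]["text"]
for doc in vectorstore.similarity_search(query, k=3):
    print(doc.metadata["title"], "-", doc.page_content[:60])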

Tidy up

If you want to tidy up and restore your ApertureDB instance to its previous state, you can delete the vectorstore.

We've deliberately left the next cell non-executable so that you can go on to use your database.

ApertureDB.delete_vectorstore(DESCRIPTOR_SET)

What's next?

Next, you'll want to use this vectorstore to drive a RAG (Retrieval-Augmented Generation) chain.

See Building a RAG Chain from Wikipedia.

Further information