
Ingesting a Website into ApertureDB


This notebook shows how to take web content and load it into ApertureDB so that it can be used in a RAG chain to answer questions.

First we need to install a few libraries.

%pip install --quiet --upgrade aperturedb langchain langchain-community langchainhub gpt-web-crawler Twisted gpt4all

Crawl the Website

We're going to use the gpt-web-crawler package to crawl a website for us.

First we grab the default configuration file. This is where you can insert API keys for advanced services.

!wget https://raw.githubusercontent.com/Tim-Saijun/gpt-web-crawler/refs/heads/main/config_template.py -O config.py

Now we do the actual crawl. We've configured this to point to our documentation website, but feel free to change the starting URL.

START_URLS = "https://docs.aperturedata.io/"
MAX_PAGES = 1000
OUTPUT_FILE = "output.json"

# Delete the output file if it exists
import os
if os.path.exists(OUTPUT_FILE):
    os.remove(OUTPUT_FILE)

from gpt_web_crawler import run_spider, NoobSpider

run_spider(NoobSpider,
           max_page_count=MAX_PAGES,
           start_urls=START_URLS,
           output_file=OUTPUT_FILE,
           extract_rules=r'.*')
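
When the crawl completes, the output file holds a JSON array with one record per crawled page. As a quick check, you can count the pages (a minimal sketch reusing the constants above):

# Count the pages captured by the crawl
import json

with open(OUTPUT_FILE) as f:
    pages = json.load(f)
print(f"Crawled {len(pages)} pages")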

Create Documents

Now we load the website crawl and turn it into LangChain documents.

from langchain_core.documents import Document
import json


with open("output.json") as f:
    data = json.load(f)

documents = [
    Document(
        page_content=d['body'],
        id=d['url'],
        metadata={
            'title': d['title'],
            'keywords': d['keywords'],
            'description': d['description'],
            'url': d['url'],
        },
    ) for d in data
]
print(len(documents))
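
As a quick sanity check, we can peek at the first document (this assumes the crawl captured at least one page):

# Inspect the first document's metadata and the start of its body
doc = documents[0]
print(doc.metadata['title'], '-', doc.metadata['url'])
print(doc.page_content[:200])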

Split Documents into Segments

Generally a web page is too large and diverse to be useful in a RAG chain. Instead we break the document up into segments. LangChain provides support for this.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=64,
)

segments = text_splitter.split_documents(documents)
print(len(segments))
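
Each segment inherits the metadata of its parent document, so a segment can always be traced back to its source URL. For example (assuming the splitter produced at least one segment):

# A segment carries the parent document's metadata, including the source URL
seg = segments[0]
print(seg.metadata['url'])
print(seg.page_content)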

Choose an Embedding

Here we're using the GPT4All package and loading one of its smaller models. Don't worry if you see messages about CUDA libraries being unavailable.

from langchain_community.embeddings import GPT4AllEmbeddings

embeddings = GPT4AllEmbeddings(model_name="all-MiniLM-L6-v2.gguf2.f16.gguf")
embeddings_dim = len(embeddings.embed_query("test"))
print(f"Embeddings dimension: {embeddings_dim}")

Connect to ApertureDB

For the next part, we need access to a specific ApertureDB instance. There are several ways to set this up. The code provided here will accept ApertureDB connection information as a JSON string. See our Configuration help page for more options.

! adb config create --from-json --active 
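
If you'd like to confirm that the configuration works before proceeding, here is a minimal sketch using the ApertureDB Python client's create_connector helper (which picks up the active configuration) and the GetStatus query:

# Sanity-check the connection using the active configuration
from aperturedb.CommonLibrary import create_connector

client = create_connector()
response, _ = client.query([{"GetStatus": {}}])
print(response)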

Here we create a LangChain vectorstore using ApertureDB. We use the default client configuration that we have already set up.

If you want to create more than one version of the embeddings, then change the DESCRIPTOR_SET name.

See AddDescriptorSet for more information about selecting an engine and metric.

We use the embeddings object we created above, which will be used when we add documents to the vectorstore.

from langchain_community.vectorstores import ApertureDB

DESCRIPTOR_SET = 'my_website'

vectorstore = ApertureDB(
    embeddings=embeddings,
    descriptor_set=DESCRIPTOR_SET,
    dimensions=embeddings_dim,
    engine="HNSW",
    metric="CS",
    log_level="INFO",
)

Load the documents into the vectorstore

Finally, we come to the part where we load the documents into the vectorstore. Again, this will take a little while to run.

The full process takes a while, so you may want to restrict the load to a few thousand segments while you progress through the notebook; the batched sketch after the next cell shows one way to do that. Or load everything and go for lunch instead.

Once you add the documents, your ApertureDB instance will be hard at work building a high-performance index for them.

ids = vectorstore.add_documents(segments)
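
If you'd rather cap the total or watch progress, one alternative is to add the segments in batches; LIMIT and BATCH below are illustrative values, not part of the original notebook:

# Optional alternative: load at most LIMIT segments, BATCH at a time
LIMIT = 2000   # illustrative cap on the number of segments
BATCH = 500    # illustrative batch size
for i in range(0, min(LIMIT, len(segments)), BATCH):
    batch = segments[i:i + BATCH]
    vectorstore.add_documents(batch)
    print(f"Loaded {i + len(batch)} segments")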

Let's check out how many documents are in our vectorstore.

import json
print(json.dumps([ d for d in ApertureDB.list_vectorstores() if d['_name'] == DESCRIPTOR_SET ], indent=2))
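
Before building a full RAG chain, you can also try a quick similarity search directly against the vectorstore (the query string is just an example):

# Retrieve the segments most similar to an example query
results = vectorstore.similarity_search("How do I install ApertureDB?", k=4)
for r in results:
    print(r.metadata['url'])
    print(r.page_content[:100])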

Tidy up

If you want to tidy up and restore your ApertureDB instance to its previous state, you can delete the vectorstore.

We've deliberately left the next cell non-executable so that you can go on to use your database.

ApertureDB.delete_vectorstore(DESCRIPTOR_SET)

What's next?

Next, you'll want to use this vectorstore to drive a RAG (Retrieval-Augmented Generation) chain.

See Building a RAG Chain from a Website.

Further information