Ingesting a Website into ApertureDB
This notebook shows how to take web content and load it into ApertureDB so that it can be used in a RAG chain to answer questions.
First we need to install a few libraries.
%pip install --quiet --upgrade aperturedb langchain langchain-community langchainhub gpt-web-crawler Twisted gpt4all
Crawl the Website
We're going to use the gpt-web-crawler package to crawl a website for us.
First we grab the default configuration file. This is where you can insert API keys for advanced services.
!wget https://raw.githubusercontent.com/Tim-Saijun/gpt-web-crawler/refs/heads/main/config_template.py -O config.py
Now we do the actual crawl. We've configured this to point to our documentation website, but feel free to change the starting URL.
START_URLS = "https://docs.aperturedata.io/"
MAX_PAGES = 1000
OUTPUT_FILE = "output.json"
# Delete the output file if it exists
import os
if os.path.exists(OUTPUT_FILE):
    os.remove(OUTPUT_FILE)

from gpt_web_crawler import run_spider, NoobSpider

run_spider(NoobSpider,
           max_page_count=MAX_PAGES,
           start_urls=START_URLS,
           output_file=OUTPUT_FILE,
           extract_rules=r'.*')
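Before moving on, it can be worth sanity-checking the crawl output. Here is a minimal sketch, assuming (as the loading code in the next section does) that each record carries `url`, `title`, `keywords`, `description`, and `body` fields; the inline sample stands in for the contents of `output.json`:

```python
REQUIRED_FIELDS = {"url", "title", "keywords", "description", "body"}

def incomplete_records(records):
    """Return the records missing any of the fields we rely on later."""
    return [r for r in records if not REQUIRED_FIELDS <= r.keys()]

# Inline sample standing in for the parsed contents of output.json
sample = [
    {"url": "https://docs.aperturedata.io/", "title": "Docs",
     "keywords": "db", "description": "Home page", "body": "Welcome..."},
    {"url": "https://docs.aperturedata.io/broken", "title": "Oops"},
]

bad = incomplete_records(sample)
print(f"{len(bad)} of {len(sample)} records are incomplete")
```

Records flagged here would fail with a `KeyError` in the document-building step below, so it's cheaper to catch them early.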
Create Documents
Now we load the website crawl and turn it into LangChain documents.
from langchain_core.documents import Document
import json
with open("output.json") as f:
    data = json.load(f)

documents = [
    Document(
        page_content=d['body'],
        id=d['url'],
        metadata={
            'title': d['title'],
            'keywords': d['keywords'],
            'description': d['description'],
            'url': d['url'],
        },
    )
    for d in data
]
print(len(documents))
Split Documents into Segments
Generally, a whole web page is too large and diverse to be a useful unit of retrieval in a RAG chain, so we break each document into smaller segments. LangChain provides support for this.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=64,
)
segments = text_splitter.split_documents(documents)
print(len(segments))
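To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window version of the idea. The real `RecursiveCharacterTextSplitter` additionally prefers to break on paragraph, line, and word boundaries, so this is only a sketch of the windowing behaviour:

```python
def fixed_window_split(text, chunk_size=256, chunk_overlap=64):
    """Simplified splitter: fixed-size windows, each overlapping the next by chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    # Stop before the final overlap region so we don't emit a redundant tail chunk
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = fixed_window_split("x" * 600, chunk_size=256, chunk_overlap=64)
print([len(c) for c in chunks])  # → [256, 256, 216]
```

The overlap means the end of each chunk is repeated at the start of the next, so a sentence that straddles a boundary is still retrievable in full from at least one segment.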
Choose an Embedding
Here we're using the GPT4All package and loading one of its smaller models. Don't worry if you see messages about CUDA libraries being unavailable.
from langchain_community.embeddings import GPT4AllEmbeddings
embeddings = GPT4AllEmbeddings(model_name="all-MiniLM-L6-v2.gguf2.f16.gguf")
embeddings_dim = len(embeddings.embed_query("test"))
print(f"Embeddings dimension: {embeddings_dim}")
Connect to ApertureDB
For the next part, we need access to a specific ApertureDB instance. There are several ways to set this up. The code provided here will accept ApertureDB connection information as a JSON string. See our Configuration help page for more options.
! adb config create --from-json --active
Here we create a LangChain vectorstore using ApertureDB. We use the default client configuration that we have already set up.
If you want to create more than one version of the embeddings, then change the DESCRIPTOR_SET name.
See AddDescriptorSet for more information about selecting an engine and metric.
We use the embeddings object we created above, which will be used when we add documents to the vectorstore.
from langchain_community.vectorstores import ApertureDB
DESCRIPTOR_SET = 'my_website'
vectorstore = ApertureDB(
    embeddings=embeddings,
    descriptor_set=DESCRIPTOR_SET,
    dimensions=embeddings_dim,
    engine="HNSW",
    metric="CS",
    log_level="INFO",
)
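The `metric="CS"` above selects cosine similarity. As a reminder of what that computes, here is a minimal sketch using two small made-up vectors; real embeddings in this notebook have `embeddings_dim` components:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)
```

Because cosine similarity ignores vector magnitude, it compares the *direction* of two embeddings, which is usually what you want for text similarity.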
Load the documents into the vectorstore
Finally, we come to the part where we load the documents into the vectorstore. The full process takes a while, so if you just want to progress through the notebook, pass a slice such as `segments[:2000]` to `add_documents` instead of the whole list; otherwise ingest everything and go for lunch instead.
Once you add the documents, your ApertureDB instance will be hard at work building a high-performance index for them.
ids = vectorstore.add_documents(segments)
Let's check out how many documents are in our vectorstore.
import json
print(json.dumps([ d for d in ApertureDB.list_vectorstores() if d['_name'] == DESCRIPTOR_SET ], indent=2))
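Conceptually, what the HNSW index accelerates is nearest-neighbour search: embed the query, then rank the stored segment embeddings by cosine similarity. Here is a brute-force sketch with tiny made-up vectors; the names are illustrative, not the vectorstore's API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "index": segment id -> embedding (real ones come from the embeddings model)
index = {
    "install-page": [0.9, 0.1],
    "query-page":   [0.1, 0.9],
    "intro-page":   [0.7, 0.7],
}

def top_k(query_vec, k=2):
    """Score every stored vector against the query; HNSW avoids this full scan."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

print(top_k([1.0, 0.0]))  # → ['install-page', 'intro-page']
```

A brute-force scan is fine for a handful of vectors, but for thousands of segments the HNSW graph lets ApertureDB find approximate nearest neighbours without touching every stored embedding.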
Tidy up
If you want to tidy up and restore your ApertureDB instance to its previous state, you can delete the vectorstore.
We've deliberately left this next cell non-executable so that you can go on to use your database.
ApertureDB.delete_vectorstore(DESCRIPTOR_SET)
What's next?
Next, you'll want to use this vectorstore to drive a RAG (Retrieval-Augmented Generation) chain.
See Building a RAG Chain from a Website.