Ingesting a Website into ApertureDB
Introduction
In this notebook, we will demonstrate how to ingest a website's content into ApertureDB in preparation for running a RAG (Retrieval-Augmented Generation) chain.
We will:
- Crawl a website
- Scrape its contents
- Split the text into chunks
- Generate embeddings for each chunk
- Load the documents into ApertureDB
Setup and Installations
Data management with ApertureDB
ApertureDB runs as a database server and can be accessed from any client that can reach it over the network.
Sign up for an Aperture cloud account here (30-day free trial) or see other methods here.
Connection method with ApertureDB
Installation instructions for the various packages needed for this application are as follows:
%pip install --quiet --upgrade aperturedb langchain langchain-community langchainhub scrapy gpt4all
Note: you may need to restart the kernel to use updated packages.
Connect ApertureDB Client and Server
Detailed instructions for configuring your client can be found on this page.
!adb config create --overwrite --active --from-json rag_demo
To confirm that you have connected to the server successfully, let's look at a summary of the database schema.
The first time you do this, you may need to grant permission for this notebook to access your secrets.
from aperturedb.Utils import Utils
from aperturedb.CommonLibrary import create_connector
# Create the connector for ApertureDB
client = create_connector()
# Use the connector to create a Utils object and print the summary
utils = Utils(client)
utils.summary()
================== Summary ==================
Database: ragdemo-3h5g50ie.farm0000.cloud.aperturedata.io
Version: 0.18.3
Status: 0
Info: OK
------------------ Entities -----------------
Total entities types: 2
_Descriptor
Total elements: 1420
String | _label | 1420 (100%)
String | lc_title | 1420 (100%)
String | lc_url | 1420 (100%)
String | text | 1420 (100%)
I String | uniqueid | 1420 (100%)
_DescriptorSet
Total elements: 1
Number | _dimensions | 1 (100%)
I String | _name | 1 (100%)
---------------- Connections ----------------
Total connections types: 1
_DescriptorSetToDescriptor
_DescriptorSet ====> _Descriptor
Total elements: 1420
------------------ Totals -------------------
Total nodes: 1421
Total edges: 1420
=============================================
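If you prefer to inspect the schema programmatically rather than through the printed summary, you can send a raw GetSchema query through the same client. A minimal sketch, assuming the client.query() interface provided by the aperturedb Python package:
# Minimal sketch: fetch the schema as JSON with a raw GetSchema query.
# Assumes `client` is the connector created above via create_connector().
response, _ = client.query([{"GetSchema": {}}])
print(response[0]["GetSchema"])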
Imports
We need to import some modules.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.http import HtmlResponse
from langchain_core.documents import Document
from langchain_community.vectorstores import ApertureDB
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from urllib.parse import urlparse
import argparse
import os
import logging
Crawl the Website
We're going to use the scrapy package to crawl the website for us, adding a small wrapper around it so that it plays well with LangChain.
LangChainSpider
First we create a wrapper for scrapy's CrawlSpider that generates LangChain Documents.
class LangChainSpider(CrawlSpider):
    name = "langchain_spider"
    rules = [Rule(LinkExtractor(), callback='parse', follow=True)]
    start_urls = ["https://docs.aperturedata.io/"]
    _follow_links = True

    def __init__(self, start_url, css_selector=None, **kwargs):
        """LangChain Spider

        Args:
            start_url (str): The URL to start crawling from
            css_selector (str, optional): The CSS selector to use to extract text from the page. Defaults to None.
        """
        super().__init__(**kwargs)
        self.start_urls = (start_url,)
        # Extract the domain from the URL; we only want to crawl the same domain
        self.allowed_domains = list(
            set([urlparse(url).netloc for url in self.start_urls]))
        self.css_selector = css_selector

    @classmethod
    def from_crawler(cls, crawler, **kwargs):
        """Factory method to create a new instance of the spider

        Gets arguments from crawler settings.

        Args:
            crawler (Crawler): The Scrapy Crawler instance

        Returns:
            LangChainSpider: A new instance of the spider
        """
        settings = crawler.settings
        args = settings.get("LANGCHAIN_PIPELINE_ARGS", {})
        spider = cls(start_url=args.start_url,
                     css_selector=args.selector, crawler=crawler, **kwargs)
        return spider

    def parse(self, response):
        """Parse the response from the page and yield a Document

        Args:
            response: The response from the page

        Yields:
            Document: A LangChain document object containing the page content
        """
        if isinstance(response, HtmlResponse):  # Ignore anything that is not HTML
            if self.css_selector:
                elements = response.css(self.css_selector).xpath(".//text()").getall()
            else:
                elements = response.xpath('//body//text()').getall()
            content = "\n".join(elements).strip()
            title = response.css("title::text").get()  # extract the title of the page
            logging.info(f"URL: {response.url}, Title: {title} Content: {len(content)}")
            if content:
                doc = Document(
                    page_content=content,
                    id=response.url,  # Use the URL as the document ID
                    metadata={
                        "url": response.url,
                        "title": title,
                    }
                )
                yield doc
            else:
                logging.warning(f"Empty content for URL: {response.url}")
LangChainPipeline
Now we create a pipeline that's going to be called by the crawler to process those documents. This is the part where we call ApertureDB.
class LangChainPipeline:
    def __init__(self, vectorstore, splitter=None):
        """Crawler pipeline for taking LangChain documents and adding them to a vector store

        Args:
            vectorstore (VectorStore): The vector store to add the documents to
            splitter (function, optional): A function to split the documents into smaller chunks. Defaults to None.
        """
        self.vectorstore = vectorstore
        self.splitter = splitter

    @classmethod
    def from_crawler(cls, crawler):
        """Factory method to create a new instance of the pipeline

        Gets arguments from crawler settings.

        Args:
            crawler (Crawler): The Scrapy Crawler instance

        Returns:
            LangChainPipeline: A new instance of the pipeline
        """
        settings = crawler.settings
        args = settings.get("LANGCHAIN_PIPELINE_ARGS", {})

        # The embeddings are a GPT4All model
        embeddings = GPT4AllEmbeddings(model_name=args.embeddings)
        embeddings_dim = len(embeddings.embed_query("test"))

        # The vector store is an ApertureDB instance
        vectorstore = ApertureDB(descriptor_set=args.descriptorset,
                                 embeddings=embeddings,
                                 dimensions=embeddings_dim)

        # The splitter is a RecursiveCharacterTextSplitter, configured from arguments
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap).split_documents

        return cls(vectorstore=vectorstore, splitter=splitter)

    def process_item(self, doc, spider):
        """Process the document and add it to the vector store

        Args:
            doc (Document): The LangChain document object
            spider (LangChainSpider): The spider that parsed the document
        """
        docs = [doc]
        if self.splitter:
            docs = self.splitter(docs)
            logging.info(f"Splitting document into {len(docs)} chunks")
        self.vectorstore.add_documents(docs)
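To get a feel for what the splitter will do to a page before running the whole crawl, here is a small standalone sketch using the same chunk settings we configure below (the sample document is invented):
# Standalone sketch of the splitting step; the sample document is invented.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample = Document(
    page_content="ApertureDB stores descriptors, entities and connections. " * 30,
    metadata={"url": "https://example.com/", "title": "Sample"},
)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents([sample])
print(f"{len(chunks)} chunks; first chunk has {len(chunks[0].page_content)} characters")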
Configuration
Configure our crawl.
log_level = "INFO"
max_pages = 1000
concurrent_requests_per_domain = 32
concurrent_requests = 64
class Args:
    start_url = "https://docs.aperturedata.io/"
    descriptorset = "test"
    chunk_size = 512
    chunk_overlap = 64
    embeddings = "all-MiniLM-L6-v2.gguf2.f16.gguf"
    selector = ".markdown"

args = Args()
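Before starting a long crawl, it can be worth checking that the embedding model downloads and loads correctly. A minimal sanity check; the dimensionality should come out as 384 for this MiniLM model, matching the _dimensions value we will see in the results:
# Optional sanity check: load the embedding model and report its dimensionality.
from langchain_community.embeddings import GPT4AllEmbeddings

embeddings = GPT4AllEmbeddings(model_name=args.embeddings)
print(len(embeddings.embed_query("hello world")))  # 384 for all-MiniLM-L6-v2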
Do the crawl
Create the crawler
crawler = CrawlerProcess(
    settings={
        "LOG_LEVEL": log_level,
        "ITEM_PIPELINES": {
            LangChainPipeline: 1000,
        },
        "LANGCHAIN_PIPELINE_ARGS": args,
        # Limit number of pages processed (not crawled)
        "CLOSESPIDER_ITEMCOUNT": max_pages,
        'CONCURRENT_REQUESTS_PER_DOMAIN': concurrent_requests_per_domain,
        'CONCURRENT_REQUESTS': concurrent_requests,
    }
)
2024-12-20 17:52:41 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-12-20 17:52:41 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0], pyOpenSSL 24.3.0 (OpenSSL 3.3.1 4 Jun 2024), cryptography 43.0.0, Platform Linux-5.15.0-122-generic-x86_64-with-glibc2.35
And run the crawl. This will take several minutes.
# Delete the vector store before starting
ApertureDB.delete_vectorstore(args.descriptorset)
crawler.crawl(LangChainSpider)
crawler.start()
Results
We can list the vectorstores in our ApertureDB instance.
Notice the _count field for the test vectorstore.
Remember that this is the number of chunks, which will be more than the number of pages crawled.
ApertureDB.list_vectorstores()
2024-12-20 17:57:47 [aperturedb.CommonLibrary] WARNING: Utils.create_connector is deprecated and will be removed in a future release. Use CommonLibrary.create_connector instead.
2024-12-20 17:57:47 [aperturedb.CommonLibrary] INFO: Using active configuration 'rag_demo'
2024-12-20 17:57:47 [aperturedb.CommonLibrary] INFO: Configuration: [ragdemo-3h5g50ie.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]
[{'_count': 1850,
'_dimensions': 384,
'_engines': ['HNSW'],
'_metrics': ['CS'],
'_name': 'test',
'_uniqueid': '2.0.17280'}]
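As a final check that the ingestion worked end to end, we can re-open the vectorstore with the same embedding model and run a quick similarity search. A minimal sketch; the query string is just an example:
# Re-open the vectorstore and run a quick similarity search as a sanity check.
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import ApertureDB

embeddings = GPT4AllEmbeddings(model_name=args.embeddings)
vectorstore = ApertureDB(descriptor_set=args.descriptorset,
                         embeddings=embeddings,
                         dimensions=len(embeddings.embed_query("test")))
for doc in vectorstore.similarity_search("How do I connect to ApertureDB?", k=3):
    print(doc.metadata, doc.page_content[:80])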
What's next?
Next, you will want to use this vectorstore to drive a RAG (Retrieval-Augmented Generation) chain.
See Building a RAG Chain from a Website.