Ingesting a Website into ApertureDB
Introduction
In this notebook, we will demonstrate how to ingest a website's content into ApertureDB in preparation for running a RAG (Retrieval-Augmented Generation) chain.
We will:
- Crawl a website
- Scrape its contents
- Split the text into chunks
- Generate embeddings for each chunk
- Load the documents into ApertureDB
Setup and Installations
Data management with ApertureDB
ApertureDB runs as a database server and can be accessed from any client that can reach it over the network.
Sign up for an Aperture cloud account here (30-day free trial) or see other methods here.
Connection method with ApertureDB
Installation instructions for the various packages needed for this application are as follows:
%pip install --quiet --upgrade aperturedb langchain langchain-community langchainhub scrapy gpt4all
Note: you may need to restart the kernel to use updated packages.
Connect ApertureDB Client and Server
Detailed instructions for configuring your client can be found on this page.
!adb config create --overwrite --active --from-json rag_demo
To confirm that you have connected to the server successfully, let's look at a summary of the database schema.
The first time you do this, you may need to grant permission for this notebook to access your secrets.
from aperturedb.Utils import Utils
from aperturedb.CommonLibrary import create_connector
# Create the connector for ApertureDB
client = create_connector()
# Use the connector to create a Utils object and print the summary
utils = Utils(client)
utils.summary()
================== Summary ==================
Database: ragdemo-3h5g50ie.farm0000.cloud.aperturedata.io
Version: 0.18.3
Status: 0
Info: OK
------------------ Entities -----------------
Total entities types: 2
_Descriptor
Total elements: 1420
String | _label | 1420 (100%)
String | lc_title | 1420 (100%)
String | lc_url | 1420 (100%)
String | text | 1420 (100%)
I String | uniqueid | 1420 (100%)
_DescriptorSet
Total elements: 1
Number | _dimensions | 1 (100%)
I String | _name | 1 (100%)
---------------- Connections ----------------
Total connections types: 1
_DescriptorSetToDescriptor
_DescriptorSet ====> _Descriptor
Total elements: 1420
------------------ Totals -------------------
Total nodes: 1421
Total edges: 1420
=============================================
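If you prefer to inspect the schema programmatically rather than through the printed summary, you can send a raw GetSchema query through the same client. A minimal sketch, assuming the client.query() interface provided by the aperturedb Python package:
# Minimal sketch: fetch the schema as JSON with a raw GetSchema query.
# Assumes `client` is the connector created above via create_connector().
response, _ = client.query([{"GetSchema": {}}])
print(response[0]["GetSchema"])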
Imports
We need to import some modules.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.http import HtmlResponse
from langchain_core.documents import Document
from langchain_community.vectorstores import ApertureDB
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from urllib.parse import urlparse
import argparse
import os
import logging
Crawl the Website
We're going to use the scrapy package to crawl the website for us, adding a small wrapper around it so that it plays well with LangChain.
LangChainSpider
First we create a wrapper for scrapy's CrawlSpider that generates LangChain Documents.
class LangChainSpider(CrawlSpider):
    name = "langchain_spider"
    rules = [Rule(LinkExtractor(), callback='parse', follow=True)]
    start_urls = ["https://docs.aperturedata.io/"]
    _follow_links = True

    def __init__(self, start_url, css_selector=None, **kwargs):
        """LangChain Spider

        Args:
            start_url (str): The URL to start crawling from
            css_selector (str, optional): The CSS selector to use to extract text from the page. Defaults to None.
        """
        super().__init__(**kwargs)
        self.start_urls = (start_url,)
        # Extract the domain from the URL; we only want to crawl the same domain
        self.allowed_domains = list(
            set([urlparse(url).netloc for url in self.start_urls]))
        self.css_selector = css_selector

    @classmethod
    def from_crawler(cls, crawler, **kwargs):
        """Factory method to create a new instance of the spider

        Gets arguments from crawler settings.

        Args:
            crawler (Crawler): The Scrapy Crawler instance

        Returns:
            LangChainSpider: A new instance of the spider
        """
        settings = crawler.settings
        args = settings.get("LANGCHAIN_PIPELINE_ARGS", {})
        spider = cls(start_url=args.start_url,
                     css_selector=args.selector, crawler=crawler, **kwargs)
        return spider

    def parse(self, response):
        """Parse the response from the page and yield a Document

        Args:
            response: The response from the page

        Yields:
            Document: A LangChain document object containing the page content
        """
        if isinstance(response, HtmlResponse):  # Ignore anything that is not HTML
            if self.css_selector:
                elements = response.css(self.css_selector).xpath(".//text()").getall()
            else:
                elements = response.xpath('//body//text()').getall()
            content = "\n".join(elements).strip()
            title = response.css("title::text").get()  # extract the title of the page
            logging.info(f"URL: {response.url}, Title: {title} Content: {len(content)}")
            if content:
                doc = Document(
                    page_content=content,
                    id=response.url,  # Use the URL as the document ID
                    metadata={
                        "url": response.url,
                        "title": title,
                    }
                )
                yield doc
            else:
                logging.warning(f"Empty content for URL: {response.url}")
LangChainPipeline
Now we create a pipeline that's going to be called by the crawler to process those documents. This is the part where we call ApertureDB.
class LangChainPipeline:
    def __init__(self, vectorstore, splitter=None):
        """Crawler pipeline for taking LangChain documents and adding them to a vector store

        Args:
            vectorstore (VectorStore): The vector store to add the documents to
            splitter (function, optional): A function to split the documents into smaller chunks. Defaults to None.
        """
        self.vectorstore = vectorstore
        self.splitter = splitter

    @classmethod
    def from_crawler(cls, crawler):
        """Factory method to create a new instance of the pipeline

        Gets arguments from crawler settings.

        Args:
            crawler (Crawler): The Scrapy Crawler instance

        Returns:
            LangChainPipeline: A new instance of the pipeline
        """
        settings = crawler.settings
        args = settings.get("LANGCHAIN_PIPELINE_ARGS", {})

        # The embeddings are a GPT4All model
        embeddings = GPT4AllEmbeddings(model_name=args.embeddings)
        embeddings_dim = len(embeddings.embed_query("test"))

        # The vector store is an ApertureDB instance
        vectorstore = ApertureDB(descriptor_set=args.descriptorset,
                                 embeddings=embeddings,
                                 dimensions=embeddings_dim)

        # The splitter is a RecursiveCharacterTextSplitter, configured from arguments
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap).split_documents

        return cls(vectorstore=vectorstore, splitter=splitter)

    def process_item(self, doc, spider):
        """Process the document and add it to the vector store

        Args:
            doc (Document): The LangChain document object
            spider (LangChainSpider): The spider that parsed the document
        """
        docs = [doc]
        if self.splitter:
            docs = self.splitter(docs)
            logging.info(f"Splitting document into {len(docs)} chunks")
        self.vectorstore.add_documents(docs)
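To get a feel for what the splitter will do to a page before running the whole crawl, here is a small standalone sketch using the same chunk settings we configure below (the sample document is invented):
# Standalone sketch of the splitting step; the sample document is invented.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample = Document(
    page_content="ApertureDB stores descriptors, entities and connections. " * 30,
    metadata={"url": "https://example.com/", "title": "Sample"},
)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents([sample])
print(f"{len(chunks)} chunks; first chunk has {len(chunks[0].page_content)} characters")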
Configuration
Configure our crawl.
log_level = "INFO"
max_pages = 1000
concurrent_requests_per_domain = 32
concurrent_requests = 64
class Args:
    start_url = "https://docs.aperturedata.io/"
    descriptorset = "test"
    chunk_size = 512
    chunk_overlap = 64
    embeddings = "all-MiniLM-L6-v2.gguf2.f16.gguf"
    selector = ".markdown"

args = Args()
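Before starting a long crawl, it can be worth checking that the embedding model downloads and loads correctly. A minimal sanity check; the dimensionality should come out as 384 for this MiniLM model, matching the _dimensions value we will see in the results:
# Optional sanity check: load the embedding model and report its dimensionality.
from langchain_community.embeddings import GPT4AllEmbeddings

embeddings = GPT4AllEmbeddings(model_name=args.embeddings)
print(len(embeddings.embed_query("hello world")))  # 384 for all-MiniLM-L6-v2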
Do the crawl
Create the crawler
crawler = CrawlerProcess(
    settings={
        "LOG_LEVEL": log_level,
        "ITEM_PIPELINES": {
            LangChainPipeline: 1000,
        },
        "LANGCHAIN_PIPELINE_ARGS": args,
        # Limit number of pages processed (not crawled)
        "CLOSESPIDER_ITEMCOUNT": max_pages,
        'CONCURRENT_REQUESTS_PER_DOMAIN': concurrent_requests_per_domain,
        'CONCURRENT_REQUESTS': concurrent_requests,
    }
)
2024-12-20 17:52:41 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-12-20 17:52:41 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0], pyOpenSSL 24.3.0 (OpenSSL 3.3.1 4 Jun 2024), cryptography 43.0.0, Platform Linux-5.15.0-122-generic-x86_64-with-glibc2.35
And run the crawl. This will take several minutes.
# Delete the vector store before starting
ApertureDB.delete_vectorstore(args.descriptorset)
crawler.crawl(LangChainSpider)
crawler.start()
Results
We can list the vectorstores in our ApertureDB instance.
Notice the _count field for the test vectorstore.
Remember that this is the number of chunks, which will be more than the number of pages crawled.
ApertureDB.list_vectorstores()
2024-12-20 17:57:47 [aperturedb.CommonLibrary] WARNING: Utils.create_connector is deprecated and will be removed in a future release. Use CommonLibrary.create_connector instead.
2024-12-20 17:57:47 [aperturedb.CommonLibrary] INFO: Using active configuration 'rag_demo'
2024-12-20 17:57:47 [aperturedb.CommonLibrary] INFO: Configuration: [ragdemo-3h5g50ie.farm0000.cloud.aperturedata.io:55555 as admin using TCP with SSL=True]
[{'_count': 1850,
'_dimensions': 384,
'_engines': ['HNSW'],
'_metrics': ['CS'],
'_name': 'test',
'_uniqueid': '2.0.17280'}]
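As a final check that the ingestion worked end to end, we can re-open the vectorstore with the same embedding model and run a quick similarity search. A minimal sketch; the query string is just an example:
# Re-open the vectorstore and run a quick similarity search as a sanity check.
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import ApertureDB

embeddings = GPT4AllEmbeddings(model_name=args.embeddings)
vectorstore = ApertureDB(descriptor_set=args.descriptorset,
                         embeddings=embeddings,
                         dimensions=len(embeddings.embed_query("test")))
for doc in vectorstore.similarity_search("How do I connect to ApertureDB?", k=3):
    print(doc.metadata, doc.page_content[:80])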
What's next?
Next, you will want to use this vectorstore to drive a RAG (Retrieval-Augmented Generation) chain.
See Building a RAG Chain from a Website.