Bulk Embedding Ingestion

Load embeddings at scale using ApertureDB's ParallelLoader. This notebook downloads the Cookbook dataset (20 dishes), generates text embeddings with sentence-transformers, and ingests them in parallel.

Connect to ApertureDB

Option A: ApertureDB Cloud (recommended)
Sign up for a free 30-day trial. Get your key from Connect > Generate API Key and add it to a .env file in this directory:

APERTUREDB_KEY=your_key_here

Option B: Community Edition (local Docker)
Run this in a terminal before starting the notebook:

docker run -d --name aperturedb \
  -p 55555:55555 -e ADB_MASTER_KEY=admin -e ADB_FORCE_SSL=false \
  aperturedata/aperturedb-community

See client configuration options for all connection methods and server setup options for deployment choices.

%pip install --upgrade --quiet aperturedb python-dotenv sentence-transformers pandas
# Option A: ApertureDB Cloud
from dotenv import load_dotenv
load_dotenv() # loads APERTUREDB_KEY from .env into the environment
True
# Option B: Community Edition (local Docker)
# !adb config create localdb --active \
#     --host localhost --port 55555 \
#     --username admin --password admin \
#     --no-use-ssl --no-interactive
from aperturedb.CommonLibrary import create_connector

client = create_connector()
response, _ = client.query([{"GetStatus": {}}])
client.print_last_response()
[
  {
    "GetStatus": {
      "info": "OK",
      "status": 0,
      "system": "ApertureDB",
      "version": "0.19.6"
    }
  }
]
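A quick way to turn this into a programmatic health check is to read the status code out of the response instead of printing it. A minimal sketch, using the response structure shown above (hardcoded here so it runs without a server; status 0 indicates success):

```python
# Response structure copied from the GetStatus output above.
response = [{"GetStatus": {"info": "OK", "status": 0,
                           "system": "ApertureDB", "version": "0.19.6"}}]

status = response[0]["GetStatus"]
assert status["status"] == 0, "connection check failed"
print(f"Connected to {status['system']} {status['version']}")
# → Connected to ApertureDB 0.19.6
```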

Load Dataset and Generate Embeddings

We combine dish_name and caption into a single description, then embed with all-MiniLM-L6-v2 (384-dimensional, CPU-friendly).

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

dishes = pd.read_csv(
    "https://raw.githubusercontent.com/aperture-data/Cookbook/refs/heads/main/images.adb.csv"
)
dishes["description"] = dishes["dish_name"] + " - " + dishes["caption"]
print(f"Loaded {len(dishes)} dishes")

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(dishes["description"].tolist(), normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")
Loaded 20 dishes
Embedding shape: (20, 384)
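Passing normalize_embeddings=True unit-normalizes each vector, so the dot product of two embeddings equals their cosine similarity, which is what the CS metric chosen below measures. A minimal numpy sketch of that property, with random vectors standing in for real embeddings:

```python
import numpy as np

# Random vectors stand in for real embeddings; unit-normalizing them
# (as normalize_embeddings=True does) makes dot product == cosine similarity.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 384)).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

sims = vecs @ vecs.T  # pairwise cosine similarities
assert np.allclose(np.diag(sims), 1.0, atol=1e-5)  # self-similarity is exactly 1
assert np.all(sims <= 1.0 + 1e-5)                  # cosine similarity never exceeds 1
```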

Create the DescriptorSet

SET_NAME = "cookbook_bulk"

client.query([{
    "AddDescriptorSet": {
        "name": SET_NAME,
        "dimensions": 384,
        "engine": "HNSW",
        "metric": "CS",
    }
}])
client.print_last_response()
[
  {
    "AddDescriptorSet": {
      "status": 0
    }
  }
]
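The set's dimensions must match the embedding width exactly, or descriptor insertions will fail. A cheap guard worth running before ingesting, sketched here with a placeholder array standing in for the real embedding matrix:

```python
import numpy as np

DIMS = 384  # must equal the "dimensions" used in AddDescriptorSet
embeddings = np.zeros((20, DIMS), dtype="float32")  # placeholder for the real matrix

assert embeddings.shape[1] == DIMS, (
    f"embedding width {embeddings.shape[1]} != set dimensions {DIMS}"
)
assert embeddings.dtype == np.float32  # descriptor blobs are sent as raw float32
```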

Bulk Ingest with ParallelLoader

ParallelLoader ingests data concurrently. It takes any Subscriptable — here we build a simple list of (query, blobs) pairs.
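Each blob in a pair is the raw bytes of one float32 vector: 384 floats become 1536 bytes, and numpy can round-trip them losslessly. A small sketch of that packing, using a stand-in vector:

```python
import numpy as np

emb = np.arange(384, dtype="float32")  # stand-in for one embedding
blob = emb.tobytes()                   # the blob sent alongside each AddDescriptor

assert len(blob) == 384 * 4            # 4 bytes per float32
restored = np.frombuffer(blob, dtype="float32")
assert np.array_equal(restored, emb)   # lossless round trip
```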

from aperturedb.ParallelLoader import ParallelLoader

class DescriptorGenerator:
    """Subscriptable generator of (query, blobs) pairs for ParallelLoader."""

    def __init__(self, dishes, embeddings, set_name):
        self.dishes = dishes
        self.embeddings = embeddings
        self.set_name = set_name

    def __len__(self):
        return len(self.dishes)

    def __getitem__(self, idx):
        # ParallelLoader calls __getitem__ with a slice for each batch;
        # return a list of (query, blobs) pairs in that case.
        if isinstance(idx, slice):
            return [self[i] for i in range(*idx.indices(len(self)))]
        row = self.dishes.iloc[idx]
        emb = self.embeddings[idx].astype("float32")
        query = [{
            "AddDescriptor": {
                "set": self.set_name,
                "properties": {
                    "dish_name": row["dish_name"],
                    "cuisine": row["food_tags"],
                    "caption": row["caption"],
                },
                "if_not_found": {"dish_name": ["==", row["dish_name"]]},
            }
        }]
        return query, [emb.tobytes()]

generator = DescriptorGenerator(dishes, embeddings, SET_NAME)
loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)
Progress: 100%|██████████| 20.0/20.0 [00:02<00:00, 9.95items/s]

============ ApertureDB Loader Stats ============
Total time (s): 2.011025905609131
Total queries executed: 4
Avg Query time (s): 1.3486077785491943
Query time std: 0.12229802807468174
Avg Query Throughput (q/s): 2.966021747481778
Overall insertion throughput (element/s): 9.945172732094711
Total inserted elements: 20
Total successful commands: 20
=================================================
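The slice handling in __getitem__ above is the whole batching contract: ParallelLoader indexes the generator with slices and expects a list back. The pattern can be seen in isolation with a toy subscriptable (MiniGen is hypothetical, purely for illustration):

```python
class MiniGen:
    """Toy subscriptable showing the slice-batching pattern ParallelLoader relies on."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        if isinstance(idx, slice):  # a batch request: expand into single items
            return [self[i] for i in range(*idx.indices(len(self)))]
        return self.items[idx]

g = MiniGen(list("abcdefgh"))
print(g[0:5])   # one batch of 5 → ['a', 'b', 'c', 'd', 'e']
print(g[5:10])  # slice.indices clamps past the end → ['f', 'g', 'h']
```

slice.indices(len(self)) is what keeps the final, short batch from raising an IndexError.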

Verify the Ingestion

response, _ = client.query([{
    "FindDescriptorSet": {
        "with_name": SET_NAME,
        "results": {"count": True},
    }
}])
client.print_last_response()
[
  {
    "FindDescriptorSet": {
      "count": 1,
      "returned": 0,
      "status": 0
    }
  }
]

Search the Bulk-Loaded Descriptors

query_text = "creamy tomato curry"
query_emb = model.encode([query_text], normalize_embeddings=True)[0].astype("float32")

response, _ = client.query([{
    "FindDescriptor": {
        "set": SET_NAME,
        "k_neighbors": 3,
        "distances": True,
        "results": {"all_properties": True},
    }
}], [query_emb.tobytes()])

for entity in response[0]["FindDescriptor"].get("entities", []):
    score = 1 - entity["_distance"]
    print(f"  {entity['dish_name']:<30} [{entity['cuisine']}] score={score:.3f}")
  Butter chicken                 [Indian] score=0.423
  paneer bhurji                  [Indian] score=0.567
  waffle, smoothie               [American] score=0.573
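The score column is just 1 - _distance. Assuming that distance convention, the printed values can be reproduced from the raw distances (hardcoded here to match the output above, so this runs without a server):

```python
# _distance values implied by the scores printed above (score = 1 - distance)
distances = {
    "Butter chicken": 0.577,
    "paneer bhurji": 0.433,
    "waffle, smoothie": 0.427,
}

scores = {name: round(1 - d, 3) for name, d in distances.items()}
print(scores)
# → {'Butter chicken': 0.423, 'paneer bhurji': 0.567, 'waffle, smoothie': 0.573}
```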

Cleanup

client.query([{"DeleteDescriptorSet": {"with_name": SET_NAME}}])
client.print_last_response()
[
  {
    "DeleteDescriptorSet": {
      "count": 1,
      "status": 0
    }
  }
]

What's Next