Bulk Embedding Ingestion

Load embeddings at scale using ApertureDB's ParallelLoader. This notebook downloads the Cookbook dataset (20 dishes), generates text embeddings with sentence-transformers, and ingests them in parallel.

Connect to ApertureDB

Option A: ApertureDB Cloud (recommended)
Sign up for a free 30-day trial. Get your key from Connect > Generate API Key and add it to a .env file in this directory:

APERTUREDB_KEY=your_key_here

Option B: Community Edition (local Docker)
Run this in a terminal before starting the notebook:

docker run -d --name aperturedb \
  -p 55555:55555 -e ADB_MASTER_KEY=admin -e ADB_FORCE_SSL=false \
  aperturedata/aperturedb-community

See client configuration options for all connection methods and server setup options for deployment choices.

%pip install --upgrade --quiet aperturedb python-dotenv sentence-transformers pandas
# Option A: ApertureDB Cloud
from dotenv import load_dotenv
load_dotenv() # loads APERTUREDB_KEY from .env into the environment
True
# Option B: Community Edition (local Docker)
# !adb config create localdb --active \
#     --host localhost --port 55555 \
#     --username admin --password admin \
#     --no-use-ssl --no-interactive
from aperturedb.CommonLibrary import create_connector

client = create_connector()
response, _ = client.query([{"GetStatus": {}}])
client.print_last_response()
[
  {
    "GetStatus": {
      "info": "OK",
      "status": 0,
      "system": "ApertureDB",
      "version": "0.19.6"
    }
  }
]
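A quick way to turn this into a programmatic health check is to read the status code out of the response instead of printing it. A minimal sketch, using the response structure shown above (hardcoded here so it runs without a server; status 0 indicates success):

```python
# Response structure copied from the GetStatus output above.
response = [{"GetStatus": {"info": "OK", "status": 0,
                           "system": "ApertureDB", "version": "0.19.6"}}]

status = response[0]["GetStatus"]
assert status["status"] == 0, "connection check failed"
print(f"Connected to {status['system']} {status['version']}")
# → Connected to ApertureDB 0.19.6
```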

Load Dataset and Generate Embeddings

We combine dish_name and caption into a single description, then embed with all-MiniLM-L6-v2 (384-dimensional, CPU-friendly).

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

dishes = pd.read_csv(
    "https://raw.githubusercontent.com/aperture-data/Cookbook/refs/heads/main/images.adb.csv"
)
dishes["description"] = dishes["dish_name"] + " - " + dishes["caption"]
print(f"Loaded {len(dishes)} dishes")

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(dishes["description"].tolist(), normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")
Loaded 20 dishes
Embedding shape: (20, 384)
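Passing normalize_embeddings=True unit-normalizes each vector, so the dot product of two embeddings equals their cosine similarity, which is what the CS metric chosen below measures. A minimal numpy sketch of that property, with random vectors standing in for real embeddings:

```python
import numpy as np

# Random vectors stand in for real embeddings; unit-normalizing them
# (as normalize_embeddings=True does) makes dot product == cosine similarity.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 384)).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

sims = vecs @ vecs.T  # pairwise cosine similarities
assert np.allclose(np.diag(sims), 1.0, atol=1e-5)  # self-similarity is exactly 1
assert np.all(sims <= 1.0 + 1e-5)                  # cosine similarity never exceeds 1
```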

Create the DescriptorSet

SET_NAME = "cookbook_bulk"

client.query([{
    "AddDescriptorSet": {
        "name": SET_NAME,
        "dimensions": 384,
        "engine": "HNSW",
        "metric": "CS",
    }
}])
client.print_last_response()
[
  {
    "AddDescriptorSet": {
      "status": 0
    }
  }
]
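The set's dimensions must match the embedding width exactly, or descriptor insertions will fail. A cheap guard worth running before ingesting, sketched here with a placeholder array standing in for the real embedding matrix:

```python
import numpy as np

DIMS = 384  # must equal the "dimensions" used in AddDescriptorSet
embeddings = np.zeros((20, DIMS), dtype="float32")  # placeholder for the real matrix

assert embeddings.shape[1] == DIMS, (
    f"embedding width {embeddings.shape[1]} != set dimensions {DIMS}"
)
assert embeddings.dtype == np.float32  # descriptor blobs are sent as raw float32
```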

Bulk Ingest with ParallelLoader

ParallelLoader ingests data concurrently. It takes any Subscriptable — here we build a simple list of (query, blobs) pairs.
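Each blob in a pair is the raw bytes of one float32 vector: 384 floats become 1536 bytes, and numpy can round-trip them losslessly. A small sketch of that packing, using a stand-in vector:

```python
import numpy as np

emb = np.arange(384, dtype="float32")  # stand-in for one embedding
blob = emb.tobytes()                   # the blob sent alongside each AddDescriptor

assert len(blob) == 384 * 4            # 4 bytes per float32
restored = np.frombuffer(blob, dtype="float32")
assert np.array_equal(restored, emb)   # lossless round trip
```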

from aperturedb.ParallelLoader import ParallelLoader

class DescriptorGenerator:
    """Subscriptable generator of (query, blobs) pairs for ParallelLoader."""

    def __init__(self, dishes, embeddings, set_name):
        self.dishes = dishes
        self.embeddings = embeddings
        self.set_name = set_name

    def __len__(self):
        return len(self.dishes)

    def __getitem__(self, idx):
        # ParallelLoader calls __getitem__ with a slice for each batch;
        # return a list of (query, blobs) pairs in that case.
        if isinstance(idx, slice):
            return [self[i] for i in range(*idx.indices(len(self)))]
        row = self.dishes.iloc[idx]
        emb = self.embeddings[idx].astype("float32")
        query = [{
            "AddDescriptor": {
                "set": self.set_name,
                "properties": {
                    "dish_name": row["dish_name"],
                    "cuisine": row["food_tags"],
                    "caption": row["caption"],
                },
                "if_not_found": {"dish_name": ["==", row["dish_name"]]},
            }
        }]
        return query, [emb.tobytes()]

generator = DescriptorGenerator(dishes, embeddings, SET_NAME)
loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)
Progress: 100%|██████████| 20.0/20.0 [00:02<00:00, 9.95items/s]

============ ApertureDB Loader Stats ============
Total time (s): 2.011025905609131
Total queries executed: 4
Avg Query time (s): 1.3486077785491943
Query time std: 0.12229802807468174
Avg Query Throughput (q/s): 2.966021747481778
Overall insertion throughput (element/s): 9.945172732094711
Total inserted elements: 20
Total successful commands: 20
=================================================
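The slice handling in __getitem__ above is the whole batching contract: ParallelLoader indexes the generator with slices and expects a list back. The pattern can be seen in isolation with a toy subscriptable (MiniGen is hypothetical, purely for illustration):

```python
class MiniGen:
    """Toy subscriptable showing the slice-batching pattern ParallelLoader relies on."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        if isinstance(idx, slice):  # a batch request: expand into single items
            return [self[i] for i in range(*idx.indices(len(self)))]
        return self.items[idx]

g = MiniGen(list("abcdefgh"))
print(g[0:5])   # one batch of 5 → ['a', 'b', 'c', 'd', 'e']
print(g[5:10])  # slice.indices clamps past the end → ['f', 'g', 'h']
```

slice.indices(len(self)) is what keeps the final, short batch from raising an IndexError.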

Verify the Ingestion

response, _ = client.query([{
    "FindDescriptorSet": {
        "with_name": SET_NAME,
        "results": {"count": True},
    }
}])
client.print_last_response()
[
  {
    "FindDescriptorSet": {
      "count": 1,
      "returned": 0,
      "status": 0
    }
  }
]

Search the Bulk-Loaded Descriptors

query_text = "creamy tomato curry"
query_emb = model.encode([query_text], normalize_embeddings=True)[0].astype("float32")

response, _ = client.query([{
    "FindDescriptor": {
        "set": SET_NAME,
        "k_neighbors": 3,
        "distances": True,
        "results": {"all_properties": True},
    }
}], [query_emb.tobytes()])

for entity in response[0]["FindDescriptor"].get("entities", []):
    score = 1 - entity["_distance"]
    print(f"  {entity['dish_name']:<30} [{entity['cuisine']}] score={score:.3f}")
  Butter chicken                 [Indian] score=0.423
  paneer bhurji                  [Indian] score=0.567
  waffle, smoothie               [American] score=0.573
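The score column is just 1 - _distance. Assuming that distance convention, the printed values can be reproduced from the raw distances (hardcoded here to match the output above, so this runs without a server):

```python
# _distance values implied by the scores printed above (score = 1 - distance)
distances = {
    "Butter chicken": 0.577,
    "paneer bhurji": 0.433,
    "waffle, smoothie": 0.427,
}

scores = {name: round(1 - d, 3) for name, d in distances.items()}
print(scores)
# → {'Butter chicken': 0.423, 'paneer bhurji': 0.567, 'waffle, smoothie': 0.573}
```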

Cleanup

client.query([{"DeleteDescriptorSet": {"with_name": SET_NAME}}])
client.print_last_response()
[
  {
    "DeleteDescriptorSet": {
      "count": 1,
      "status": 0
    }
  }
]

What's Next