
Parallel Ingestion Alternatives

Load large numbers of embeddings efficiently with ApertureDB's ParallelLoader. It batches writes and ingests them concurrently across multiple threads, making it the right tool when you have thousands or millions of vectors to load.

Runnable Notebook

Bulk Embeddings — generate sentence-transformer embeddings from the Cookbook dataset and ingest them with ParallelLoader


How It Works

ParallelLoader takes any Subscriptable — an object that implements __len__ and __getitem__ (including slice access) and returns (query, blobs) pairs. The loader batches the items and issues the writes from multiple worker threads.

from aperturedb.ParallelLoader import ParallelLoader

loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)
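Because the contract is just __len__ and __getitem__ with slice support, a plain Python list of (query, blobs) pairs already satisfies it — a minimal sketch with hypothetical AddEntity queries, no database connection needed to build the batch source itself:

```python
# A plain list implements __len__ and __getitem__ (including slices),
# so it can be handed to ParallelLoader directly. The "Example" class
# and its properties are placeholders for illustration.
queries = [
    (
        [{"AddEntity": {"class": "Example", "properties": {"i": i}}}],
        [],  # no blobs for plain entities
    )
    for i in range(10)
]

print(len(queries))   # 10
batch = queries[0:5]  # ParallelLoader slices out batches like this
print(len(batch))     # 5
```

Writing a dedicated generator class, as shown next, is useful when the pairs are expensive to materialize up front.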

Custom Generator

When embeddings come from a model or a custom source, implement a generator class:

from aperturedb.ParallelLoader import ParallelLoader

class DescriptorGenerator:
    """Yields (query, blobs) pairs for ParallelLoader."""

    def __init__(self, items, embeddings, set_name):
        self.items = items
        self.embeddings = embeddings
        self.set_name = set_name

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # ParallelLoader passes a slice for each batch — return a list of pairs.
        if isinstance(idx, slice):
            return [self[i] for i in range(*idx.indices(len(self)))]
        item = self.items[idx]
        emb = self.embeddings[idx].astype("float32")
        query = [{
            "AddDescriptor": {
                "set": self.set_name,
                "properties": {
                    "title": item["title"],
                    "category": item["category"],
                },
                "if_not_found": {"title": ["==", item["title"]]},
            }
        }]
        return query, [emb.tobytes()]

generator = DescriptorGenerator(items, embeddings, "my_set")
loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)

if_not_found makes ingestion idempotent — re-running the notebook won't create duplicates.
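One way to check idempotency is to count descriptors in the set before and after a re-run. A sketch of the count query, assuming the set name "my_set" from the example above and ApertureDB's standard FindDescriptor command with a count-only result:

```python
# Build a count query for a descriptor set. With results.count set,
# FindDescriptor returns the number of matching descriptors rather
# than the vectors themselves.
def count_query(set_name):
    return [{
        "FindDescriptor": {
            "set": set_name,
            "results": {"count": True},
        }
    }]

q = count_query("my_set")
# With a connected client, the count would be read like this:
#   response, _ = client.query(q)
#   print(response[0]["FindDescriptor"]["count"])
print(q[0]["FindDescriptor"]["set"])  # my_set
```

If the count is unchanged after re-running the loader, the if_not_found constraint did its job.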


CSV + NPZ Files

For pre-computed embeddings on disk, use DescriptorDataCSV with a CSV that maps each row to a vector in a .npz file:

filename,index,set,label,source,constraint_source
/data/embeddings.npz,0,my_set,label_a,doc-001,doc-001
/data/embeddings.npz,1,my_set,label_b,doc-002,doc-002

from aperturedb.DescriptorDataCSV import DescriptorDataCSV
from aperturedb.ParallelLoader import ParallelLoader

data = DescriptorDataCSV("/path/to/descriptors.csv")
loader = ParallelLoader(client)
loader.ingest(data, batchsize=100, numthreads=4, stats=True)
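Producing the two files is straightforward with numpy and the csv module. A sketch with hypothetical data — note that the exact .npz layout DescriptorDataCSV expects is an assumption here, so verify it against the DescriptorDataCSV documentation:

```python
import csv

import numpy as np

# Hypothetical embeddings and metadata for two documents.
embeddings = np.random.rand(2, 384).astype("float32")
rows = [
    {"label": "label_a", "source": "doc-001"},
    {"label": "label_b", "source": "doc-002"},
]

# Save the vectors; the CSV's `index` column points into this file.
# ASSUMPTION: the array key/layout expected by DescriptorDataCSV
# should be checked against its docs.
np.savez("embeddings.npz", embeddings=embeddings)

with open("descriptors.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "filename", "index", "set", "label", "source", "constraint_source"])
    writer.writeheader()
    for i, row in enumerate(rows):
        writer.writerow({
            "filename": "embeddings.npz",
            "index": i,
            "set": "my_set",
            "label": row["label"],
            "source": row["source"],
            "constraint_source": row["source"],
        })
```

The constraint_source column plays the same deduplication role as if_not_found in the generator example.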

Or use the CLI directly:

adb ingest from-csv --descriptor-data descriptors.csv

What's Next