Parallel Ingestion Alternatives
Load large numbers of embeddings efficiently with ApertureDB's `ParallelLoader`. It ingests data concurrently across multiple threads, making it the right tool when you have thousands or millions of vectors to load.
Bulk Embeddings — generate sentence-transformer embeddings from the Cookbook dataset and ingest them with ParallelLoader
How It Works
`ParallelLoader` takes any `Subscriptable`: an object that implements `__len__` and `__getitem__` and returns `(query, blobs)` pairs. It batches the items and parallelizes the writes.
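As a minimal illustration of that contract, a thin wrapper around a prebuilt list of pairs already qualifies. The `PairList` name and this sketch are ours, not part of the SDK:

```python
class PairList:
    """Minimal Subscriptable-style object: wraps prebuilt (query, blobs) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs  # list of (query, blobs) tuples

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # Plain list indexing already handles both single indices and slices.
        return self.pairs[idx]
```

Any such object can be passed straight to `ParallelLoader.ingest` as the generator: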
```python
from aperturedb.CommonLibrary import create_connector
from aperturedb.ParallelLoader import ParallelLoader

client = create_connector()  # connection details come from your environment/config
loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)
```
Custom Generator
When embeddings come from a model or a custom source, implement a generator class:
```python
from aperturedb.ParallelLoader import ParallelLoader

class DescriptorGenerator:
    """Yields (query, blobs) pairs for ParallelLoader."""

    def __init__(self, items, embeddings, set_name):
        self.items = items
        self.embeddings = embeddings
        self.set_name = set_name

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # ParallelLoader passes a slice for each batch: return a list of pairs.
        if isinstance(idx, slice):
            return [self[i] for i in range(*idx.indices(len(self)))]
        item = self.items[idx]
        emb = self.embeddings[idx].astype("float32")
        query = [{
            "AddDescriptor": {
                "set": self.set_name,
                "properties": {
                    "title": item["title"],
                    "category": item["category"],
                },
                "if_not_found": {"title": ["==", item["title"]]},
            }
        }]
        return query, [emb.tobytes()]  # one vector as raw float32 bytes

generator = DescriptorGenerator(items, embeddings, "my_set")
loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)
```
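One prerequisite: `AddDescriptor` can only write into an existing descriptor set. If you haven't created one yet, a minimal sketch follows; the 384 dimensions assume a MiniLM-style sentence-transformer, and the engine and metric here are just common choices, so match all three to your model and workload:

```python
# Run once before ingesting; dimensions must match your embedding model.
client.query([{
    "AddDescriptorSet": {
        "name": "my_set",
        "dimensions": 384,    # e.g. all-MiniLM-L6-v2; adjust for your model
        "engine": "FaissFlat",
        "metric": "CS",       # cosine similarity
    }
}])
```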
`if_not_found` makes ingestion idempotent: re-running the notebook won't create duplicates.
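To see the idempotency in practice, run the ingest a second time and compare descriptor counts before and after. This sketch assumes `FindDescriptor` accepts a count-only `results` block, as the other Find commands do:

```python
def count_descriptors(client, set_name):
    # Count-only query: no descriptors are returned, just the total.
    response, _ = client.query([{
        "FindDescriptor": {"set": set_name, "results": {"count": True}}
    }])
    return response[0]["FindDescriptor"]["count"]

before = count_descriptors(client, "my_set")
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)  # re-run
assert count_descriptors(client, "my_set") == before  # no duplicates created
```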
CSV + NPZ Files
For pre-computed embeddings on disk, use `DescriptorDataCSV` with a CSV that maps each row to a vector in a `.npz` file:
```csv
filename,index,set,label,source,constraint_source
/data/embeddings.npz,0,my_set,label_a,doc-001,doc-001
/data/embeddings.npz,1,my_set,label_b,doc-002,doc-002
```
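If you need to produce these files from in-memory embeddings, something along these lines should work. The `embeddings`, `labels`, and `doc_ids` variables are placeholders for your own data, and the `np.savez` layout (default `arr_0` key) is an assumption, so check the `DescriptorDataCSV` reference for the exact archive format it expects:

```python
import csv
import numpy as np

# embeddings: (N, dim) float32 array; labels and doc_ids align row-for-row.
np.savez("/data/embeddings.npz", embeddings)

with open("descriptors.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "index", "set", "label", "source", "constraint_source"])
    for i, doc_id in enumerate(doc_ids):
        writer.writerow(["/data/embeddings.npz", i, "my_set", labels[i], doc_id, doc_id])
```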
```python
from aperturedb.DescriptorDataCSV import DescriptorDataCSV
from aperturedb.ParallelLoader import ParallelLoader

data = DescriptorDataCSV("/path/to/descriptors.csv")
loader = ParallelLoader(client)
loader.ingest(data, batchsize=100, numthreads=4, stats=True)
```
Or use the CLI directly:
```bash
adb ingest from-csv descriptors.csv --ingest-type DESCRIPTOR
```
What's Next
- Bulk Embeddings notebook — run this interactively on the Cookbook dataset
- Hybrid Search — add metadata filters to KNN search
- Building RAG Pipelines — use bulk-loaded embeddings as a retrieval store
- ParallelLoader reference — full parameter list
- Embeddings Extraction workflow — production-scale ingestion via the Workflows UI