
Image Embedding Models

ApertureDB stores images and their embeddings together, linked by a graph edge. A KNN query can traverse from matching descriptors directly to image blobs — no separate fetch step.

Runnable Notebooks

For setup and client configuration, see Client Configuration. For server setup options, see Server Setup.


CLIP

CLIP embeds images and text into the same vector space, enabling text-to-image and image-to-image search. The clip-ViT-B-32 model from sentence-transformers is the simplest way to use it without PyTorch boilerplate:

pip install -U aperturedb sentence-transformers Pillow requests

import requests
import numpy as np
from PIL import Image
from io import BytesIO
from sentence_transformers import SentenceTransformer
from aperturedb.CommonLibrary import create_connector

client = create_connector()
model = SentenceTransformer("clip-ViT-B-32") # 512-dimensional

# Create DescriptorSet
client.query([{"AddDescriptorSet": {
    "name": "food_image_search",
    "dimensions": 512,
    "engine": "HNSW",
    "metric": "CS",
}}])

# Add image + embedding in one transaction
image_url = "https://example.com/butter_chicken.jpg"
resp = requests.get(image_url, timeout=10)
img = Image.open(BytesIO(resp.content)).convert("RGB")
emb = model.encode(img, normalize_embeddings=True).astype("float32")

client.query(
    [
        {"AddImage": {"url": image_url, "_ref": 1,
                      "properties": {"dish": "Butter Chicken", "cuisine": "Indian"}}},
        {"AddDescriptor": {"set": "food_image_search",
                           "connect": {"ref": 1, "class": "has_embedding"},
                           "properties": {"dish": "Butter Chicken"}}},
    ],
    [emb.tobytes()]
)

Text-to-image search — CLIP text and image embeddings are comparable, so a text query returns visually matching images:

query_emb = model.encode("creamy curry", normalize_embeddings=True).astype("float32")

q = [
{"FindDescriptor": {"set": "food_image_search", "k_neighbors": 5, "distances": True, "_ref": 1}},
{"FindImage": {"is_connected_to": {"ref": 1, "class": "has_embedding"}, "blobs": True, "results": {"all_properties": True}}},
]
response, blobs = client.query(q, [query_emb.tobytes()])

The FindDescriptor → FindImage traversal returns the matched images and their metadata in a single round trip.
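The reply arrives as one JSON dict per command, in order. A sketch of pairing FindImage results with FindDescriptor distances — the response below is a mocked illustration of the general shape (entity field names and values are assumptions for illustration, not captured server output):

```python
# Mocked response illustrating the general reply shape for the
# FindDescriptor + FindImage query above (illustrative values only)
response = [
    {"FindDescriptor": {"returned": 2, "entities": [
        {"_distance": 0.12}, {"_distance": 0.34}]}},
    {"FindImage": {"returned": 2, "entities": [
        {"dish": "Butter Chicken", "cuisine": "Indian"},
        {"dish": "Paneer Tikka Masala", "cuisine": "Indian"}]}},
]
blobs = [b"<jpeg bytes>", b"<jpeg bytes>"]

# Distances come from the descriptor command, metadata and blobs
# from the image command; zip them back together by position
distances = [e["_distance"] for e in response[0]["FindDescriptor"]["entities"]]
images = response[1]["FindImage"]["entities"]

for meta, dist, blob in zip(images, distances, blobs):
    print(f"{meta['dish']}: distance={dist}, {len(blob)} bytes")
```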


Ingesting the Cookbook Dataset

The Cookbook dataset (20+ dish photos) can be ingested with CLIP embeddings in one command using the ApertureDB CLI:

wget https://github.com/aperture-data/Cookbook/raw/refs/heads/main/scripts/load_cookbook_data.sh
bash load_cookbook_data.sh

This ingests all dish images with CLIP ViT-B/16 embeddings stored in a ViT-B/16 DescriptorSet. After ingestion, the Quick Start notebook's section 5c runs text-to-image search over all dish photos.


FaceNet

For a large-scale example with a different model, the CelebA Face Similarity Search walkthrough uses FaceNet embeddings on 200k+ celebrity images with metadata-filtered KNN search (hair color, glasses, age).
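Metadata-filtered KNN composes by adding constraints to FindDescriptor. A sketch of the query shape — the set name and property names (hair_color, wears_glasses) are illustrative assumptions, not the walkthrough's actual schema:

```python
# Hypothetical set and property names, for illustration only
q = [
    {"FindDescriptor": {
        "set": "celeba_facenet",       # assumed DescriptorSet name
        "k_neighbors": 10,
        "distances": True,
        "constraints": {               # metadata filter on the descriptors
            "hair_color": ["==", "blond"],
            "wears_glasses": ["==", False],
        },
        "_ref": 1,
    }},
    {"FindImage": {
        "is_connected_to": {"ref": 1},
        "blobs": True,
        "results": {"all_properties": True},
    }},
]
```

The filter narrows candidates by property while k_neighbors still bounds the number of returned matches.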


Structured Ingestion with DataModels

For bulk ingestion using typed Pydantic schemas, see Structured Ingestion with DataModels. This approach is used in the Cookbook dataset loader and the CelebA similarity search example.
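The linked guide covers the actual DataModels classes; as a rough stand-in for the idea, each typed record pairs an image source with its metadata and maps onto one AddImage command. A plain-dataclass sketch (the class and field names are hypothetical, not the DataModels API):

```python
from dataclasses import dataclass

# Hypothetical record type standing in for a typed ingestion schema
@dataclass
class DishImage:
    url: str
    dish: str
    cuisine: str

records = [
    DishImage("https://example.com/butter_chicken.jpg", "Butter Chicken", "Indian"),
]

# Each record maps onto one AddImage command in a bulk load
commands = [
    {"AddImage": {"url": r.url,
                  "properties": {"dish": r.dish, "cuisine": r.cuisine}}}
    for r in records
]
```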


What's Next