Data Input & Output Formats
When sending data to or retrieving data from ApertureDB, encoding and decoding are common questions because ApertureDB deals with unstructured data.
For bulk loading, the CSV loaders will usually handle encoding automatically, whether you are sending Images or Descriptors.
But when querying or modifying data from custom code, knowing how to send and receive it is important.
Image Encoding
Images are one of the most common data types stored in ApertureDB, so let's start with them.
Retrieving Images
Since storing images in ApertureDB is often handled via the data loader tools, we'll also start with the most common task:
querying for an image, then retrieving it. First we'll get a basic query down, then we'll do multiple images and
multiple queries.
Basic Image Retrieval
Let's say that we have an ApertureDB database where we have stored images with a metadata property called uuid which uniquely identifies them. The following code is a good way to retrieve an image:
query = [{
    "FindImage": {
        "blobs": True,  # return image data
        "constraints": {
            # search by a unique property to retrieve only one matching image
            "uuid": ["==", "127e46bf-48c0-42c1-8c70-6c0e24252c15"]
        }
    }
}]
results, blobs = db.query(query)
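Before touching the blobs, it helps to inspect the metadata response. It is a list with one entry per command; for a successful FindImage it contains, among other fields, a status code and the number of matches (the exact output shown here is illustrative):
print(results)
# e.g. [{'FindImage': {'returned': 1, 'status': 0}}]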
But what format is that blob in, and how do you get it into the format you want?
As you may know, blobs, returned by db.query(), is a list, and each result carries an index into that list so you can match blobs to entities. We'll deal with that later; first, let's retrieve the image:
if db.last_query_ok() and results[0]["FindImage"]["returned"] != 0:
    with open('output.jpg', 'wb') as file:
        file.write(blobs[0])
The blobs[0] object returned is a Python bytes object. If you want to use the image with the PIL library, you can then do:
from PIL import Image
im = Image.open('output.jpg')
print(f"width: {im.width} height: {im.height}")
To skip saving to disk, you can use BytesIO to stream the bytes into PIL:
from io import BytesIO
as_stream = BytesIO(blobs[0])
im = Image.open(as_stream)
print(f"width: {im.width} height: {im.height}")
Multiple Image Retrieval
That was pretty simple, but what about multiple images? Here we assume "dog" will match more than one image:
query = [{
    "FindImage": {
        "blobs": True,
        "limit": 10,
        "constraints": {
            "animal": ["==", "dog"]
        }
    }
}]
results, blobs = db.query(query)
If we don't know how many matches there are beforehand, how can we deal with this? As we saw above, the 'returned' property tells us how many images came back.
We also add limit because we don't know how many the query will match, and we might not want to deal with 1000 dog images at once.
if db.last_query_ok() and results[0]["FindImage"]["returned"] != 0:
    for i in range(results[0]["FindImage"]["returned"]):
        with open(f'dog_{i}.jpg', 'wb') as file:
            file.write(blobs[i])
And like that, we can have all the dogs we can handle.
Speaking of handling, if you have a lot of images or blobs to work with, you might want to use ParallelQuery with a QueryGenerator. Together they let you use multiple threads and retrieve in batches, while still dealing with each result individually; a sketch appears at the end of the Embedding Encoding section.
Complex Queries
Lastly, we will write a query which groups results by their source. It uses the dataset from [Embedding Encoding](#embedding-encoding); follow the instructions there to download and uncompress it. Then run:
from aperturedb.ParallelLoader import ParallelLoader
from aperturedb.ImageDataCSV import ImageDataCSV
from aperturedb.EntityDataCSV import EntityDataCSV
from aperturedb.ConnectionDataCSV import ConnectionDataCSV
data = ImageDataCSV("animals/image.adb.csv", blobs_relative_to_csv=True)
categories = EntityDataCSV("animals/classes.adb.csv")
connections = ConnectionDataCSV("animals/connection.adb.csv")
loader = ParallelLoader(db)
loader.ingest(data)
loader.ingest(categories)
loader.ingest(connections)
This will load the database with animal entities carrying their scientific names, linked to images of those animals.
Our complex query searches for animals by their common name, returns their scientific name, then finds the images associated with those animals.
We use _uniqueid to link scientific names with the images.
query = [{
    "FindEntity": {
        "with_class": "animal",
        "_ref": 1,
        "results": {
            "list": ["scientific_name", "_uniqueid"]
        },
        "constraints": {
            "name": ["in", ["Hedgehog", "Harbor seal"]]
        }
    }
}, {
    "FindImage": {
        "is_connected_to": {
            "ref": 1
        },
        "results": {
            "list": ["cute_id"]
        },
        "blobs": True,
        "group_by_source": True
    }
}]
r, b = db.query(query)
# Create a map from entity _uniqueid to scientific name.
id_to_name = {e["_uniqueid"]: e["scientific_name"] for e in r[0]["FindEntity"]["entities"]}
# With group_by_source, FindImage returns its entities grouped by the
# _uniqueid of the connected source entity.
image_results = r[1]["FindImage"]["entities"]
for group_id, images in image_results.items():
    for img in images:
        filename = f"{id_to_name[group_id]}_{img['cute_id']}.jpg"
        with open(filename, 'wb') as fp:
            fp.write(b[img['_blob_index']])
Now you will have images of the animals, saved by their scientific name, identified by their "cute_id".
Uploading Images
For bulk loading of images, ImageDataCSV is usually recommended, but sometimes you need to trigger a load from inside other code.
Uploading a Single Image
query = [{
    "AddImage": {
        "properties": {
            "type": "house",
            "addr_street": "123 Maple Street",
            "addr_city": "Treetown",
            "addr_state": "NY",
            "addr_country": "US",
            "retail_uuid": "e08c0f92-b08f-439d-ab94-345a6e137dc7"
        },
        "if_not_found": {
            "retail_uuid": ["==", "e08c0f92-b08f-439d-ab94-345a6e137dc7"]
        }
    }
}]
with open('house.jpg', 'rb') as fp:
    image = fp.read()
result, _ = db.query(query, [image])
if db.last_query_ok():
    status = result[0]["AddImage"]["status"]
    if status == 0:
        print("Added Image!")
    elif status == 2:
        print("Image exists!")
A status of 0 means the AddImage completed ok. A status of 2 means the image wasn't added because the if_not_found condition was triggered.
Embedding Encoding
The loading of embeddings/descriptors is much like the loading of images, but with a few crucial differences. The first important note is that the embedding cannot be a sparse representation that omits the 0s: the vector's size must match the size defined by the set. The next note is that every embedding must be part of a set.
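For instance, if your pipeline produces a sparse vector, densify it to the set's full dimensionality before sending it. A minimal sketch, assuming a 768-dimensional set and a hypothetical {index: value} sparse format:
import numpy as np

# Hypothetical sparse representation: only the non-zero entries.
sparse = {3: 0.12, 41: -0.87, 700: 0.33}

# Densify to the exact size the set expects; the zeros must be present.
dense = np.zeros(768, dtype=np.float32)
for idx, value in sparse.items():
    dense[idx] = value

blob = dense.tobytes()  # ready to send alongside AddDescriptor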
Setting Up a Simple Model
We use a prepared dataset for the embedding examples; download it here. These are public images licensed CC-0 and CC-BY.
Once you've downloaded the file:
unzip input-output-dataset.zip
This should create a directory called animals with some images.
To ensure the consistency of the example, we utilize a Python package that makes input and output of embeddings simple.
pip install imgbeddings torch  # needs PyTorch or TensorFlow
This module utilizes CLIP to create a 768-dimensional output. You may wish to install it in a venv, as it has a number of dependencies.
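For example:
python -m venv venv
source venv/bin/activate
pip install imgbeddings torch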
First, load an image using PIL:
from PIL import Image
from io import BytesIO
image = Image.open('animals/tabby-cat.jpg')
Then use imgbeddings to create an embedding. It will need to download a model on first run, but it will cache it.
from imgbeddings import imgbeddings
model = imgbeddings()
embeddings = model.to_embeddings(image)
This will give you a 2-dimensional numpy array (1 x 768 for a single image); the first dimension is a batch dimension, used when multiple images are passed in.
print(embeddings[0][0:10])
will give you the first 10 values.
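One thing worth checking before converting to bytes: the descriptor blob must contain 32-bit floats, 4 bytes per dimension, so cast if your model emits float64. A small sketch, assuming the 768-dimensional set used below:
import numpy as np

emb = embeddings[0]
if emb.dtype != np.float32:
    emb = emb.astype(np.float32)  # descriptor blobs are 32-bit floats
assert emb.nbytes == 768 * 4      # 4 bytes per dimension
blob = emb.tobytes()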
Adding an Embedding
First, if you don't have a descriptor set, create one:
query = [{
    "AddDescriptorSet": {
        "name": "cute animals",
        "dimensions": 768,
        "engine": "Flat",
        "metric": "L2"
    }
}]
r, _ = db.query(query)
Then we can add the cat image and the embedding all at once.
query = [{
    "AddImage": {
        "_ref": 1,
        "properties": {
            "category": "animals",
            "animal": "cat",
            "cute_id": 42
        }
    }
}, {
    "AddDescriptor": {
        "set": "cute animals",
        "connect": {
            "ref": 1
        }
    }
}]
img_bytes = BytesIO()
image.save(img_bytes, format=image.format)
r, _ = db.query(query, [img_bytes.getvalue(), embeddings[0].tobytes()])
If your embedding is a numpy ndarray, just use tobytes to convert it to an ingestable format.
A PIL image must also be converted to bytes; to create a bytes object from a PIL image, follow the example above. If you don't need to read or verify images first, you can simply read the raw file with open.
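For example, if the file on disk is already in the format you want stored:
with open('animals/tabby-cat.jpg', 'rb') as fp:
    img_bytes = fp.read()  # raw file bytes, ready to pass as a blob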
Now you have added a new image, and the embedding that was generated for it. If you already have the image in the database, substitute FindImage for AddImage and use the same AddDescriptor, as sketched below.
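A minimal sketch of that variant, reusing the cute_id property and embeddings from above:
query = [{
    "FindImage": {
        "_ref": 1,
        "constraints": {
            "cute_id": ["==", 42]
        }
    }
}, {
    "AddDescriptor": {
        "set": "cute animals",
        "connect": {
            "ref": 1
        }
    }
}]
r, _ = db.query(query, [embeddings[0].tobytes()])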
Retrieving an Embedding
There are multiple ways to retrieve an embedding: by querying for it directly with a constraint on a property value, by traversing from a linked item, or by comparing descriptors and finding related ones.
First we'll retrieve an embedding by connections in the graph:
query = [{
    "FindImage": {
        "_ref": 1,
        "constraints": {
            "cute_id": ["==", 42]
        }
    }
}, {
    "FindDescriptor": {
        "set": "cute animals",
        "is_connected_to": {  # this traverses the relationship
            "ref": 1
        },
        "blobs": True
    }
}]
r, b = db.query(query)
assert b[0] == embeddings[0].tobytes(), "Consistency check failed!"
As we can see, we have the same embedding as we started with.
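The first option, filtering on a property value directly, looks like the sketch below. Note that it assumes you have stored a property (here, animal) on the descriptors themselves, which the examples above do not do, so adapt it to your own schema:
query = [{
    "FindDescriptor": {
        "set": "cute animals",
        "constraints": {
            "animal": ["==", "cat"]
        },
        "blobs": True
    }
}]
r, b = db.query(query)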
To retrieve using neighbors, we'll need more descriptors, so let's work on that first:
Adding Multiple Embeddings
We will use the dataset from the beginning of the section here. We will loop over all the files and create commands to add an image and an embedding for each.
from pathlib import Path

query = []
blobs = []
i = 1
for p in Path("animals").iterdir():
    parts = p.stem.split('_')
    if len(parts) != 2:
        continue
    (animal, cute_id) = parts
    # Create the query for each item; refs must be unique within a query.
    query.extend([{
        "AddImage": {
            "_ref": i,
            "properties": {
                "cute_id": int(cute_id),
                "animal": animal.replace('-', ' ')
            }
        }
    }, {
        "AddDescriptor": {
            "set": "cute animals",
            "connect": {
                "ref": i
            }
        }
    }])
    # Prepare the data: image bytes first, then the embedding bytes.
    image = Image.open(p)
    embeddings = model.to_embeddings(image)
    img_bytes = BytesIO()
    image.save(img_bytes, format=image.format)
    blobs.extend([img_bytes.getvalue(), embeddings[0].tobytes()])
    i = i + 1
results, _ = db.query(query, blobs)
After running this code, you should have a database full of (mostly) cute animals.
Note that if you have a lot of data to add (more than a thousand embeddings), a QueryGenerator driven by a ParallelQuery will handle this best.
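Here is a rough sketch of what that could look like for the loop above, reusing Image, BytesIO, Path, and model from earlier. The exact interface varies between SDK versions; this assumes a QueryGenerator subclass that implements __len__ and getitem returning a (query, blobs) pair, mirroring how the CSV loaders behave, and that ParallelQuery.query accepts batchsize and numthreads like ParallelLoader.ingest. Check your version's documentation before relying on it:
from aperturedb.ParallelQuery import ParallelQuery
from aperturedb.QueryGenerator import QueryGenerator

class EmbeddingGenerator(QueryGenerator):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def getitem(self, idx):
        # Build one (query, blobs) pair per image path.
        p = self.paths[idx]
        animal, cute_id = p.stem.split('_')
        image = Image.open(p)
        emb = model.to_embeddings(image)
        buf = BytesIO()
        image.save(buf, format=image.format)
        query = [{
            "AddImage": {
                "_ref": 1,
                "properties": {
                    "cute_id": int(cute_id),
                    "animal": animal.replace('-', ' ')
                }
            }
        }, {
            "AddDescriptor": {
                "set": "cute animals",
                "connect": {"ref": 1}
            }
        }]
        return query, [buf.getvalue(), emb[0].tobytes()]

paths = [p for p in Path("animals").iterdir() if len(p.stem.split('_')) == 2]
ParallelQuery(db).query(EmbeddingGenerator(paths), batchsize=50, numthreads=4)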
Retrieving Embeddings by Similarity Search
Now that we have a database we can use as an example, we can use a new embedding to find results from our existing set. Note that we use the same method for generating it as before, but pass it to FindDescriptor. After that, we use the connection between each descriptor and its source image to retrieve the connected image as well.
# First create our new embedding
image = Image.open("black-cat.jpg")
embeddings = model.to_embeddings(image)
# now use it for a neighbor search.
query = [{
    "FindDescriptor": {
        "_ref": 1,
        "set": "cute animals",
        "k_neighbors": 5,
        "distances": True
    }
}, {
    "FindImage": {
        "is_connected_to": {
            "ref": 1
        },
        "blobs": True,
        "results": {
            "list": ["cute_id", "animal"]
        }
    }
}]
results, blobs = db.query(query, [embeddings[0].tobytes()])
# Finally, we take the results of our search and save them.
descriptors = results[0]["FindDescriptor"]["entities"]
images = results[1]["FindImage"]["entities"]
for (n, image) in enumerate(images):
    filename = "neighbor_{}_{}.jpg".format(image['animal'], image['cute_id'])
    print("distance = {}".format(descriptors[n]["_distance"]))
    with open(filename, 'wb') as fp:
        fp.write(blobs[image['_blob_index']])
This will output the distance and the images for the 5 neighbors, with their names generated from their animal type and their id.
Closing
As we can see, handling input and output with ApertureDB is quite easy. The important point is that data is delivered as bytes objects for both input and output, and it's up to you to convert those bytes to and from the object types you need.