Working with a Dataset

Loading a dataset.

The quickest way to get started with ApertureDB is to load a simple dataset containing some images and metadata but without a complex schema. Most Kaggle datasets will load into ApertureDB with minimal modifications (for a more complex example, see our celebrity faces loader).

There are multiple ways to load data into ApertureDB, as discussed in Ingest Data And Metadata, but with a pre-existing dataset, the quickest way is to ingest using a CSV file.

For following along with this document, we've produced a subset of the Stanford Cars Dataset to create a small CSV file as a sample of what a CSV file might look like (more examples here). You can use your own dataset if you want, and we will make out things you might need to change.

Tools Needed

To load a dataset, you will need

ApertureDB credentials and instance (you can use the docker instance for a simple server)
A system with Python 3+ installed
ApertureDB Python package ( installed with pip install aperturedb )
A dataset to load and query
Jupyter notebook or just command-line execution of Python samples

Setup

On your system with the aperturedb package installed, first you will need to download your dataset. The sample used in this document is the miniature Standford Cars Dataset. The archive contains images and a csv.

wget https://aperturedata-public.s3.us-west-2.amazonaws.com/sample_images/demo-slim-stanford-car-dataset.zip

Next, uncompress the dataset:

unzip demo-slim-stanford-car-dataset.zip
cd slim-stanford-car-dataset

Create CSV Input

Now we need to make a CSV file that will tell ApertureDB how to load our data.

In this guide we will assume that we are uploading images and associated properties with each image. The columns in the CSV file will be of 3 types

filename - points to the image to load - always the first column, called 'filename', 'url', 's3_url' or 'gs_url'.
constraint - input to limit which files are uploaded (like a primary key check in relational db).
properties - anything else without 'constraint_' as a prefix is basically image metadata

Many datasets will come with a CSV file which has properties, but some many not have any. To make a minimal CSV file in that case you would run something like

echo "filename" > dataset.csv && find -iname \*.jpg >> cars.csv

which would make a CSV file with 1 column containing the names of the files to load.

The Stanford dataset we have provided contains a CSV file called cars.csv. Now that we have a CSV file, we need to ensure it meets our needs.

In a Jupyter cell or input.py, you can add the following and run the cell or try python input.py:

import pandas as pd
df = pd.read_csv('cars.csv')
print(df)

Output:

    filename        make    model  year   color    style   angle
00001.jpg        audi       a7  2012   white    sedan  angled
00002.jpg       acura       tl  2014   black    sedan  angled
00003.jpg       dodge      ram  2009     red    truck  angled
00004.jpg     hyundai   sonata  2015     red    sedan    rear
00005.jpg        ford    f-350  2016   white    truck  angled
00006.jpg       izuzu   tracer  2007     red  compact  angled
00007.jpg       dodge  journey  2015    blue      suv  angled
00008.jpg       dodge  charger  2017     red   sports  rangle
00009.jpg  mitsubishi   galant  2012  silver    sedan   angle
00010.jpg   chevrolet  equinox  2017  silver      suv    side

Customize Your CSV

We will modify some columns because sometimes you may not care about data, or possibly some columns are partially empty or otherwise unhelpful. ApertureDB loaders also expect certain columns in the CSV files.

df.drop( ["year","style"], axis=1, inplace=True)

Another common issue is that sometimes the file information might not be relative. In our sample data, this is the case.

old_filename = df['filename']
df['filename'] = [ 'cars/cars_train/cars_train/' + name for name in old_filename ]

This will ensure the filename is relative to the csv file - a requirement for loading in ApertureDB when using the CSV loaders.

Now we'll show creating a new column with existing data. Say you want to make an id column that doesn't have the leading zeros.

from pathlib import Path
df['id'] = [ int( Path(file).name.split(".")[0]) for file in old_filename ]

This can be very useful if some of the data had different file types, such as .png - then your queries can just search for id == 20 and not have to know if it is 00020.jpg or 00020.png.

Finally, we'll add a unique identifier constraint so if you re-run the load it won't try to duplicate the data, then show what this looks like, and output it to a CSV file.

df['constraint_id'] = df['id']
print(df)
df.to_csv('modified_cars.csv',index=False)

Output

                               filename        make    model   color   angle  id  constraint_id
cars/cars_train/cars_train/00001.jpg        audi       a7   white  angled   1              1
cars/cars_train/cars_train/00002.jpg       acura       tl   black  angled   2              2
cars/cars_train/cars_train/00003.jpg       dodge      ram     red  angled   3              3
cars/cars_train/cars_train/00004.jpg     hyundai   sonata     red    rear   4              4
cars/cars_train/cars_train/00005.jpg        ford    f-350   white  angled   5              5
cars/cars_train/cars_train/00006.jpg       izuzu   tracer     red  angled   6              6
cars/cars_train/cars_train/00007.jpg       dodge  journey    blue  angled   7              7
cars/cars_train/cars_train/00008.jpg       dodge  charger     red  rangle   8              8
cars/cars_train/cars_train/00009.jpg  mitsubishi   galant  silver   angle   9              9
cars/cars_train/cars_train/00010.jpg   chevrolet  equinox  silver    side  10             10

A column that is used as constraint gets indexed automatically. If you are loading with any other method, check if you have created an index. Otherwise the load can be slow. It is, of course, fastest without any constraints.

Load the CSV File

First we'll define a function to do the loading either in a Jupyter cell or in a load.py script.

from aperturedb.Connector import Connector
from aperturedb.ImageDataCSV import ImageDataCSV
from aperturedb.ParallelLoader import ParallelLoader
def load_image_and_metadata(dbconn:Connector, path_to_csv:Path):
    image_csv = ImageDataCSV(str(path_to_csv))
    loader = ParallelLoader(dbconn)
    loader.ingest(image_csv)

As you can see, this part is really easy, because all the difficult part has been done!

Now all we need to do is run it. We'll assume you've already configured your connection in the environment.

Load Method 1 :

from aperturedb import Utils
conn = Utils.create_connector()
load_image_and_metadata( conn, Path('modified_cars.csv'))

Load Method 2 : You can also use the adb command line to load the data:

adb ingest from-csv modified_cars.csv --ingest-type IMAGE

tip

Checkout transformers like CLIP embedding generation when loading images with the adb command line.

This is a good time to use the web UI and navigate the cars in your database.

Query Your Data

Now we can check to see if your data loaded correctly. You can copy this and run in a notebook cell or create a query.py file:

from aperturedb import Utils
conn = Utils.create_connector()

count_query=[{
   "FindImage": {
        "blobs":False,
        "results": {
            "count":True
        }
   }
}]
result,blobs = conn.query(count_query)
print(conn.get_last_response_str()) # prints result with JSON formatting

Output

[
    {
        'FindImage': {
            'count': 10,
            'returned': 0,
            'status': 0
        }
    }
]

Filter Using Metadata

Now, with the existing connection, we will choose only a subset to download / display - the red cars.

count_query=[{
   "FindImage": {
        "blobs": True,
        "constraints": {
            "color": ["==","red"]
        },
        "results": {
            "count": True
        }
   }
}]
red_car_result,red_car_blobs = conn.query(count_query)
print(conn.get_last_response_str())

Output

[
    {
        'FindImage': {
            'blobs_start': 0,
            'count': 4,
            'returned': 4,
            'status': 0
        }
    }
]

Visualize Data

The example below, requires you to run it in a jupyter notebook. A simple visualization of the data as a grid display:

This requires 3 packages inside the notebook: aperturedb, PILLOW and ipyplot (all can be installed using pip).

tip

If you have never installed packages inside a notebook, you can do the following:

!pip install aperturedb Pillow ipyplot

First we define a function to save images to disk after resizing.

from pathlib import Path
from io import BytesIO
import PIL
def save_images(input_blobs, save_path:Path, name_func):
    images = []
    labels = []
    for i,img in enumerate(input_blobs):
        with BytesIO(img) as img_bytes:
            pil_img = PIL.Image.open( img_bytes)
            dim_max = max(pil_img.height, pil_img.width)
            scale = dim_max/256.0
            w = int( pil_img.width / scale )
            h = int( pil_img.height / scale )
            pil_img_resized = pil_img.resize( (w,h) )
            file_name = "{0}.jpg".format( name_func(i) )
            path = save_path.joinpath( file_name )
            print(f"dim_max: {dim_max}, scale: {scale}, w: {w}, h: {h}, file_name: {file_name}, path: {path}")
            images.append(pil_img_resized)
            labels.append(file_name)
    return images,labels

then we display them

from ipyplot import plot_images

def make_car_name(blob_index):
    return f"car-{blob_index}"
images,labels = save_images( red_car_blobs, Path("./data"), make_car_name )
plot_images(images,labels,img_width=150)

Quick glance through the WebUI

You can see your images visually if you use the docker-compose for web-ui in the setup instructions.

png

Red cars on the ApertureDB UI

Next Steps

Now that you have seen how easy it is to load and retrieve a dataset with images, you can find your own dataset, create or manipulate a CSV file describing them and load all of it into ApertureDB. You can also create your own example application following the steps outlined here.

Let us know if this helped!

Working with a Dataset

Loading a dataset.​

Tools Needed​

Setup​

Create CSV Input​

Customize Your CSV​

Load the CSV File​

Query Your Data​

Filter Using Metadata​

Visualize Data​

Quick glance through the WebUI​

Next Steps​

Loading a dataset.

Tools Needed

Setup

Create CSV Input

Customize Your CSV

Load the CSV File

Query Your Data

Filter Using Metadata

Visualize Data

Quick glance through the WebUI

Next Steps