Online Dataset Formats

Besides the ingestion methods for data that your pipelines generate periodically or that already lives in existing databases, we also support ingestion from some commonly used public dataset formats. This relies on Query Generators, which can be extended for specific dataset formats, as the popular examples below show.

COCO JSON Format

COCO is one of the most popular datasets for object detection, and its annotation format, usually referred to as the “COCO format”, has been widely adopted. The “COCO format” is a JSON structure that governs how labels and metadata are formatted for a dataset (official documentation here), and it is commonly used for training and inference in object detection tasks.
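To make the structure concrete, here is a minimal sketch of a COCO-style annotation document and how its three top-level lists relate. The key names (images, annotations, categories, bbox, segmentation) follow the COCO format; the file name, ids, and coordinates are made-up illustrative values, and the helper annotations_by_image is hypothetical.

```python
import json

# A hand-written, minimal COCO-style document (illustrative values only).
coco_text = """
{
  "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 18, "name": "dog"}, {"id": 17, "name": "cat"}],
  "annotations": [
    {"id": 100, "image_id": 1, "category_id": 18,
     "bbox": [10.0, 20.0, 200.0, 150.0],
     "segmentation": [[10.0, 20.0, 210.0, 20.0, 210.0, 170.0, 10.0, 170.0]]},
    {"id": 101, "image_id": 1, "category_id": 17,
     "bbox": [300.0, 40.0, 120.0, 90.0],
     "segmentation": [[300.0, 40.0, 420.0, 40.0, 420.0, 130.0, 300.0, 130.0]]}
  ]
}
"""
coco = json.loads(coco_text)

# Categories map a numeric id to a human-readable label.
category_names = {c["id"]: c["name"] for c in coco["categories"]}


def annotations_by_image(doc):
    """Group annotations under the id of the image they belong to."""
    grouped = {}
    for ann in doc["annotations"]:
        grouped.setdefault(ann["image_id"], []).append(ann)
    return grouped


grouped = annotations_by_image(coco)
for image in coco["images"]:
    labels = [category_names[a["category_id"]] for a in grouped.get(image["id"], [])]
    print(image["file_name"], labels)
```

Each annotation points at its image through image_id and at its label through category_id, which is why a loader typically groups annotations per image before generating anything from them.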

The ApertureDB SDK provides methods to ingest a dataset from a file in the COCO JSON (COCO object detection annotation) format. You can refer to its implementation here. It is defined as a Query Generator through its base class PyTorchData, for which it implements getitem.

It is named after PyTorch because it parses the annotations through the PyTorch (torchvision) class CocoDetection.

The role of getitem here is to convert the x, y tuples and the other annotation information into equivalent ApertureDB queries.

Combining this schema with the query and blobs that ApertureDB expects, we can generate commands per record as shown below:

[{
    "AddImage": {
        "_ref": 1
        # .... other key,value properties ....
    }
}, {
    "AddBoundingBox": {
        "image_ref": 1,
        "label": "Dog"
        # .... other key,value properties ....
    }
}, {
    "AddBoundingBox": {
        "image_ref": 1,
        "label": "Cat"
        # .... other key,value properties ....
    }
}, {
    "AddPolygon": {
        "image_ref": 1,
        "label": "Dog"
        # .... other key,value properties ....
    }
}, {
    "AddPolygon": {
        "image_ref": 1,
        "label": "Cat"
        # .... other key,value properties ....
    }
}]
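The conversion described above can be sketched in plain Python. This is not the SDK's implementation: the function coco_record_to_query is hypothetical, the sample record is made up, and the parameter names "properties", "rectangle", and "polygons" are assumptions about how the elided key/value pairs might be filled in. The COCO side (bbox as [x, y, width, height], segmentation as flat [x1, y1, x2, y2, ...] lists) follows the COCO format.

```python
# Hand-written COCO-style record (illustrative values only).
image = {"id": 1, "file_name": "000001.jpg"}
category_names = {18: "dog"}
annotations = [{
    "image_id": 1,
    "category_id": 18,
    "bbox": [10.0, 20.0, 200.0, 150.0],  # COCO order: x, y, width, height
    "segmentation": [[10.0, 20.0, 210.0, 20.0, 210.0, 170.0, 10.0, 170.0]],
}]


def coco_record_to_query(image, annotations, category_names):
    """Build the list of commands for one image and its annotations."""
    # "properties" and the property name are assumptions for illustration.
    query = [{"AddImage": {"_ref": 1,
                           "properties": {"file_name": image["file_name"]}}}]
    for ann in annotations:
        label = category_names[ann["category_id"]]
        x, y, width, height = ann["bbox"]
        query.append({"AddBoundingBox": {
            "image_ref": 1,
            "label": label,
            # Assumed rectangle layout, rounded to whole pixels.
            "rectangle": {"x": round(x), "y": round(y),
                          "width": round(width), "height": round(height)},
        }})
        segmentation = ann.get("segmentation")
        if isinstance(segmentation, list):  # skip RLE masks, which are dicts
            for flat in segmentation:
                # Pair up the flat [x1, y1, x2, y2, ...] list into [x, y] points.
                points = [[flat[i], flat[i + 1]]
                          for i in range(0, len(flat) - 1, 2)]
                query.append({"AddPolygon": {
                    "image_ref": 1, "label": label, "polygons": [points]}})
    return query


query = coco_record_to_query(image, annotations, category_names)
for command in query:
    print(command)
```

Every annotation command refers back to the image through image_ref, matching the _ref assigned in AddImage, so the whole record can be sent as a single query.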

Kaggle Repository

Kaggle is a popular place to host datasets. ApertureDB lets users tie into those repositories by implementing a Query Generator that converts the records of these contributed datasets into a series of queries that persist the dataset in ApertureDB.

The datasets on Kaggle are often in bespoke formats, which is why we demonstrate how to work with them through a reference implementation using the CelebA dataset. The CelebA dataset consists of 202,599 face images of various celebrities, with each image annotated with binary attributes such as bald, Black_Hair, etc.

Each binary attribute is encoded as -1 for false and 1 for true. Each image also has 5 key facial points.

The Query Generator converts each record into a query such as:

[{
    "AddImage": {
        "_ref": 1,
        "bald": -1,
        "Black_Hair": 1,
        "keypoints" : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}]
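As a rough sketch of such a Query Generator, the class below turns rows of two small CSV tables into AddImage commands laid out like the listing above. It is not the reference implementation: the class name CelebAQueryGenerator, the CSV column names, and all values are assumptions for illustration, and a real generator would also return the image bytes as the blob for each record.

```python
import csv
import io

# Two made-up rows in the style of a CelebA attribute table (assumed columns).
ATTRIBUTES_CSV = """image_id,Bald,Black_Hair
000001.jpg,-1,1
000002.jpg,1,-1
"""

# Matching made-up landmark rows: 5 points stored as 10 x, y values.
LANDMARKS_CSV = """image_id,lefteye_x,lefteye_y,righteye_x,righteye_y,nose_x,nose_y,leftmouth_x,leftmouth_y,rightmouth_x,rightmouth_y
000001.jpg,69,109,106,113,77,142,73,152,108,154
000002.jpg,69,110,107,112,81,135,70,151,108,153
"""


class CelebAQueryGenerator:
    """Hypothetical generator: one AddImage command list per record."""

    def __init__(self, attributes_text, landmarks_text):
        attribute_rows = list(csv.DictReader(io.StringIO(attributes_text)))
        landmark_rows = {
            row["image_id"]: row
            for row in csv.DictReader(io.StringIO(landmarks_text))
        }
        # Join the two tables on image_id so each record is self-contained.
        self.records = [
            (row, landmark_rows[row["image_id"]]) for row in attribute_rows
        ]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        attributes, landmarks = self.records[index]
        # Keep the dataset's own encoding: -1 for false, 1 for true.
        values = {
            name: int(value)
            for name, value in attributes.items()
            if name != "image_id"
        }
        # Flatten the 5 landmark points into [x0, y0, x1, y1, ...].
        keypoints = [
            int(value) for name, value in landmarks.items() if name != "image_id"
        ]
        return [{"AddImage": {"_ref": 1, **values, "keypoints": keypoints}}]


generator = CelebAQueryGenerator(ATTRIBUTES_CSV, LANDMARKS_CSV)
for i in range(len(generator)):
    print(generator[i])
```

Implementing only a length and an indexed accessor keeps the generator independent of how the records are loaded, which suits the bespoke layouts that Kaggle datasets tend to have.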

Hugging Face

Coming soon...

Croissant

Coming soon...
