Online Dataset Formats
In addition to the ingestion methods for data that your pipelines generate periodically or that already lives in existing databases, we also support ingestion from some commonly used public dataset formats. This relies on Query Generators, which can be extended for specific dataset formats, as the popular examples below explain.
COCO JSON Format
COCO is one of the most popular datasets for object detection, and its annotation format, usually referred to as the “COCO format”, has been widely adopted. The “COCO format” is a JSON structure that governs how labels and metadata are formatted for a dataset (official documentation here). It is commonly used for training and inference in object detection tasks.
The ApertureDB SDK provides methods to ingest a dataset available as a COCO JSON or COCO object detection annotation format file. You can refer to its implementation here.
It is defined as a Query Generator through its base class PyTorchData, for which it implements getitem. The name includes PyTorch because parsing the annotations relies on the PyTorch (torchvision) class CocoDetection.
The role of getitem here is to convert the values of the x, y tuples and other information into equivalent ApertureDB queries.
Combining this schema with the query and blobs that ApertureDB expects, we can generate commands per record as shown below:
[{
    "AddImage": {
        "_ref": 1,
        # .... other key,value properties ....
    }
}, {
    "AddBoundingBox": {
        "image_ref": 1,
        "label": "Dog",
        # .... other key,value properties ....
    }
}, {
    "AddBoundingBox": {
        "image_ref": 1,
        "label": "Cat",
        # .... other key,value properties ....
    }
}, {
    "AddPolygon": {
        "image_ref": 1,
        "label": "Dog",
        # .... other key,value properties ....
    }
}, {
    "AddPolygon": {
        "image_ref": 1,
        "label": "Cat",
        # .... other key,value properties ....
    }
}]
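The per-record commands above can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: the function name coco_record_to_query is hypothetical, and the "properties", "rectangle", and "polygons" keys are assumptions about the full command shapes that the listing abbreviates. The COCO side follows the published format, where bbox is [x, y, width, height] and a polygon segmentation is a list of flat [x1, y1, x2, y2, ...] lists.

```python
from typing import Any


def coco_record_to_query(
    image_props: dict[str, Any],
    annotations: list[dict[str, Any]],
    categories: dict[int, str],
) -> list[dict[str, Any]]:
    """Build one list of commands for a single image (hypothetical helper)."""
    # The image gets a local reference so later commands can point at it.
    # Nesting metadata under "properties" is an assumption of this sketch.
    query: list[dict[str, Any]] = [
        {"AddImage": {"_ref": 1, "properties": dict(image_props)}}
    ]
    for ann in annotations:
        label = categories[ann["category_id"]]

        # COCO boxes are [x, y, width, height]; rounding to whole pixels
        # is a choice made for this sketch.
        x, y, w, h = ann["bbox"]
        query.append({
            "AddBoundingBox": {
                "image_ref": 1,
                "label": label,
                "rectangle": {  # assumed key name
                    "x": round(x), "y": round(y),
                    "width": round(w), "height": round(h),
                },
            }
        })

        # Polygon segmentations are lists of flat coordinate lists.
        # Crowd annotations use an RLE dict instead and are skipped here.
        seg = ann.get("segmentation")
        if isinstance(seg, list) and seg:
            polygons = [
                [[flat[i], flat[i + 1]] for i in range(0, len(flat) - 1, 2)]
                for flat in seg
            ]
            query.append({
                "AddPolygon": {
                    "image_ref": 1,
                    "label": label,
                    "polygons": polygons,  # assumed key name
                }
            })
    return query


# Example with made-up values for one image and one annotation.
example = coco_record_to_query(
    {"file_name": "example.jpg", "width": 640, "height": 480},
    [{"category_id": 1, "bbox": [10.0, 20.0, 100.0, 50.0],
      "segmentation": [[10, 20, 110, 20, 110, 70]]}],
    {1: "Dog"},
)
```

Each command is its own single-key dictionary, so repeated AddBoundingBox or AddPolygon commands for the same image do not overwrite each other.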
Further reading
- Queries on COCO Dataset (Based on Metadata or Embeddings)
- AddImage: Query Command to Add Image with Metadata.
- AddBoundingBox: Introduce bounding box regions of interest
- AddPolygon: Introduce polygon regions of interest
Kaggle Repository
Kaggle is a popular place to host datasets, and ApertureDB lets users tie into those repositories to implement a Query Generator that can convert the records from these contributed datasets into a series of queries that can persist the dataset into ApertureDB.
The datasets on Kaggle are often in bespoke formats, which is why we demonstrate how to work with them through a reference implementation using the CelebA dataset. The CelebA dataset consists of 202,599 face images of various celebrities, with each image annotated with binary attributes such as bald, Black_Hair, and so on.
- It also has 5 landmark points marking key facial features.
- It encodes false as -1 and true as 1.
The Query Generator converts each record into a query such as the following:
[{
"AddImage": {
"_ref": 1,
"bald": -1,
"Black_Hair": 1,
"keypoints" : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
}
}]
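As a rough sketch of that conversion, the function below turns one CelebA record into a query with the same simplified layout as the listing above. The names celeba_row_to_query and decode_attributes are hypothetical, and the inline placement of attributes mirrors the listing rather than a confirmed command schema. The five (x, y) landmark points are flattened into a single ten-value keypoints list, and the -1/1 encoding is kept as-is.

```python
from typing import Any


def celeba_row_to_query(
    attributes: dict[str, int],
    landmarks: list[tuple[int, int]],
) -> list[dict[str, Any]]:
    """Build a single-command query for one CelebA image (hypothetical helper)."""
    # CelebA encodes false as -1 and true as 1; reject anything else.
    for name, value in attributes.items():
        if value not in (-1, 1):
            raise ValueError(f"attribute {name!r} must be -1 or 1, got {value!r}")

    # Flatten [(x1, y1), (x2, y2), ...] into [x1, y1, x2, y2, ...].
    keypoints = [coord for point in landmarks for coord in point]

    # Attributes are placed inline to mirror the simplified listing.
    command = {"_ref": 1, **attributes, "keypoints": keypoints}
    return [{"AddImage": command}]


def decode_attributes(attributes: dict[str, int]) -> dict[str, bool]:
    """Translate the -1/1 encoding into booleans for readability."""
    return {name: value == 1 for name, value in attributes.items()}


# Example with made-up landmark coordinates.
query = celeba_row_to_query(
    {"bald": -1, "Black_Hair": 1},
    [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)],
)
```

Keeping the original -1/1 values in the stored record matches the dataset's own encoding; decode_attributes is only a convenience for reading them back as booleans.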
Further reading
- Queries on CelebA Dataset (Based on Metadata or Embeddings)
- AddImage: Query Command to Add Image with Metadata.
- Descriptors or embeddings: Example of working with embeddings
Hugging Face
Coming soon...
Croissant
Coming soon...