Integrate ApertureDB into Labeling Workflows
Labeling is a common task in supervised machine learning, involving multiple media types and many types of annotations. The process generates even more information that needs to be stored with its corresponding media. Most labeling tasks are still very human-driven because of the requirements for accuracy. For more detail, read about Data Labeling and Annotation.
Because labeling has so many moving parts, keeping all the items organized, synchronized, and searchable can be hard. ApertureDB makes this easy.
Before diving in, decide how you want your pipeline to work. First, identify where in your pipeline ApertureDB will reside. Given its design, ApertureDB can serve as a source for pipeline data, a destination, or both. Once that is determined, you can create connectors between the various components.
All labeling tools require importing the items to be labeled. Some want the resources local to the labeling tool, but it is common to serve them from a remote location to help avoid data silos.
Data can be imported into ApertureDB using our Python data loaders or a REST interface, depending on which method best suits the existing infrastructure and labeling requirements. In general, a push-based system works best with a REST interface, and a pull-based system favors a Python data loader.
After you have imported your data into ApertureDB, you can create annotation tasks for your media in the labeling system. Creating annotation tasks is often very specific to the tool. The important part is ensuring that whatever metadata the system exports is enough for you to identify the media the annotations belong to. Naming media with a GUID is a good way to ensure easy mapping, because a labeling system often has internal IDs that are generated by how it loads tasks. You then feed the media into the tool, either as a URL or as binary data. ApertureDB can automate this by serving the images and using metadata to keep track of which media has not yet been imported into the tool.
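The GUID naming and tracking idea can be sketched in a few lines of Python. This is a minimal illustration, not ApertureDB code: the `catalog` dictionary is a hypothetical in-memory stand-in for properties you would store with each image, and the function names and the `sent_to_labeler` flag are assumptions made for this example.

```python
import uuid


def register_media(catalog, path):
    """Assign a GUID to a media file and record that it has not been sent yet."""
    guid = str(uuid.uuid4())
    # "sent_to_labeler" is a hypothetical metadata flag used for tracking.
    catalog[guid] = {"path": path, "sent_to_labeler": False}
    return guid


def pending_media(catalog):
    """Return the GUIDs of media not yet imported into the labeling tool."""
    return [guid for guid, meta in catalog.items() if not meta["sent_to_labeler"]]


def mark_sent(catalog, guid):
    """Flag a media item once a labeling task has been created for it."""
    catalog[guid]["sent_to_labeler"] = True
```

Because the GUID is chosen before the tool ever sees the media, it stays stable no matter which internal ID the labeling system assigns later.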
The last step is getting labeled data out of the tool. This again is very specific to the tool, but the data will be available in a structured format, so the remaining work is a simple mapping into the format used by ApertureDB. Depending on the requirements and the tool, data can be exported on a schedule or via a trigger from the tool.
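As a rough illustration of that mapping step, the sketch below groups exported labels by the GUID embedded in each media URL. The record layout (`media_url`, `labels`, `internal_id`) is a hypothetical export format invented for this example; a real tool will use its own field names, and the final conversion into ApertureDB's format is left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def guid_from_url(url):
    """Recover the GUID from a media URL such as http://host/images/<guid>.jpg."""
    return PurePosixPath(urlparse(url).path).stem


def map_export(records):
    """Group exported labels by media GUID, ignoring the tool's internal ids."""
    by_guid = {}
    for record in records:
        guid = guid_from_url(record["media_url"])
        by_guid.setdefault(guid, []).extend(record["labels"])
    return by_guid
```

Keying the result by GUID rather than by the tool's internal ID is what makes the export usable after tasks are reloaded or renumbered.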
Working with Label Studio
Here we discuss examples and information relevant to specific labeling tools. This serves as an example for any other tool you might be using. Please reach out to us (email@example.com) for specific details.
The example used here is a basic labeling project that imports images and produces labels. More complex mappings are supported but are beyond the scope of this document.
Some familiarity with the Label Studio API is helpful, specifically the endpoints for importing tasks and exporting annotations.
The task import documentation can be a little confusing because the endpoint is overloaded. The easiest import method is a file that lists just the resources to be imported and contains no extra information. Extra information can be included by using a CSV file whose columns are mapped into the labeling UI by the label config, in particular through its variables.
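A CSV task file with extra columns could be produced as follows. This is a sketch under assumptions: the column names `image` and `guid` are chosen for this example, with `image` intended to match a `$image` variable in the label config, and the helper name is hypothetical.

```python
import csv
import io


def tasks_to_csv(rows, fieldnames=("image", "guid")):
    """Serialize task rows to CSV text, one task per line after the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Each column name becomes a field on the task, so carrying the GUID alongside the URL means it should come back with the exported annotations.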
With Label Studio, when importing a list of images, it is important to note that the tool simply stores those URLs and then attempts to load them from the browser, which makes the CORS policy of the server hosting the images important. Our example shows a simple way to satisfy this: add an Access-Control-Allow-Origin: * header to the responses. An open policy like this is perfect for testing, but security should be tightened before production.
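One way to add that header when serving files from Python is to subclass the standard library's static file handler. This is a minimal sketch for local testing, not the example script referred to above; the names `CORSRequestHandler`, `make_media_server`, and `ALLOWED_ORIGIN` are assumptions for this illustration.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Open policy: fine for testing, replace with the labeling UI's origin in production.
ALLOWED_ORIGIN = "*"


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds a CORS header to every response."""

    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        super().end_headers()


def make_media_server(directory, host="127.0.0.1", port=8000):
    """Build (but do not start) a server that serves files from `directory`."""
    handler = functools.partial(CORSRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)
```

Calling `make_media_server("./images").serve_forever()` starts serving. Tightening the policy later only requires changing `ALLOWED_ORIGIN` to the origin of the Label Studio UI.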
Since the Label Studio UI is web-based, it expects media to be served over HTTP(S). Our Label Studio example serves the images with a simple Python script and generates the list of all the tasks to be added from the served files. You can instead serve the images from a dedicated server such as nginx, or serve them from within ApertureDB and use a simple REST gateway to generate the list and serve the files automatically.
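Generating the task list from a directory of served images can be as small as the sketch below. The function name, the set of accepted suffixes, and the assumption that files are served flat under `base_url` are all choices made for this illustration.

```python
from pathlib import Path
from urllib.parse import quote

# Hypothetical filter: extend with whatever media types you serve.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_image_urls(directory, base_url):
    """Return one URL per image file in `directory`, in sorted order."""
    base = base_url.rstrip("/")
    return [
        f"{base}/{quote(path.name)}"  # percent-encode spaces and similar characters
        for path in sorted(Path(directory).iterdir())
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    ]
```

The resulting list can be written to a file one URL per line, or turned into richer task records when extra metadata is needed.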
Next, you must import the list of tasks into Label Studio. Our example shows how to use the API to load a single URL. You can easily change this to a CSV file if you need extra data. After loading the tasks, the next piece is exporting the annotations. Our example does a single full export without batching, which may not work if you have a lot of data to export. When running an export on a schedule, you may want to ensure that only data that has changed since the last run is exported.
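The import and export calls, plus a filter for incremental exports, might look like the sketch below. It is not the example referred to above. The endpoint paths (`/api/projects/<id>/import` and `/api/projects/<id>/export?exportType=JSON`), the `Token` authorization header, and the `updated_at` field on exported tasks are assumptions about the Label Studio API that should be checked against the version you run; the function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime


def build_import_request(base_url, token, project_id, image_urls):
    """Build a POST request that creates one task per image URL."""
    tasks = [{"data": {"image": url}} for url in image_urls]
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/projects/{project_id}/import",
        data=json.dumps(tasks).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_export_request(base_url, token, project_id):
    """Build a GET request for a single full JSON export (no batching)."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/projects/{project_id}/export?exportType=JSON",
        headers={"Authorization": f"Token {token}"},
        method="GET",
    )


def changed_since(tasks, last_run):
    """Keep only tasks updated after `last_run` (a timezone-aware datetime)."""
    def updated(task):
        # Assumes ISO 8601 timestamps with a trailing "Z" in "updated_at".
        return datetime.fromisoformat(task["updated_at"].replace("Z", "+00:00"))

    return [task for task in tasks if updated(task) > last_run]
```

Passing either request to `urllib.request.urlopen` sends it. For a scheduled export, store the time of each successful run and pass it as `last_run` the next time, so only new or modified tasks are mapped back into ApertureDB.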