Ingest from Bucket Workflow
This workflow allows you to ingest data from an AWS or GCP bucket for which you have credentials.
Creating the workflow
This workflow supports multiple cloud providers:
- AWS S3
- GCP GS
This workflow allows you to ingest data such as images, videos, and PDFs from the AWS Simple Storage Service (S3) into ApertureDB, letting you use your own existing data. It provides an easy way to get started with ApertureDB and to see how it works with real data.
![Configure ingest from S3](/assets/images/configure_ingest_from_s3-18ac626532f43ac2bb47fe083abe0313.png)
- Enter the AWS S3 bucket name. You may be able to list your S3 buckets using `aws s3 ls` (see the verification sketch after this list) or by using the AWS S3 Console.
- Enter the AWS access key. See AWS documentation for details on how to generate these keys or the Getting Credentials section on this page. It is important that the credentials have the appropriate permissions to access the bucket; see the Setting Permissions section on this page. For your own security, you may wish to generate new keys for this purpose.
- Enter the AWS secret access key; this is a password field.
- Decide whether you want to process images, videos, or PDFs. You can choose as many as you want, but you should select at least one.
- Click the blue button at the bottom.
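Before submitting the form, you may want to confirm from the command line that the credentials you plan to enter can actually see the bucket. A minimal sketch using the AWS CLI, where `YOUR_BUCKET_NAME` and the profile name `aperturedb-ingest` are hypothetical placeholders:

```bash
# Store the access key pair in a dedicated profile (prompts for key ID, secret, and region).
aws configure --profile aperturedb-ingest

# Confirm the credentials can list your buckets.
aws s3 ls --profile aperturedb-ingest

# Confirm the credentials can list objects inside the target bucket.
aws s3 ls s3://YOUR_BUCKET_NAME --profile aperturedb-ingest
```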
This workflow allows you to ingest data such as images, videos, and PDFs from Google Cloud Storage (GS) into ApertureDB, letting you use your own existing data. It provides an easy way to get started with ApertureDB and to see how it works with real data.
![Configure ingest from GS](/assets/images/configure_ingest_from_gs-681692634d307761eca4ceda20a2e329.png)
- Enter the GCP GS (Google Cloud Storage) bucket name. You may be able to list your buckets using `gsutil ls` (see the verification sketch after this list) or via the Google Cloud Storage Console.
- Enter the GCP service account key JSON. See Google Cloud documentation or the Getting Credentials section on this page for details on creating and downloading service account keys. It is important that the credentials have the appropriate permissions to access the bucket; see the Setting Permissions section on this page. For security, consider creating a key with limited permissions and deleting it when no longer needed.
- Decide whether you want to process images, videos, or PDFs. You can choose as many as you want, but you should select at least one.
- Click the blue button at the bottom.
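As with S3, you can check from the command line that the service account key can see the bucket before configuring the workflow. A minimal sketch using the Google Cloud SDK, where `key.json` and `YOUR_BUCKET_NAME` are hypothetical placeholders:

```bash
# Authenticate as the service account using the downloaded JSON key.
gcloud auth activate-service-account --key-file=key.json

# Confirm the service account can list the bucket's objects.
gsutil ls gs://YOUR_BUCKET_NAME
```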
See the results
Results will start to appear in your database as soon as your bucket's status is 'Started'.
To view data you have ingested, go to the Web UI for your instance.
Getting Credentials
If you need help getting the proper credentials for your bucket, the following hints can help users with standard configurations.
- AWS S3
- GCP GS
From the console, first type 'IAM' into the search box at the top:
Select 'IAM', and find 'Users' in the menu on the left.
Once you select 'Users', find the user that you will use to access the data. Use search if you have many users. Click on the link in the 'User name' column.
Once you are on the page for the user, click the tab labeled 'Security credentials'.
Now scroll down until you see a section labeled 'Access keys', and choose 'Create access key'.
Now choose 'Application running outside AWS' and click 'Next'.
Choose a name that will mean something to you, and click 'Create access key'.
Now retrieve your access key information, either by copying the access key and secret access key, or by downloading a CSV file.
Once you no longer need your key, delete it.
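If you prefer the command line, the same key lifecycle can be handled with the AWS CLI. A sketch, where `YOUR_USER_NAME` and the example key ID are hypothetical placeholders:

```bash
# Create a new access key pair for the user; the secret is only shown once, so copy it now.
aws iam create-access-key --user-name YOUR_USER_NAME

# When the key is no longer needed, delete it by its access key ID (placeholder shown).
aws iam delete-access-key --user-name YOUR_USER_NAME --access-key-id AKIAXXXXXXXXXXXXXXXX
```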
First, log into your Google Cloud console.
Select 'IAM' from the hamburger menu.
Next, select 'Service accounts' from the menu on the left.
Once at the service accounts page, if you need to create an account, choose 'Create Service Account' and choose a name.
From this page, select your account by clicking on its name.
Select the 'Keys' tab.
From the keys tab, select 'Add Key'.
This will launch a popup. The default type should be 'JSON', which is what we require here. Click 'Create'.
A file download dialog will appear. Save the file and open it in any text editor; its contents are the key to copy into the configuration. When you are finished with the key, delete it by pressing the trashcan icon in the 'Keys' tab for the service account.
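The same key lifecycle is also available through the gcloud CLI. A sketch, where `YOUR_SA`, `YOUR_PROJECT_ID`, and `KEY_ID` are hypothetical placeholders:

```bash
# Create and download a JSON key for the service account.
gcloud iam service-accounts keys create key.json \
  --iam-account=YOUR_SA@YOUR_PROJECT_ID.iam.gserviceaccount.com

# List the keys to find the key ID, then delete the key when it is no longer needed.
gcloud iam service-accounts keys list \
  --iam-account=YOUR_SA@YOUR_PROJECT_ID.iam.gserviceaccount.com
gcloud iam service-accounts keys delete KEY_ID \
  --iam-account=YOUR_SA@YOUR_PROJECT_ID.iam.gserviceaccount.com
```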
Setting Permissions
Permission management is critical for allowing access to your data, and we strive to use minimal permissions for our workflows.
- AWS S3
- GCP GS
Giving a user the `ReadOnlyAccess` managed policy is a simple way to provide adequate access.
The minimal access required is as follows:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::YOUR_BUCKET_NAME",
                    "arn:aws:s3:::YOUR_BUCKET_NAME/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListAllMyBuckets"
                ],
                "Resource": "*"
            }
        ]
    }
ListAllMyBuckets is used to verify that the account credentials have been
supplied correctly and to aid in detecting misconfiguration of bucket names or
permissions.
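If you manage IAM from the command line rather than the console, a policy like the one above can be attached as an inline user policy. A sketch, where `policy.json`, `YOUR_USER_NAME`, and the policy name are hypothetical placeholders:

```bash
# Attach the minimal bucket-access policy above (saved as policy.json) to the IAM user.
aws iam put-user-policy \
  --user-name YOUR_USER_NAME \
  --policy-name aperturedb-bucket-ingest \
  --policy-document file://policy.json
```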
Giving your service account the 'Viewer' role is an easy way to specify limited but sufficient permissions.
The minimal permissions are:
- `storage.buckets.get`
- `storage.objects.get`
- `storage.objects.list`
The object permissions can be restricted to just the target bucket by attaching an IAM condition to the role binding. Setting the condition to `resource.name.startsWith('projects/_/buckets/YOUR_BUCKET_NAME')` limits reads to only that bucket.
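As a sketch of how such a conditional binding might be granted with the gcloud CLI, where `YOUR_PROJECT_ID`, `YOUR_BUCKET_NAME`, and `YOUR_SA` are hypothetical placeholders:

```bash
# Grant object read access to the service account, limited to the target bucket by an IAM condition.
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:YOUR_SA@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer" \
  --condition='expression=resource.name.startsWith("projects/_/buckets/YOUR_BUCKET_NAME"),title=limit-to-ingest-bucket'
```

Note that `roles/storage.objectViewer` covers only the object permissions listed above; `storage.buckets.get` would still need to be granted separately, for example through a custom role.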