External Cloud Storage

Data management is an important part of the platform. Setting it up correctly contributes to lower costs and better performance in data processing, annotation work, and automation.

There are two high-level storage types:

Dataloop storage - Create datasets using Dataloop storage and upload your file items to have them securely stored on the Dataloop cloud.

External cloud storage (S3/Azure/GCS/Private container registry) - Connect your cloud storage to Dataloop. Your binaries (physical files such as images and videos) are stored on your external storage and are only indexed in the Dataloop platform. Dataloop does not create copies of your files/binaries; it only indexes them and reads them when access is required for annotation work or for processing by automation tools (services and pipelines).

Important considerations when setting up external cloud storage

  1. Consider storing your files in a region close to your annotators, for faster file serving. In annotation work, files are streamed from your storage directly to the end user, without having to go through Dataloop servers first, so faster serving can be key for efficient work.
  2. Write access is required to allow saving thumbnails, modalities, and converted files to a hidden 'dataloop' folder on your storage. A permissions test file will be written to your storage when the platform validates permissions.
  3. Annotations and metadata are stored in the Dataloop platform - if you delete a file from your external storage, you'll need to trigger a file-delete process in Dataloop, or set up upstream sync in advance to ensure these events are covered.

Setup Process

The setup process for external storage includes the following steps (a Python SDK sketch follows the list):

1. On your storage account (AWS S3/Azure Blob/GCP GCS/Private container registry) - prepare credentials and permissions to be used by Dataloop. See the step-by-step instructions provided for each storage type. The permissions required are:

  • List (Mandatory) - allowing Dataloop to list all items in the storage.
  • Get (Mandatory) - allowing Dataloop to read the items and perform pre-processing functionality such as generating thumbnails and extracting item info.
  • Put / Write (Mandatory) - lets you upload your items directly to the external storage from the Dataloop platform.
  • Delete - lets you delete your items directly from the external storage using the Dataloop platform.

2. In your Dataloop Organization - Create a new integration to enter the credentials prepared in the previous step. These are saved in a vault and can be used by projects owned by the organization.

3. In a specific project - create a new storage driver:

  • Browse to the “Cloud storage” page in the “Data Management” section of the left-side navigation menu.
  • Enter a name for the new storage driver.
  • Select from the dropdown an integration to use with this driver. Click "Add New Integration" to start the process of adding a new integration (requires an active organization).
  • Enter the bucket/container name to use.
  • If you wish to sync a specific folder and not the entire bucket, enter the folder path. Single file paths are not supported.
  • Check the "Allow delete items" option to allow Dataloop to delete items from your storage when they are deleted from the Dataloop dataset.

4. In the same project - Create a new dataset, set for external storage, using the storage driver created in the previous step.
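
For users who prefer working programmatically, the same setup can be sketched with the Dataloop Python SDK (dtlpy). This is a minimal sketch assuming an AWS S3 bucket; the project name, bucket name, region, and credentials are placeholders, and parameter names may vary slightly between SDK versions.

Python
    import dtlpy as dl

    if dl.token_expired():
        dl.login()

    project = dl.projects.get(project_name='my-project')  # placeholder project name

    # Step 2 - create an integration (credentials are stored in the organization vault)
    integration = project.integrations.create(
        integrations_type=dl.ExternalStorage.S3,
        name='my-s3-integration',
        options={'key': '<AWS access key ID>', 'secret': '<AWS secret access key>'})

    # Step 3 - create a storage driver pointing at a bucket (or a folder inside it)
    driver = project.drivers.create(
        name='my-s3-driver',
        driver_type=dl.ExternalStorage.S3,
        integration_id=integration.id,
        bucket_name='my-bucket',
        path='',                      # optional: sync only a specific folder
        allow_external_delete=True,   # equivalent to the "Allow delete items" option
        region='eu-west-1')

    # Step 4 - create a dataset backed by the storage driver
    dataset = project.datasets.create(dataset_name='my-external-dataset', driver=driver)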

Initial Sync

Once your dataset is set up correctly, an initial sync operation starts automatically, listing all file items in your storage and indexing them. Successful completion of the process requires that:

  1. Integration credentials are valid with sufficient permissions. Review the step-by-step instructions we provide for each supported storage type.
  2. You've enabled the 'Sync' option when creating the dataset - otherwise you can start the sync process manually from the dataset browser.
  3. You have sufficient Data-Points in your quota - you should have available Data-Points matching the number of file items you are going to sync. 

After starting the sync process, progress can be reviewed from the notifications area.
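
If you prefer to trigger a sync from code rather than from the dataset browser, a minimal sketch with the Dataloop Python SDK (project and dataset names are placeholders):

Python
    import dtlpy as dl

    project = dl.projects.get(project_name='my-project')
    dataset = project.datasets.get(dataset_name='my-external-dataset')

    # Re-scan the external storage and index any new file items
    dataset.sync()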

Ongoing Sync - Upstream & Downstream

Once you connect your cloud storage to a dataset and complete the initial sync, the dataset reflects your directory structure and file content. Managing your files in the dataset (moving between folders, cloning, merging, and other operations) acts as an augmented management layer and does not affect the binary files in your cloud storage.

  • Empty folders synced from your storage will not show within the platform.
  • Moving or renaming items on your external storage will duplicate the items within the platform.
  • New files generated by the Dataloop platform will be saved inside your bucket under the folder "/.dataloop" (including video thumbnails, .webm files, snapshots, etc.).
  • You cannot rename or move items on the Dataloop platform when they are linked to an external source.
  • Items can only be cloned from internal storage (i.e., the Dataloop cloud storage) to internal storage or from external storage to the same external storage

Cloned and Merged datasets in the Dataloop platform cannot be synced, since they are not directly indexed to the external storage.

Downstream

Syncing changes made in the Dataloop platform (e.g., adding or deleting files in the dataset) back to the original storage is referred to as downstream sync.

File deletion - By default, Dataloop will NOT delete your binaries; files removed from datasets in the Dataloop system will NOT be removed from your cloud storage. If you wish to allow such deletion and have your cloud storage synced with changes made on the Dataloop platform, you need to:

  1. Allow delete permissions in your AWS/GCS/Azure IAM.
  2. Check the "Allow delete" option in the storage driver.
  3. Check the "Downstream sync" option in the dataset settings.

New files - If you choose to add new files directly to the Dataloop dataset, and not through your external storage, Dataloop will attempt to write those files to your external storage, keeping them in sync.
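
As a rough illustration of downstream behavior, the sketch below uses the Dataloop Python SDK to add and delete an item on an externally backed dataset; names and paths are placeholders, and the delete only propagates to the bucket when the options listed above are enabled.

Python
    import dtlpy as dl

    project = dl.projects.get(project_name='my-project')
    dataset = project.datasets.get(dataset_name='my-external-dataset')

    # Adding a file through Dataloop - the binary is written back to your bucket
    item = dataset.items.upload(local_path='/path/to/image.jpg')

    # Deleting through Dataloop - the binary is removed from your bucket only if
    # "Allow delete" and "Downstream sync" are enabled as described above
    item.delete()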

Upstream

Syncing any file-item changes (create, move, delete, update) from your external storage into Dataloop is considered upstream sync.

Upstream sync takes place in several ways:

  1. Automatically, once - when the storage driver is created (if the option was enabled when creating the dataset)
  2. Manually - Every time you click the "Sync" button for a specific dataset (from the dataset browser, visible only to users with the Developer role and higher), a scan of your storage is triggered to index new files into Dataloop. The scan does not remove items whose source files were deleted, so files deleted from the cloud storage may remain as 'ghost' files in your dataset.
  3. Automatic upstream sync - Dataloop offers code that monitors for changes in your bucket and updates the respective datasets accordingly, allowing for automatic upstream sync of your data when file items are added or deleted (see the sketch below). The code is available on the Dataloop GitHub. For more information, read here.
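
The production-ready implementation is available on the Dataloop GitHub; the sketch below only illustrates the idea, assuming an AWS Lambda handler subscribed to S3 object-created and object-removed events. The dataset ID is a placeholder and the event parsing is simplified.

Python
    import dtlpy as dl

    DATASET_ID = 'my-dataset-id'  # placeholder

    def handler(event, context):
        """AWS Lambda entry point for S3 ObjectCreated / ObjectRemoved notifications."""
        dataset = dl.datasets.get(dataset_id=DATASET_ID)
        for record in event['Records']:
            key = record['s3']['object']['key']
            if record['eventName'].startswith('ObjectCreated'):
                # Index newly added files (the Dataloop GitHub code indexes single items;
                # a full re-scan is used here for simplicity)
                dataset.sync()
            elif record['eventName'].startswith('ObjectRemoved'):
                # Remove the corresponding item so it does not remain as a 'ghost' file
                dataset.items.delete(filename='/' + key)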

Google Private Container Registry

The following information is required for the 'private container registry' selection type:

  • Server - mandatory
  • Password - mandatory

    Note that the password is the base64-encoded content of the service account key, downloaded as a JSON file in the following format:

    JSON
    {
        "type": "service_account",
        "project_id": "",
        "private_key_id": "",
        "private_key": "",
        "client_email": "",
        "client_id": "",
        "auth_uri": "",
        "token_uri": "",
        "auth_provider_x509_cert_url": "",
        "client_x509_cert_url": ""
    }
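
Since the password is the base64 encoding of this key file (as described above), it can be produced with a short script such as the one below; the file name is a placeholder.

Python
    import base64

    # Read the downloaded service account key and base64-encode it for the
    # 'Password' field of the private container registry integration
    with open('service-account-key.json', 'rb') as f:
        encoded_password = base64.b64encode(f.read()).decode('utf-8')

    print(encoded_password)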