External Cloud Storage

The Dataloop platform has two types of storage: Dataloop cloud storage and external cloud storage (S3, Azure Blob, GCS, or a private container registry).

To connect your external storage to Dataloop, you must be a member of an Organization, and have it set as your active organization.

When using external storage, your binaries (physical files such as images and videos) are stored on your external storage and are only indexed in the Dataloop platform. After the initial indexing that takes place when the storage is first added, changes in your storage are not reflected in Dataloop until you manually initiate a new sync process (from the dataset browser). However, adding files to the synced dataset will add them to your storage, and deleting files from the dataset will remove them from your storage if you selected the 'Allow delete' option.
To display and manage these items, Dataloop needs permissions on your external storage.

Write access is required to save thumbnails, modalities, and converted files on your storage. A "test-file" will be written to your storage when the platform validates permissions.
Annotations and metadata are stored on the platform, so when an item is deleted from your external storage, its corresponding annotations and metadata remain on the platform.

How External Storage Is Set Up

Connecting files stored on external storage is a three-step process:

  1. Storage Integration - the initial connection to the cloud storage, where keys and secrets are stored in a vault. It is created at the organization level ("Integrations" menu), so the organization owner can enter the details once without having to share the keys with anyone else.
  2. Storage Driver - a driver connects to a specific storage bucket. Drivers are set up at the project level, so each project can be connected to a different bucket, all relating to the single integration.
  3. Dataset Connection - when starting a new dataset, link it to a storage driver to have the dataset contain the files in that bucket.
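
The first step can also be done from code. The snippet below is only a minimal sketch using the dtlpy Python SDK; the project name, integration name, and credentials are placeholders, and parameter names may vary between SDK versions, so check the SDK reference for your version.

Python
    import dtlpy as dl

    # Log in and select a project that belongs to your active organization
    if dl.token_expired():
        dl.login()
    project = dl.projects.get(project_name='My Project')  # placeholder name

    # Step 1 - Storage Integration: store the cloud keys/secrets in the vault
    integration = project.integrations.create(
        integrations_type=dl.ExternalStorage.S3,   # or GCS / AZUREBLOB
        name='my-s3-integration',                  # placeholder name
        options={'key': '<access key id>', 'secret': '<secret access key>'})

Steps 2 and 3 (the storage driver and the dataset connection) are covered in the "Creating Storage Driver" section below.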

Setting Up External Storage Integration

Permissions

  • List (Mandatory) - allows Dataloop to list all items in the storage.
  • Get (Mandatory) - allows Dataloop to get items and perform pre-processing functionality such as generating thumbnails and reading item info.
  • Put / Write (Mandatory) - lets you upload items directly to the external storage from the Dataloop platform.
  • Delete - lets you delete items directly from the external storage using the Dataloop platform.

Limitations & Considerations

  • Empty folders in your storage are not synced and will not show in the platform.
  • Moving or renaming items on your external storage will duplicate the items within the platform.
  • New files generated by the Dataloop platform will be saved inside your bucket under the folder "/.dataloop" (including video thumbnails, .webm files, snapshots, etc.).
  • You cannot rename or move items on the Dataloop platform when they are linked to an external source.
  • Items can only be cloned from internal storage (i.e., Dataloop cloud storage) to internal storage, or from external storage to the same external storage.

Permissions & Policies

AWS S3 Access & Policy

The IAM policy used for creating the integration to your S3 bucket should include the following:

  • s3:ListBucket - mandatory
  • s3:PutObject - for write permissions - mandatory
  • s3:DeleteObject - for the 'Allow delete' option
  • s3:GetObject - mandatory

For a step-by-step guide on setting up IAM policy in AWS, read here.
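
As an illustration only (not an official Dataloop policy document), the sketch below creates a policy with the actions listed above using boto3; the bucket name and policy name are placeholders.

Python
    import json
    import boto3

    BUCKET = 'my-dataloop-bucket'  # placeholder

    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {   # bucket-level permission
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{BUCKET}"},
            {   # object-level permissions; drop s3:DeleteObject if you will not
                # use the 'Allow delete' option
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/*"}]}

    iam = boto3.client('iam')
    iam.create_policy(PolicyName='dataloop-s3-access',
                      PolicyDocument=json.dumps(policy_document))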

GCS Access & Policy

  1. Create a user with an IAM policy that includes the following permissions:
    1. storage.objects.list - mandatory
    2. storage.objects.delete
    3. storage.objects.create - mandatory
    4. storage.objects.get - mandatory
    5. storage.buckets.get - mandatory
  2. Create a new private key and download the JSON key file.

For a step-by-step guide on setting up IAM policy in GCS, read here.
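
Before entering the key in Dataloop, you can optionally sanity-check that the downloaded service-account key grants the permissions above. This is a sketch using the google-cloud-storage package; the key file name and bucket name are placeholders.

Python
    from google.cloud import storage

    client = storage.Client.from_service_account_json('service-account-key.json')
    bucket = client.get_bucket('my-gcs-bucket')  # requires storage.buckets.get
    # Listing requires storage.objects.list
    first = next(iter(client.list_blobs(bucket, max_results=1)), None)
    print('Bucket reachable. First object:', first.name if first else '(empty bucket)')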

Azure Integration Setup

  1. Register a new app as "dataloop-app".
  2. Locate the app's clientID and tenantID.
  3. Create a new client secret in "Certificates & secrets".
  4. In "Storage accounts", select your account (or create a new one), go to "Containers", and select the container to grant the Dataloop app permissions to.
  5. In IAM, select to add a role assignment: "Role: Storage Blob Data Contributor".
  6. With dataloop-app selected, add the following role assignment:
    1. Role: Storage Blob Data Contributor
    2. Assign access to: User, group, or service principal
    3. Select: dataloop-app
  7. It can take up to 5 minutes for the permissions to update and become available to Dataloop.

For a step-by-step guide on setting up a policy in Azure, read here.
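
Optionally, you can verify the app registration and role assignment with a short script before creating the integration. This is a sketch using the azure-identity and azure-storage-blob packages; all values are placeholders.

Python
    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    credential = ClientSecretCredential(
        tenant_id='<tenantID>',            # from the registered dataloop-app
        client_id='<clientID>',
        client_secret='<client secret>')
    service = BlobServiceClient(
        account_url='https://<storage-account>.blob.core.windows.net',
        credential=credential)
    container = service.get_container_client('<container-name>')
    # Succeeds once the 'Storage Blob Data Contributor' assignment has propagated
    print(next(iter(container.list_blobs()), None))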

AWS Secure Token Service (STS)

AWS STS allows customers to set up temporary, limited-privilege credentials for IAM users, allowing authenticated third-party users to access data.

To set up STS in AWS, you'll need to:

1. Create a role and attach an IAM policy to it, with the following minimum configuration:

         "Effect": "Allow",
         "Principal": { "AWS": "<IAM user ARN>" },
         "Action": "sts:AssumeRole"

2. Create a user and attach an IAM policy to it, with the following minimum configuration:

         "Effect": "Allow",
         "Resource": "<role ARN>",
         "Action": "sts:AssumeRole"

3. Enter the key, secret, and role ARN in the integration input fields.
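
Before entering the values in the integration form, you can optionally confirm that the user can assume the role. This is a sketch using boto3; all values are placeholders.

Python
    import boto3

    sts = boto3.client(
        'sts',
        aws_access_key_id='<access key id>',
        aws_secret_access_key='<secret access key>')
    response = sts.assume_role(
        RoleArn='<ARN of the role with the S3 permissions>',
        RoleSessionName='dataloop-sts-check')
    # Temporary credentials were issued successfully
    print(response['Credentials']['Expiration'])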


Private Container Registry 

The required information depends on the selected private container registry type:

  1. Google Container Registry (GCP)
    • Server - mandatory
    • Password - mandatory

      Note that the password is the base64-encoded service-account key JSON file, which can be downloaded from GCP in the following format (see the sketch after this list):

      JSON
      {
          "type": "service_account",
          "project_id": "",
          "private_key_id": "",
          "private_key": "",
          "client_email": "",
          "client_id": "",
          "auth_uri": "",
          "token_uri": "",
          "auth_provider_x509_cert_url": "",
          "client_x509_cert_url": ""
      }

  2. Elastic Container Registry (AWS)
    • Account - mandatory
    • Region - mandatory
    • Access key ID - mandatory
    • Secret access key - mandatory

      For a step-by-step guide on how to extract the required parameters from AWS, read here.
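
For the GCR password mentioned in option 1, the sketch below base64-encodes the downloaded service-account key file. It assumes the base64-encoded key JSON is the value expected in the password field; the file name is a placeholder.

Python
    import base64

    # Assumes the GCR password field expects the base64-encoded key JSON
    with open('service-account-key.json', 'rb') as f:
        password = base64.b64encode(f.read()).decode('utf-8')
    print(password)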

Creating Storage Driver

  1. Browse to the "Cloud Storage" page in the "Data Management" section of the left-side navigation menu.
  2. Enter a name for the new storage driver.
  3. Select an integration from the dropdown to use with this driver, or click "Add New Integration" to start adding a new integration (requires an active organization).
  4. Enter the bucket/container name to use.
  5. If you wish to sync a specific folder rather than the entire bucket, enter the folder path. Single-file paths are not supported.
  6. Check the "Allow delete items" option to allow Dataloop to delete items from your storage when they are deleted from the Dataloop dataset.
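
The same driver and dataset can also be created with the dtlpy Python SDK. This is a sketch; the names and IDs are placeholders, and parameter names may vary between SDK versions.

Python
    import dtlpy as dl

    project = dl.projects.get(project_name='My Project')  # placeholder

    # Step 2 - Storage Driver: connect to a specific bucket via the integration
    driver = project.drivers.create(
        name='my-s3-driver',
        driver_type=dl.ExternalStorage.S3,
        integration_id='<integration id>',   # the integration created earlier
        bucket_name='my-dataloop-bucket',
        path='training-images',              # optional: sync only this folder
        allow_external_delete=True)          # the "Allow delete items" option

    # Step 3 - Dataset Connection: a new dataset linked to the driver
    dataset = project.datasets.create(dataset_name='my-external-dataset',
                                      driver=driver)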


Syncing External Storage

Upstream Sync

Syncing data items from your external storage into Dataloop is considered upstream sync. Upstream sync occurs:

  1. Automatically, once - when the storage driver is created.
  2. Manually, when clicking the "Sync" button for a specific dataset from the dataset browser (the Sync button is only visible to users with the Developer role or higher).
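
The manual sync can also be triggered from code. This is a sketch using the dtlpy Python SDK; the dataset ID is a placeholder.

Python
    import dtlpy as dl

    dataset = dl.datasets.get(dataset_id='<dataset id>')
    dataset.sync()  # upstream sync: re-index the items from the external bucket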

Setting Up Automatic Upstream Sync

Dataloop offers code that monitors your bucket for changes and updates the respective datasets accordingly, allowing for automatic upstream sync of your data. The code is available on the Dataloop GitHub. For more information, read here.

Downstream Sync

Syncing changes made in the Dataloop platform (e.g., adding or deleting files in the dataset) back to the original storage is referred to as downstream sync.

To set up downstream sync, you need to:

  1. Grant delete permissions in your AWS/GCS/Azure IAM.
  2. Check the "Allow delete" option in the storage driver.
  3. Check the "Downstream sync" option in the dataset settings.

Cloned and Merged datasets in the Dataloop platform cannot be synced, since they are not directly indexed to the external storage.

Binary Files Management

  • Dataloop does not create copies of your files/binaries - it only indexes them and reads them when access is required for annotation work.

  • By default, Dataloop will NOT delete your binaries - files removed from datasets in the Dataloop system will NOT be removed from your cloud storage. If you wish to allow such deletion and keep your cloud storage synced with changes made on the Dataloop platform, you must explicitly select the "Allow delete" option.


What's Next