Upload Data
Load Data From Cloud Storage
To learn how to connect your cloud storage and sync files to Dataloop using the SDK, see the Data Management Tutorial.
Considerations
- Ensure you have write access, so that thumbnails, modalities, and converted files can be saved to a hidden .dataloop folder in your storage.
- A permission test file is added to your storage folder when the platform validates permissions.
- Annotations and metadata are stored in the Dataloop platform.
- If you delete a file from your external storage, you must either trigger a corresponding file delete in Dataloop or set up upstream sync in advance so that such events are reflected in the dataset.
Setup Process
The setup process for external storage consists of the following steps (an SDK sketch of the same flow follows the list):
- In your external storage account (AWS S3/Azure Blob/GCP GCS/Private container registry), prepare credentials and permissions to be used by Dataloop. See the step-by-step instructions provided for each storage type. Permissions required are:
- List (Mandatory): Allows Dataloop to list all items in the storage.
- Get (Mandatory): Allows Dataloop to read items and perform preprocessing, such as generating thumbnails and extracting item info.
- Put / Write (Mandatory): Allows you to upload your items directly to the external storage from the Dataloop platform.
- Delete: Allows you to delete your items directly from the external storage using the Dataloop platform.
- In your Dataloop Organization, create a new Integration to enter the credentials prepared in the previous step. These are saved in a Vault, and can be used by projects owned by the organization.
- In a specific project, create a new storage driver as follows:
- From the left-side navigation menu, click on the Data Management section and select the Cloud Storage page.
- Click Create Driver. A popup message is displayed.
- Enter a name for the new storage driver.
- Select an Integration from the list to use with the driver.
- If required, click on +Add New Integration to add a new integration. Ensure you have an active Organization.
- Enter a bucket or container name.
- If you need to sync a specific folder, enter the folder path. Single file paths are not supported.
- Enter your Storage Class, if required. By default, the storage class is standard.
- Select your Region from the list.
- Check the **Allow delete items** option to allow Dataloop to delete items from your external storage once they have already been deleted from the Dataloop dataset.
- In the same project, create a new Dataset and configure it for external storage using the already created storage driver.
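If you prefer to script this flow, the same steps can be expressed with the Python SDK (dtlpy). The sketch below is a minimal example assuming an AWS S3 bucket; the project, integration, driver, dataset, and bucket names, the credentials, and the folder path are all placeholders, and the parameters differ slightly for other storage types.

```python
# Minimal sketch, assuming an S3 bucket; all names and credentials are placeholders.
import dtlpy as dl

if dl.token_expired():
    dl.login()

project = dl.projects.get(project_name='my-project')

# 1. Integration - stores the credentials prepared in your cloud account (saved in the vault)
integration = project.integrations.create(
    integrations_type=dl.ExternalStorage.S3,
    name='my-s3-integration',
    options={'key': '<AWS access key ID>', 'secret': '<AWS secret access key>'}
)

# 2. Storage driver - points at a specific bucket (and optionally a folder) via the integration
driver = project.drivers.create(
    name='my-s3-driver',
    driver_type=dl.ExternalStorage.S3,
    integration_id=integration.id,
    bucket_name='my-bucket',
    path='training-data/',        # optional: sync only this folder
    region='eu-west-1',
    allow_external_delete=True    # equivalent to the "Allow delete items" checkbox
)

# 3. Dataset backed by the storage driver
dataset = project.datasets.create(dataset_name='my-external-dataset', driver=driver)
```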

Initial Sync
Once your dataset is set up correctly, an initial sync operation starts automatically. It lists all file items in your storage and indexes them. Successful completion of the process requires that:
- The integration credentials are valid and have sufficient permissions (review the step-by-step instructions for each supported storage type).
- The Sync option was enabled when creating the dataset. Otherwise, you can start the sync process manually from the dataset browser.
- You have sufficient Data-Points in your quota: the number of available Data-Points must match the number of file items you are going to sync.
After starting the sync process, you can review the progress from the notifications area.
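For reference, the manual sync can also be triggered from the SDK; a minimal sketch, where the project and dataset names are placeholders:

```python
# Minimal sketch: re-scan an external-storage dataset from the SDK.
import dtlpy as dl

project = dl.projects.get(project_name='my-project')
dataset = project.datasets.get(dataset_name='my-external-dataset')

# Lists the file items in the connected storage and indexes any new ones
dataset.sync()
```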
Ongoing Sync - Upstream & Downstream
Once you connect your cloud storage to a dataset and complete the initial sync, the dataset reflects your directory structure and file content. Managing files in the dataset (moving between folders, cloning, merging, and so on) acts as an augmented management layer and does not affect the binary files in your cloud storage.
- Empty folders in your storage are not synced and are not displayed in the platform.
- Moving or renaming items on your external storage will duplicate the items in the platform.
- The Dataloop platform saves newly generated files (video thumbnails, .webm files, snapshots, etc.) in your bucket under the "/.dataloop" folder.
- You cannot rename or move items on the Dataloop platform when they are linked to an external source.
- Items can only be cloned from internal storage (i.e., the Dataloop cloud storage) to internal storage or from external storage to the same external storage.
Cloned and merged datasets cannot be synced in the Dataloop platform, because they are not directly indexed to the external storage.
Downstream
Downstream sync is the process of propagating file changes made in the Dataloop platform (for example, adding or deleting files in a dataset) to your external (original) storage.
Downstream sync is always active.
New files: If you add new files directly to the Dataloop dataset rather than through your external storage, Dataloop attempts to write them to your external storage to keep the two in sync.
File deletion: By default, Dataloop does NOT delete your binaries; files deleted from Dataloop datasets are not deleted from your cloud storage. To permit such deletion and have your cloud storage reflect changes made in the Dataloop platform, you must:
- Allow deletion permissions in your IAM for AWS, GCS, and Azure.
- Check the Allow delete option in the storage driver.
Upstream
Upstream sync is the process of updating any file-item changes from your external storage into the Dataloop platform.
Upstream sync can happen in several ways:
Automatically, once: at the time the storage driver is created (assuming the option was enabled when creating the dataset).
Manually: Each time you click Sync for a specific dataset (from the dataset browser; visible only to users with the Developer role and higher), a scan of the files in your storage is triggered to index them into Dataloop. Note that files deleted from the cloud storage might remain as 'ghost' files in your dataset.
Automatic upstream sync: Dataloop provides code that monitors your bucket for changes and updates the respective datasets accordingly, enabling automatic upstream sync whenever file items are modified. For more information, see Dataset Binding with AWS. A simplified sketch of such an event handler is shown below.
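The Dataset Binding with AWS guide contains the complete implementation. Purely as an illustration, a simplified handler reacting to bucket notifications could call the SDK along these lines; the handler shape, the dataset ID, and the choice to run a full re-scan on creation are assumptions, not the official binding code.

```python
# Rough sketch only - NOT the official Dataset Binding implementation.
# It re-scans the dataset when an object is created and deletes the matching
# item when an object is removed, so no 'ghost' files remain.
import dtlpy as dl

DATASET_ID = 'my-dataset-id'  # placeholder

def handle_storage_event(event_name: str, object_key: str):
    dataset = dl.datasets.get(dataset_id=DATASET_ID)
    if event_name.startswith('ObjectCreated'):
        # Heavy-handed but simple: index new files by re-scanning the storage
        dataset.sync()
    elif event_name.startswith('ObjectRemoved'):
        # Remove the corresponding item from the dataset
        dataset.items.delete(filename='/' + object_key)
```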
ETL Pre-Processing
A Created event trigger on the Dataloop platform is activated whenever a file is added to a dataset. By default, all image files uploaded or added to Dataloop are processed by a global image-preprocess service that:
- Generates a thumbnail for each image.
- Extracts information about the file and stores it in the item's metadata entity on the Dataloop platform (see the SDK sketch after this list), including:
- Name
- Size
- Encoding (for example, 7bit)
- Mimetype (for example, image/jpeg)
- Exif (image orientation)
- Height (in pixels)
- Width (in pixels)
- Dimensions
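For illustration, the metadata written by the preprocessing service can be read back through the SDK; a minimal sketch, where the dataset ID and file path are placeholders and the exact set of keys may vary by file type:

```python
# Minimal sketch: inspecting the system metadata written by preprocessing.
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')       # placeholder ID
item = dataset.items.get(filepath='/folder/image_001.jpg')  # placeholder path

system_meta = item.metadata.get('system', {})
print(system_meta.get('mimetype'))                           # e.g. image/jpeg
print(system_meta.get('size'))                               # size in bytes
print(system_meta.get('height'), system_meta.get('width'))   # pixels
```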
Private Preprocessing Service
Dataloop's global preprocessing serves all of its customers from a single service, which is designed to scale in accordance with load while maintaining a reasonable configuration.
Customers who want to be independent of the global preprocessing service, and to avoid delays in production-level projects caused by load generated by other Dataloop customers, can install the service in their own projects so that it runs on their own resources.
To learn more about this option, contact Dataloop.
Additional Preprocessing Services
Users can define a Created trigger in their projects to pass uploaded items to their own functions for customized preprocessing.
To learn how to configure a Created trigger on services, read here.
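As a rough sketch, assuming you have already deployed a FaaS service whose package exposes a function that accepts an item, a Created trigger could be attached with the SDK roughly as follows; the project, service, and trigger names are placeholders.

```python
# Minimal sketch: attaching an Item Created trigger to an existing FaaS service.
import dtlpy as dl

project = dl.projects.get(project_name='my-project')
service = project.services.get(service_name='my-preprocess-service')  # placeholder

trigger = project.triggers.create(
    name='preprocess-on-item-created',
    service_id=service.id,
    function_name='run',                       # function exposed by the service package
    resource=dl.TriggerResource.ITEM,          # fire on item events
    actions=[dl.TriggerAction.CREATED],        # only when an item is created
    execution_mode=dl.TriggerExecutionMode.ONCE
)
```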
User Metadata Upload
Once your items are in a dataset, you can use the SDK or API to add your custom context as user metadata. Examples include:
- Camera number
- Area taken
- LOT/Order number
To learn how to upload items with user-metadata, read here.
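For example, a minimal SDK sketch that uploads an item together with user metadata might look like this; the dataset ID, file path, and metadata keys are placeholders.

```python
# Minimal sketch: uploading an item with custom user metadata.
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # placeholder

item = dataset.items.upload(
    local_path='/local/path/frame_001.jpg',
    item_metadata={'user': {
        'camera_number': 3,
        'area': 'warehouse-east',
        'lot_number': 'LOT-4711'
    }}
)

# Metadata can also be added or updated after upload
item.metadata['user']['reviewed'] = False
item.update()
```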
Upload Files To Dataloop Storage
When using the File system option for datasets, file binaries are uploaded and stored on Dataloop storage (hosted on GCP).
To upload files to your dataset:
- In the Project Overview page, go to the Data-management widget and find the dataset.
- Click the Upload Items icon. Alternatively, in the dataset browser, click the Upload icon.


- Select Upload file or Upload folder from the list, depending on whether you need to upload individual files or entire folders.
A browser crash or halt is possible when attempting to upload files larger than 1 GB or more than 100 folders, since browsers are not designed to manage such upload processes. The precise limits depend on your computer's configuration and Internet connection. To upload a large number of files and folders, use the SDK as detailed in the Upload and Manage Data & Metadata tutorial (a minimal sketch follows).
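For instance, a minimal sketch of such an SDK upload, where the project, dataset, and paths are placeholders:

```python
# Minimal sketch: uploading a local folder through the SDK instead of the browser.
import dtlpy as dl

project = dl.projects.get(project_name='my-project')
dataset = project.datasets.get(dataset_name='my-dataset')

# Uploads all files under the local folder, preserving its structure
dataset.items.upload(
    local_path='/data/raw_images',   # local folder to upload
    remote_path='/raw_images'        # destination folder in the dataset
)
```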
Linked Items / URL Items / Bulk Upload
Linked items enable using a file in the Dataloop platform without storing it on Dataloop servers or building an integration with cloud storage. Linked items are facilitated using JSON files containing the URL as a pointer to the binary file that is stored on the customer's storage.
Use links in one of the following scenarios:
- To keep binaries out of the Dataloop platform
- To duplicate items without actually duplicating their binaries
- To reference public images by URL
Linked Item (JSON)
- Create a JSON file that contains a URL to an item from your bucket.
- Upload the JSON file to the platform.
The JSON will appear as a thumbnail of the original item with a link symbol. Each time you click to open the file, the Dataloop platform will fetch the stream of the original item.
The video duration is not available for linked items, only for local items.
```json
{
  "type": "link",
  "shebang": "dataloop",
  "metadata": {
    "dltype": "link",
    "linkInfo": {
      "type": "url",
      "ref": "https://www.example.com",
      "mimetype": "video"
    }
  }
}
```
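As an illustration, such a JSON file can be generated and uploaded with the SDK along these lines; the URL, file name, dataset ID, and remote folder are placeholders.

```python
# Minimal sketch: writing a link-item JSON locally and uploading it to a dataset.
import json
import dtlpy as dl

link_item = {
    "type": "link",
    "shebang": "dataloop",
    "metadata": {
        "dltype": "link",
        "linkInfo": {
            "type": "url",
            "ref": "https://www.example.com/videos/clip_001.mp4",  # placeholder URL
            "mimetype": "video"
        }
    }
}

with open('clip_001.json', 'w') as f:
    json.dump(link_item, f, indent=2)

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # placeholder
dataset.items.upload(local_path='clip_001.json', remote_path='/linked-items')
```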
Multiple Linked Items
To connect linked items to a dataset in bulk, the platform supports importing them from a CSV file.
Upload a CSV file that contains a list of URLs for items in your bucket. Once the CSV is uploaded, the platform creates a folder named after the CSV file, generates a JSON file in that folder for each linked item, and presents the items in your dataset.
CSV files can also be used to upload file binaries in bulk; see the upload action in the table below. A sample CSV is shown after the table.
| Field | Required | Description |
|---|---|---|
| ID | Yes | Optional file name, used in case a name is not provided |
| URL | Yes | Link to the item. The link must be public for the browser to display it. |
| image_bytes | Yes | Mandatory if URL is not provided |
| name | No | Item name to use in the dataset |
| action | No | `link` - generate the link JSON; `upload` - fetch the file and upload it into Dataloop |
| mimetype | No | `image/jpg` is the default mime type; `video` is used for items that are videos |
| item_metadata | No | Update the metadata of the item |
| item_description | No | Add text to the description root property of the item |
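For illustration only, a small CSV using the link and upload actions could look like the following; the URLs and names are placeholders, and the exact formatting accepted for the optional columns may differ.

```
ID,URL,name,action,mimetype
1,https://www.example.com/images/cat_001.jpg,cat_001.jpg,link,image/jpg
2,https://www.example.com/videos/clip_001.mp4,clip_001.mp4,link,video
3,https://www.example.com/images/dog_002.jpg,dog_002.jpg,upload,image/jpg
```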