- 21 Sep 2023
- Updated On 21 Sep 2023
A dataset is a collection of Items (files), along with their metadata and annotations. It can be a file-system-like structure with folders and subfolders at any level. A dataset is mapped to a Driver that derives from an Integration, to contain items synced from external cloud storage. Cloning and merging are examples of dataset versioning operations.
Data versioning gives rise to several types of Datasets:
- Master: Original dataset that manages the actual binaries.
- Clone: Contains pointers to original files, enabling management of virtual items that do not replicate the binaries of the underlying storage once cloned or copied. When you clone a dataset, you can decide whether the new copy will contain metadata and annotations created over the original.
- Merge: Multiple datasets can be merged into one, which enables multiple annotations to be merged onto the same item.
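The difference between a master and a clone can be pictured with a small model. The sketch below is illustrative only and is not the platform's implementation: `Item`, `binary_ref`, and `clone_items` are hypothetical names, and the two flags mirror the choice of whether the clone carries over metadata and annotations.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical item record: it points at a binary instead of holding the bytes."""
    binary_ref: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


def clone_items(items: List[Item], with_metadata: bool = True,
                with_annotations: bool = True) -> List[Item]:
    """Return new item records that reuse the original binary references."""
    return [
        Item(
            binary_ref=item.binary_ref,  # same pointer, no binary is duplicated
            metadata=copy.deepcopy(item.metadata) if with_metadata else {},
            annotations=copy.deepcopy(item.annotations) if with_annotations else [],
        )
        for item in items
    ]


# One master item, cloned once with its context and once without it.
master = [Item("bucket/cat.jpg", {"split": "train"}, ["box:cat"])]
full_clone = clone_items(master)
bare_clone = clone_items(master, with_metadata=False, with_annotations=False)
```

Both clones point at the same binary as the master item. Only the metadata and annotations are copied, and only when requested.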
The Dataloop platform has a flexible storage engine, which enables you to attach different binary storage providers, such as:
- Cloud storage services (External):
Connect your data to the Dataloop storage system without copying it to have a single point of truth for your files and comply with various regulations.
- Dataloop's File System (Internal).
Limitations & Considerations
- Empty folders synced from external storage will not be shown in the dataset.
- Moving or renaming files in the external storage will result in new instances (duplications) on the next sync. To avoid this, do not work directly on files in the external storage, or set up 'upstream sync' so that such changes are reflected in the dataset.
- New files generated by the Dataloop platform are saved inside the bucket used in the storage-driver, under the folder "/.dataloop" (including video thumbnails, .webm files, snapshots, etc.).
- Items on the Dataloop platform cannot be renamed or moved when originating from an external storage.
- Items can only be cloned from internal storage (i.e., Dataloop's File System) to internal storage, or from external storage to the same external storage.
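The cloning rule in the last item can be expressed as a small check. This is a minimal sketch for reasoning about the rule, not an SDK call: `Storage`, `driver`, and `can_clone` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Storage:
    kind: str                     # "internal" or "external"
    driver: Optional[str] = None  # hypothetical identifier of the external storage


def can_clone(source: Storage, target: Storage) -> bool:
    """Allow internal -> internal, or external -> the same external storage."""
    if source.kind == "internal" and target.kind == "internal":
        return True
    if source.kind == "external" and target.kind == "external":
        return source.driver is not None and source.driver == target.driver
    return False


internal = Storage("internal")
bucket_a = Storage("external", driver="bucket-a")
bucket_b = Storage("external", driver="bucket-b")
```

For example, `can_clone(bucket_a, bucket_a)` is allowed, while `can_clone(bucket_a, bucket_b)` and `can_clone(internal, bucket_a)` are not.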
Dataloop can set specific Datasets as Read-only. For more information, please contact us.
Datasets allow users to organize files in a nested folder structure. The folder actions supported in the platform, via the user interface and the SDK/API, are:
- Create folder
- Move item to folder (single or bulk)
- Clone item(s) to folder
- Delete folder
The Datasets page provides access to all Datasets in the project. Datasets are listed in a customizable table:
- Show/hide standard columns according to fields used.
- Add custom columns to better manage datasets.
Custom Dataset Fields
You can add your own context to Datasets to manage them in your projects according to your needs. Context is added as user metadata in the Dataset entity (any metadata field outside the System area). These fields can then be shown as columns in the Datasets page, presenting the information and context and allowing Datasets to be sorted and filtered by these fields.
Add Custom Metadata
Adding Metadata to any entity, including Datasets, is done using the 'Update' function via SDK. For a tutorial about uploading items with metadata, read here.
# Example
dataset.metadata["MyBoolean"] = True
dataset.metadata["Mycontext"] = "Blue"
dataset.update()
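The same update pattern can be wrapped in a small helper. This is a sketch under stated assumptions: `set_user_metadata` and `StandInDataset` are hypothetical names, the System area is assumed to live under the `"system"` key, and the stand-in class only mimics the `metadata` attribute and `update()` method used in the example so that the code runs without the SDK or a login.

```python
from typing import Any, Dict


def set_user_metadata(dataset, fields: Dict[str, Any]):
    """Write user fields into dataset.metadata and save them with update()."""
    if "system" in fields:
        # Assumption: the System area is stored under the "system" key.
        raise ValueError("user context must stay outside the System area")
    if dataset.metadata is None:
        dataset.metadata = {}
    dataset.metadata.update(fields)
    return dataset.update()


class StandInDataset:
    """Local stand-in for a Dataset entity; it records how often update() ran."""

    def __init__(self):
        self.metadata = {"system": {}}
        self.update_calls = 0

    def update(self):
        self.update_calls += 1
        return self


dataset = StandInDataset()
set_user_metadata(dataset, {"MyBoolean": True, "Mycontext": "Blue"})
```

With a real Dataset entity in place of the stand-in, the helper should behave the same way, since it relies only on `metadata` and `update()`.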
Expose Metadata As Dataset Table Columns
The Datasets page lists all the Datasets that exist in the project. The table includes default columns such as dataset name, number of items, % of annotated items, and more.
To add and present columns with your custom context (metadata fields):
- From the project overview (dashboard), go to the project settings.
- Click Dataset columns.
- Click Update settings.
- Click Add column.
- Enter the required information.
- Name - a general name for this column (not seen outside the project-settings).
- Label - the column header in the Datasets page.
- Field - the Metadata field to map to this column.
- Set the desired settings as needed.
- Link - the field value will be presented as a URL opened in a new tab. Use that if your values are URLs.
- Resizeable - check this option to enable the column to be resized. Use this option if you have long values that need resizing to be seen in full.
- Sortable - enable this option to allow sorting the table by this value by clicking the column header.
- Click Apply.
Sort and Filter Datasets Based On Metadata
After completing the above steps, the Datasets table in the Datasets page will show your column and the data you populated there.
- Refresh the page to reflect any new data added via SDK.
- Use the search box to find datasets in which the search term appears in any of the columns added to the table.
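The search and sort behavior of the table can be approximated locally, for example when preparing metadata values before exposing them as columns. The sketch below is not how the platform implements the table: `search_datasets` and `sort_datasets` are hypothetical helpers, the datasets are plain stand-in objects with a `metadata` dictionary, and values within one column are assumed to share a comparable type.

```python
from types import SimpleNamespace
from typing import List


def search_datasets(datasets: List, term: str, columns: List[str]) -> List:
    """Keep datasets where the term appears in any of the given metadata columns."""
    needle = term.lower()
    return [
        d for d in datasets
        if any(needle in str(d.metadata.get(col, "")).lower() for col in columns)
    ]


def sort_datasets(datasets: List, column: str, reverse: bool = False) -> List:
    """Sort by one metadata column; datasets without the field are listed last."""
    present = [d for d in datasets if column in d.metadata]
    missing = [d for d in datasets if column not in d.metadata]
    return sorted(present, key=lambda d: d.metadata[column], reverse=reverse) + missing


# Stand-in datasets carrying the custom field from the earlier example.
sample = [
    SimpleNamespace(name="a", metadata={"Mycontext": "Blue"}),
    SimpleNamespace(name="b", metadata={"Mycontext": "Red"}),
    SimpleNamespace(name="c", metadata={}),
]
matches = search_datasets(sample, "blu", ["Mycontext"])
ordered = sort_datasets(sample, "Mycontext", reverse=True)
```

Here the search is a case-insensitive substring match, which is a design choice of this sketch rather than a documented property of the search box.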