What is a Dataset?
Dataset
A dataset is a collection of items (files), their metadata, and their annotations. It is the basic unit for managing training sets.
Dataloop distinguishes between two types of datasets:
- Master dataset
- Cloned dataset
A dataset is the basic data storage and management unit in Dataloop, similar to the storage buckets found in AWS, Azure, and GCP.
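To make the definition concrete, here is a minimal conceptual sketch of a dataset as a collection of items, each carrying metadata and annotations. It is not the Dataloop SDK: the `Item` and `Dataset` classes, their fields, and the sample values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One file in a dataset, with its metadata and annotations (hypothetical model)."""
    name: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


@dataclass
class Dataset:
    """A named collection of items, loosely analogous to a storage bucket."""
    name: str
    items: dict = field(default_factory=dict)  # item name -> Item

    def add(self, item: Item) -> None:
        self.items[item.name] = item


# Build a tiny dataset with one annotated image.
training_set = Dataset(name="street-scenes")
training_set.add(Item(
    name="img_001.jpg",
    metadata={"width": 1920, "height": 1080},
    annotations=[{"label": "car", "box": [10, 20, 200, 180]}],
))
```

The point of the sketch is only that the file, its metadata, and its annotations travel together as one item inside the dataset.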
Master Dataset
A Master Dataset is a dataset that holds the root storage of its items, so an item action such as delete can remove the item completely from the Dataloop platform.
In addition, Master Datasets let you maintain a file-system-like structure, i.e. folders, subfolders, etc.
Master Datasets are usually used to manage raw data from direct uploads.
Master Datasets are the default dataset type in Dataloop.
New users can skip this section
Cloned Dataset
A Cloned Dataset is a list of pointers that function as virtual items: cloning or copying does not replicate the binaries in the underlying storage.
Cloned datasets do not support a file system structure; they are a flat list of items.
Cloned datasets are mainly used for:
- Golden training sets management
- Reproducibility (dataset training snapshot)
- Experimentation (creating subsets of different kinds)
- Task/Assignment management
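The difference between the two dataset types can be illustrated with a toy model. This is a conceptual sketch only, not Dataloop's implementation or SDK: `BinaryStore`, `MasterDataset`, and `ClonedDataset` are hypothetical classes, and the item id format is made up. It assumes the master owns the binaries and the clone stores only item ids.

```python
import itertools


class BinaryStore:
    """Stands in for the underlying storage that holds file binaries."""

    def __init__(self):
        self.blobs = {}                  # item id -> bytes
        self._ids = itertools.count(1)   # simple id generator for the sketch

    def put(self, data: bytes) -> str:
        item_id = f"item-{next(self._ids)}"
        self.blobs[item_id] = data
        return item_id


class MasterDataset:
    """Owns the binaries and keeps a folder-like path for each item."""

    def __init__(self, store: BinaryStore):
        self.store = store
        self.paths = {}  # "/folder/file" -> item id

    def upload(self, path: str, data: bytes) -> str:
        self.paths[path] = self.store.put(data)
        return self.paths[path]

    def delete(self, path: str) -> None:
        # Deleting from the master removes the binary itself.
        del self.store.blobs[self.paths.pop(path)]


class ClonedDataset:
    """A flat list of pointers (item ids); no binaries are copied."""

    def __init__(self, store: BinaryStore, item_ids):
        self.store = store
        self.item_ids = list(item_ids)

    def read(self, item_id: str) -> bytes:
        return self.store.blobs[item_id]


store = BinaryStore()
master = MasterDataset(store)
cat = master.upload("/animals/cats/cat.jpg", b"cat-bytes")
dog = master.upload("/animals/dogs/dog.jpg", b"dog-bytes")

# Cloning records pointers only, so the number of stored binaries is unchanged.
snapshot = ClonedDataset(store, [cat, dog])
blobs_after_clone = len(store.blobs)

# Deleting through the master removes the binary from storage.
master.delete("/animals/dogs/dog.jpg")
```

In this model, creating `snapshot` costs no extra storage, which is why pointer-based clones suit snapshots and experimental subsets. What the platform does with a pointer whose binary has been deleted is not covered here.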
Merged Dataset
You can merge multiple datasets into a single one to better organize your data.
Merging cloned datasets brings the annotations from the different datasets together on the same item.
For additional information, please go to the Clone & Merge Datasets page.
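As a rough illustration of what merging cloned datasets means for annotations, the sketch below combines two clones that point at the same item. It is a simplified model, not the platform's merge logic: `merge_clones` is a hypothetical helper, each clone is represented as a plain mapping from item id to annotations, and duplicate or conflicting annotations are not handled.

```python
from collections import defaultdict


def merge_clones(*clones):
    """Combine annotations per item id across clones (hypothetical helper).

    Each clone is a dict mapping item id -> list of annotations.
    """
    merged = defaultdict(list)
    for clone in clones:
        for item_id, annotations in clone.items():
            merged[item_id].extend(annotations)
    return dict(merged)


# Two clones annotated independently; both point at "item-1".
clone_a = {"item-1": [{"label": "car"}], "item-2": [{"label": "tree"}]}
clone_b = {"item-1": [{"label": "pedestrian"}]}

merged = merge_clones(clone_a, clone_b)
```

After the merge, `item-1` carries the annotations from both clones, while `item-2` keeps the single annotation it had.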
New users may skip this section
Data Storage
The Dataloop platform has a flexible storage engine that lets you attach different binary storage devices, such as:
- Cloud storage devices such as GCS, S3, Elastifile, etc.
- File system storage
- Network drives
- Databases, such as MongoDB GridFS
Each storage medium is supported through its drivers and additional drivers are continuously being added.
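One common way to support several storage media behind a single engine is a driver interface with one implementation per medium. The sketch below shows that general pattern only. It does not describe Dataloop's actual drivers: `StorageDriver`, `InMemoryDriver`, `FileSystemDriver`, and `round_trip` are hypothetical names, and the in-memory driver merely stands in for a remote backend.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """Common interface every storage medium implements (hypothetical)."""

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, key: str) -> bytes: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a remote backend; keeps binaries in a dict."""

    def __init__(self):
        self._blobs = {}

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def read(self, key: str) -> bytes:
        return self._blobs[key]


class FileSystemDriver(StorageDriver):
    """Stores binaries as files under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def round_trip(driver: StorageDriver, key: str, data: bytes) -> bytes:
    """Caller code depends only on the interface, not on the medium."""
    driver.write(key, data)
    return driver.read(key)


with tempfile.TemporaryDirectory() as tmp:
    results = [
        round_trip(driver, "images/cat.jpg", b"cat-bytes")
        for driver in (InMemoryDriver(), FileSystemDriver(tmp))
    ]
```

Because `round_trip` only uses the interface, adding a new medium in this design means writing one more driver class without touching the calling code.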
For more information, see Storage.