Unleash Unsupervised Learning

Unleashing the Unsupervised

In recent years, the world has gotten “smarter” as the AI field has made tremendous progress. To keep up with consumer expectations, companies are increasingly adopting machine learning and deep learning algorithms and building systems that learn from large amounts of labeled data.

Because of its simplicity, supervised learning, the task of training predictive models using data points with known outcomes, is generally the preferred approach in the industry.
It has a proven track record for training models that perform extremely well on the task they were trained to do.

So Where Is the Bottleneck?

Supervised learning requires accurately labeled data, the collection of which is often labor intensive. There are also some tasks for which there’s simply not enough labeled data. Moreover, there’s a limit to how far the field of AI can go using supervised learning alone.

If we could understand our use case data more deeply than limited labeled training sets allow, and learn more general features that transfer to many tasks, AI systems could be more useful, more comprehensive, and much more “intelligent.”

Setting the Supervision Loose

Unsupervised learning uses unlabeled training samples to model basic characteristics of an AI system’s input and discover hidden patterns in the data. These characteristics can be a useful starting point for supervised learning, and they can be used to extrapolate what is learned from labeled training sets.

What if we could get labels for unlabeled data “free of charge” and train on unlabeled datasets in a supervised manner? We can achieve this by framing a supervised learning task in a special form: predict one subset of the information from the rest. In this way, everything the task needs, both inputs and labels, comes from the data itself. This method is known as self-supervised learning.

A summary of how self-supervised learning tasks can be constructed (source: LeCun’s talk)
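The construction described above, hiding part of the data and predicting it from the rest, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `make_masked_pairs` and the `<MASK>` placeholder are assumptions made for this example, and a real system would feed the resulting pairs to a model.

```python
def make_masked_pairs(sequence, mask_token="<MASK>"):
    """Turn one unlabeled sequence into (input, label) training pairs.

    Each pair hides a single element of the sequence; the hidden
    element becomes the label the model would be asked to predict.
    """
    pairs = []
    for i, target in enumerate(sequence):
        masked = list(sequence)   # copy so the original stays untouched
        masked[i] = mask_token    # hide one element
        pairs.append((masked, target))
    return pairs


# Hypothetical toy data: no human labeling is involved at any point.
tokens = ["the", "cat", "sat"]
for masked_input, label in make_masked_pairs(tokens):
    print(masked_input, "->", label)
```

Every label here is taken from the data itself, which is what makes the task supervised in form but unsupervised in cost.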

Self-supervised learning can be considered a subset of unsupervised learning. More precisely, classic unsupervised learning focuses on detecting patterns in the data (through clustering, community discovery, or anomaly detection, for example), while self-supervised learning aims at recovering missing parts of the input, which keeps it within the supervised training paradigm.

How Does It Actually Work?

Self-supervised learning obtains supervisory signals from the data itself, often revealing the underlying structure in the data. Rather than labels, we are interested in the learned intermediate representation of the data, with the expectation that this representation can carry good semantic or structural meanings and can be beneficial to a variety of practical downstream tasks.

This approach can be described as “A machine that predicts any parts of its input for any observed part.” The “labels” are obtained from the data itself through a semiautomatic process. Here, the “parts” could be transformed, distorted, or corrupted fragments of the input data.
To put it simply, the AI system learns to “recover” all of its original input, parts of it, or just some of its features.

Self-supervised learning is one of the most promising ways to build background knowledge and approximate a form of “common sense” in AI systems.
Systems pretrained in this manner can perform considerably better than systems trained only in a supervised manner.

How Can We Benefit From Self-Supervised Learning?


Supervised learning requires large labeled datasets to build proper models and make accurate predictions. Self-supervised learning can automate the labeling step and scale to massive amounts of data, because its learning objectives are designed to get supervision from the data itself.

Harness the “Human Mind”

Supervised models depend on human intervention, in the form of labels, to perform correctly. However, that intervention isn’t always available. Unlike machines, humans can think through the consequences of their actions before taking them, and they don’t have to try every action to decide what to do.

Machines have the potential to work in the same way. Once they have learned the underlying structure of the data, they can use this “knowledge” much as they would use human guidance. In that manner, self-supervised learning generates meaningful representations automatically and enables machines to arrive at solutions for a large variety of tasks and scenarios without human intervention.


Recently, we have witnessed self-supervised learning-based models improving computer vision and natural language processing capabilities. The following are some of its major applications in computer vision:

Image Based


Rotation of an image is an interesting way to modify an input image while the semantic content stays unchanged. In order to identify the same image with different rotations, the model has to learn to recognize semantic concepts, rather than local patterns.

Incorporating rotation invariance into the feature learning framework. (source: Feng et al. (2019))
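One common way to turn rotation into a pretext task is to rotate each image by a random multiple of 90 degrees and ask the model to predict which rotation was applied. The sketch below only shows how such training examples could be generated; it is an assumption made for illustration, not necessarily the method of Feng et al., and the helper names `rotate90` and `make_rotation_example` are hypothetical.

```python
import random


def rotate90(image):
    """Rotate a 2-D grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_example(image, rng=random):
    """Return (rotated_image, k), where k in {0, 1, 2, 3} counts 90-degree turns.

    The rotation index k is the "free" label: it is known because we
    applied the rotation ourselves.
    """
    k = rng.randrange(4)
    rotated = [list(row) for row in image]
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k


# Hypothetical 2x2 "image" of pixel intensities.
image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image)
print(rotated, "rotation label:", label)
```

A classifier trained on these pairs cannot rely on local texture alone; it needs some notion of what the depicted object looks like when upright.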

Contrastive Learning

Contrastive learning methods train a model to place an image and its slightly augmented version close together in latent space, and away from other images, so that it “learns to compare” through a noise contrastive estimation (NCE) objective.

SimCLR is a framework for contrastive learning of visual representations. (source: Chen et al. (2020))
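To make the “learn to compare” idea concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. It assumes the embeddings have already been produced by some encoder, uses cosine similarity with a temperature of 0.1 as illustrative choices, and is not the exact loss used by SimCLR; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_sum_exp - logits[0]


# Hypothetical 2-D embeddings: the augmented view lies near the anchor.
anchor = [1.0, 0.0]
augmented_view = [0.9, 0.1]
other_images = [[0.0, 1.0], [-1.0, 0.0]]
print(info_nce_loss(anchor, augmented_view, other_images))
```

The loss is small when the anchor is more similar to its augmented view than to any other image, and grows when an unrelated image is the closest match.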

Generative Modeling

The pretext task in generative modeling is to reconstruct the original input while learning a meaningful latent representation.

A regressor for a low-dimensional latent face representation. (source: Tewari et al. (2018))
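The reconstruction pretext task can be illustrated with a toy linear autoencoder that squeezes 2-D points through a 1-D latent value and tries to rebuild them. This is a deliberately tiny sketch under stated assumptions (hand-rolled gradient descent, hypothetical data lying on a line, illustrative learning rate and epoch count) and is unrelated to the face model of Tewari et al.

```python
def reconstruct(x, w, v):
    """Encode a 2-D point to a 1-D latent z, then decode back to 2-D."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # encoder
    return z, [vi * z for vi in v]            # decoder


def total_loss(data, w, v):
    """Sum of squared reconstruction errors over the dataset."""
    loss = 0.0
    for x in data:
        _, x_hat = reconstruct(x, w, v)
        loss += sum((h - xi) ** 2 for h, xi in zip(x_hat, x))
    return loss


def train(data, epochs=500, lr=0.02):
    """Fit encoder weights w and decoder weights v by gradient descent."""
    w, v = [0.1, 0.1], [0.1, 0.1]
    for _ in range(epochs):
        for x in data:
            z, x_hat = reconstruct(x, w, v)
            err = [h - xi for h, xi in zip(x_hat, x)]
            # Gradient of the squared error with respect to z (uses old v).
            grad_z = 2 * sum(e * vi for e, vi in zip(err, v))
            v = [vi - lr * 2 * e * z for vi, e in zip(v, err)]
            w = [wi - lr * grad_z * xi for wi, xi in zip(w, x)]
    return w, v


# Hypothetical unlabeled data: points on the line y = 2x.
data = [[t, 2 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
w, v = train(data)
print("loss before:", total_loss(data, [0.1, 0.1], [0.1, 0.1]))
print("loss after: ", total_loss(data, w, v))
```

The input serves as its own target, and the single latent value ends up encoding position along the line, which is the kind of compact representation the pretext task is meant to produce.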

Video Based


Self-supervised learning can be used for colorization, resulting in a rich representation that can be used for video segmentation and unlabeled visual region tracking, without extra fine-tuning.

Use video colorization to track object segmentation and human pose in time. (source: Vondrick et al. (2018))
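One way to frame colorization as a pretext task is to have the model color a target frame by copying colors from a reference frame, weighting each reference pixel by how similar its features are to the target pixel's features. The sketch below assumes the per-pixel feature vectors are already given (in practice they would be learned) and uses hypothetical names throughout.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_colors(target_feats, ref_feats, ref_colors):
    """Predict each target pixel's color as a similarity-weighted mix
    of the reference frame's colors."""
    channels = len(ref_colors[0])
    predicted = []
    for f in target_feats:
        # Dot-product similarity between this pixel and every reference pixel.
        weights = softmax([sum(a * b for a, b in zip(f, r)) for r in ref_feats])
        color = [
            sum(w * c[ch] for w, c in zip(weights, ref_colors))
            for ch in range(channels)
        ]
        predicted.append(color)
    return predicted


# Hypothetical features and RGB colors for two reference pixels.
ref_feats = [[10.0, 0.0], [0.0, 10.0]]
ref_colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
target_feats = [[0.0, 10.0]]  # resembles the second reference pixel
print(copy_colors(target_feats, ref_feats, ref_colors))
```

Because the same similarity weights can carry any per-pixel value, swapping the colors for segmentation labels would propagate a mask instead, which is how a colorization model can be reused for tracking.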


The movement of an object can be traced through a sequence of video frames. Visual representations learned for the same object in nearby frames should be close to each other in the latent feature space, so tracking can be learned in a self-supervised way.

The dense tracking system: MAST - Memory Augmented Self-Supervised Tracker. (source: Lai et al. (2020))
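If features of the same object stay close across nearby frames, tracking reduces to matching each feature in one frame to its nearest neighbor in the next. The following is a minimal sketch of that matching step only, with hand-written feature vectors standing in for learned ones; it is not the MAST architecture, and the function names are hypothetical.

```python
import math


def nearest_match(query, candidates):
    """Index of the candidate feature vector closest to `query` (Euclidean)."""
    distances = [math.dist(query, c) for c in candidates]
    return distances.index(min(distances))


def track(prev_feats, next_feats):
    """For each object feature in the previous frame, find its match
    in the next frame."""
    return [nearest_match(f, next_feats) for f in prev_feats]


# Hypothetical features for two objects; their order differs between frames.
prev_frame = [[0.0, 0.0], [5.0, 5.0]]
next_frame = [[5.1, 4.9], [0.2, -0.1]]
print(track(prev_frame, next_frame))
```

Here the first object in the previous frame is matched to index 1 in the next frame and the second to index 0, because those are the closest features.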

Summed Up

The progress of self-supervision in recent years is encouraging, and it is a significant step towards human-level intelligence. As we understand it better, self-supervision should help in handling real-world scenarios and bring us closer to AI systems that think more like humans.

Do you want to reveal the underlying structure of your data with Dataloop?
Find out how. Let’s talk.
