In my last blog post The Curse of Dimensionality: Behind the Scenes, we discussed the curse of dimensionality and demonstrated how having too many features could be harmful to machine learning models. We mentioned that dimensionality reduction represents a trade-off between a more robust model and lower interpretability.

The curse of dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes or features that exist in it. A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high dimensional data. Some of the difficulties that come with high dimensional data manifest during analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models.

In this second part of this series, we will dig into dimensionality reduction and explore practical algorithms that tackle the curse of dimensionality. Dimensional reduction techniques are very useful to transform sparse features into dense features. Furthermore, dimensionality reduction is also used for ‘cleaning’ the data and for feature extraction.

## Mitigating Curse of Dimensionality

To mitigate the problems associated with high dimensional data a suite of techniques generally referred to as ‘dimensional reduction techniques’ are used.

Dimensionality reduction is the task of reducing the number of features in a dataset. As discussed, **the higher the number of features, the more difficult it is to model them**. Additionally, some of these features can be quite redundant, adding noise to the dataset and it makes no sense to include them in the training data. This is where the feature space needs to be reduced.

The process of dimensionality reduction essentially transforms data from a high-dimensional feature space to a low-dimensional feature space. Simultaneously, it is also important that meaningful properties present in the data are not lost during the transformation.

Dimensionality reduction is commonly used in data visualization to understand and interpret the data, and in machine learning or deep learning techniques to simplify the task at hand.

## Dimensionality Reduction Techniques

Let’s present and discuss some of the most common techniques for dimensionality reduction. It is worth mentioning that these algorithms are supposed to be used based on your task and data. For instance, if the nature of your data is linear then use decomposition methods otherwise use manifold learning techniques.

## Decomposition Algorithms

#### Principal Component Analysis (PCA)

PCA is a dimensionality reduction method to find lower-dimensional space by preserving the variance as measured in the high dimensional input space. It is an unsupervised method for dimensionality reduction.

PCA transformations are linear transformations. It involves the process of finding the **principal components**, which is the decomposition of the feature matrix into eigenvectors. This means that PCA will not be effective when the distribution of the dataset is non-linear.

It is very important to choose the right hyperplane so that when the data is projected onto it, it has the maximum amount of information about how the original data is distributed.

In simple words, the idea of PCA is to reduce the number of variables of a data set, while **preserving as much information as possible**.

#### Singular Value Decomposition (SVD)

SVD is a factorization method of a real or complex matrix. It is efficient when working with a **sparse dataset**; a dataset having a lot of zero entries. This type of dataset is usually found in the recommender systems, rating, and reviews dataset, etc.

The idea of SVD is that every matrix of shape nXp factorizes into A = USVT, where U is the orthogonal matrix, S is a diagonal matrix and VT is also an orthogonal matrix.

The advantage of SVD is that the orthogonal matrices **capture the structure** of the original matrix A which means that their properties do not change when multiplied by other numbers. This can help us approximate A.

Eventually, we can use SVD to calculate a projection of a dataset and select a number of dimensions or principal components of the projection to use as input to a model.

#### Non-negative Matrix Factorization (NMF)

NMF is an unsupervised machine learning algorithm. When a non-negative input matrix X of dimension mxn is given to the algorithm, it is decomposed into the product of two non-negative matrices W and H.

From the equation above you can see that to **factorize the matrix**, **we need to minimize the distance**. The most widely used distance function is the squared Frobenius norm; this is an extension of the Euclidean norm to matrices.

In this, each data point that is represented as a column in A, can be approximated by an additive combination of the non-negative vectors, which are represented as columns in W.

It is also worth noting that this problem is not solvable in general which is why it is approximated. As it turns out, NMF is good for parts-based representation of the dataset i.e. **NMF provides an efficient, distributed representation**, and can aid in the discovery of the structure of interest within the data.

Therefore, by using NMF we are able to obtain factorized matrices having** significantly lower dimensions** than those of the product matrix.

## Discriminant Analysis

#### Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern classification. The goal is to project a dataset onto a lower-dimensional space with **good class-separability** in order to avoid overfitting and also reduce computational costs.

The general approach is very similar to PCA, rather than finding the component axes that maximize the variance of our data, we are additionally interested in the axes that **maximize the separation** between multiple classes.

**LDA can be computed using the following 5 steps:**

- Compute the d-dimensional mean vectors for the different classes from the dataset.
- Compute the scatter matrices (in-between-class and within-class scatter matrices). the Scatter matrix is used to make estimates of the covariance matrix. This is done when the covariance matrix is difficult to calculate or joint variability of two random variables is difficult to calculate.
- Compute the eigenvectors (e1, e2, e3…ed) and corresponding eigenvalues (λ1,λ2,…,λd) for the scatter matrices.
- Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix W (where every column represents an eigenvector).
- Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication: Y=X×W (where X is an n×d dimensional matrix representing the n samples, and y are the transformed n×k-dimensional samples in the new subspace).

## Manifold Learning

So far, we have discussed approaches that only involved linear transformation. But what do we do when we have a non-linear dataset? Manifold learning is a type of unsupervised learning that seeks to perform dimensionality reduction of non-linear data.

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a dimensionality reduction technique well suited for data visualization. Unlike PCA which simply maximizes the variance, **t-SNE minimizes the divergence between two distributions**. Essentially, it recreates the distribution of a high-dimensional space in a low-dimensional space rather than maximizing variance or even using a kernel trick.

**We can get a high-level understanding of t-SNE in three simple steps:**

- It first creates a probability distribution for the high-dimensional samples.
- Then, it defines a similar distribution for the points in the low-dimensional embedding.
- Finally, it tries to minimize the KL-divergence between the two distributions.

As you can see above, t-SNE clusters the data beautifully. Compared to PCA, t-SNE performs better than nonlinear data. The drawback with t-SNE is that when the data is larger it **consumes a lot of time**. Therefore, it is better to perform PCA followed by t-SNE, or use UMAP.

**UMAP**

t-SNE works very well on larger datasets but it also has its limitations, such as loss of large-scale information, slow computation time, and inability to meaningfully represent very large datasets.

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can preserve as much of the local and more of the **global data structure** as compared to t-SNE, with a shorter runtime. UMAP maps nearby points on the manifold to nearby points in the low dimensional representation, and does the same for faraway points. This method uses the concept of k-nearest neighbor and optimizes the results using stochastic gradient descent.

It first calculates the distance between the points in high dimensional space, projects them onto the low dimensional space, and calculates the distance between points in this low dimensional space. It then uses Stochastic Gradient Descent to **minimize the difference between these distances**.

Some of the key advantages of UMAP are:

- It can handle
**large datasets**and high-dimensional data without too much difficulty. - It combines the
**power of visualization**with the ability to reduce the dimensions of the data. - Along with preserving the local structure, it also
**preserves the global structure**of the data.

**Locally Linear Embedding (LLE)**

The LLE algorithm is an unsupervised method for dimensionality reduction. It tries to reduce n-dimensions while trying to **preserve the geometric **features of the original non-linear feature structure.

For example, in the illustration below, we cast the structure of the swiss roll into a lower-dimensional plane, while maintaining its geometric structure.

In short, if we have D dimensions for data X1, we try to reduce X1 to X2 in a feature space with d dimensions.

LLE first finds the **k-nearest neighbors** of the points. Then, it approximates each data vector as a** weighted linear combination** of its k-nearest neighbors. Finally, it computes the weights that best reconstruct the vectors from their neighbors, then produces the low-dimensional vectors best reconstructed by these weights. LLE optimizes relatively fast but fails on noisy data.

## Autoencoders

An autoencoder can be defined as a neural network whose primary purpose is to **learn the** **underlying manifold** or the feature space in the dataset. An autoencoder tries to reconstruct the inputs at the outputs. Unlike other nonlinear dimension reduction methods, the autoencoders do not strive to preserve a single property like distance or topology.

An autoencoder generally consists of two parts: an encoder that transforms the input to a hidden code and a decoder that reconstructs the input from hidden code.

One might wonder, how does feature learning or dimension reduction happen if the end result is the same as the input? The assumption behind autoencoders is that the transformation **“input –> hidden –> input“ **will help us learn important properties of the dataset.

## Final Thoughts

Dealing with thousands and millions of features is a must-have skill for any data scientist. The amount of data we are generating each day is unprecedented and we need to find different ways to figure out how to use it. Dimensionality reduction is a very useful way to do this and has worked wonders on multiple tasks.

When trying to decide on which dimension reduction technique to use, It is considered to be a good practice to first visualize your data and then decide which method to use.

It is not difficult to fall into the trap of considering one technique better than the other, at the end of the day there is no way to map high-dimensional data into low dimensions while preserving the whole structure at the same time, there is always a trade-off of qualities and one technique will have to be compared to the other.