Imagine yourself being in the middle of telling someone a long story or struggling through an explanation of something complicated when the other person looks at you and asks, “What’s the point?”
We can relate to that person’s needs, all they really want is that we deliver the information quickly and to the point. This problem is the essence of what’s known as “the curse of dimensionality”.
We live in a world of too much information. The same holds true in machine learning. There is almost always a huge set of potential features we can use to make our prediction. When confronted with a massive amount of data that has a high number of characteristics, we can use dimensionality reduction algorithms to make the data “get to the point”.
High Dimensional Data and the Problem it Holds
“The curse of dimensionality” sounds like an intimidating phrase but what it actually refers to is when your data has too many features.
Many texts on statistics, machine learning, and computer science reference the “curse of dimensionality”, which is attributed to the applied mathematician, Richard Bellman. This phrase refers to the phenomenon that increasing data dimension exponentially increases the volume of space that data can occupy. Accordingly, in high-dimensional spaces, available data becomes sparse.
Let’s set an example from our world. You can think about an input image that is being fed into a deep neural network, that each of its layers extracts features in different scales so that this image could be eventually represented at some point as a 2048 sized vector of features. Which of these features are the most important? What features contribute more to the purpose underlying our data structure? Luckily, neural networks are built in a manner that important features can be extracted from these high-dimensional spaces.
As much as we would like to, we can’t use all of the features. We just don’t have enough observations. Using thousands of features to fit a model when we have significantly fewer observations of the target variable would be a bad idea. We run the risk of massively overfitting our model, this would generally result in a terrible out-of-sample performance in the real-world scenario.
When we have too many features, observations also become harder to cluster. Too many dimensions cause every observation in your dataset to appear equidistant from all the others. And because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a major issue. If the distances are all approximately equal, then all the observations appear equally alike (or equally different), and no meaningful clusters could be formed.
Without domain expertise and modeling experience, it can be hard to decide which features are truly worth keeping. This is where dimensionality reduction algorithms come into place. Helping us transform our set of thousands of features into an ideal set of features.
The Ideal Set of Features
If high-dimensional data has an underlying low-dimensional structure then the lower-dimensional structure can be extracted and explored. This is the concept behind traditional feature extraction and feature engineering methods. Meanwhile, feature extraction may employ linear combinations, projections, and transformations to extract new lower-dimensional sets of features with which to model.
If we could create an ideal set of features from scratch, then we should think about some main properties that this set would have:
Variance is basically how much the data bounces up and down. The more bouncy it is, the harder it is to predict or explain. Data with no variance has no uncertainty, so there is nothing for us to predict. Features with a high variance contain a lot of potential signals, which is a basic requirement for building a good and robust model.
Features that are highly correlated with each other are less useful and in certain cases downright harmful (when the correlation is so high as to cause multicollinearity).
Not Too Many
We wish to have a low number of features relative to our number of target variable observations. Too many features relative to observations would result in an overfit model that performs poorly out of the sample.
Today, we have demonstrated how having too many features could be harmful to machine learning models. In the real world of noisy, messy data and complicated relationships, dimensionality reduction represents a trade-off between a more robust model and lower interpretability.
Stay tuned for the second part of this series, where I’ll discuss the tools needed for combating the curse of dimensionality and explore practical algorithms for dimensionality reduction.