Imagine you’re playing darts and you have 10 darts to throw.
For your throws, the result will be one of the following:
- Low bias / Low variance – You are an excellent dart shooter.
- High bias / High variance – You should probably find another hobby.
- Low bias / High variance – You aim well, but your throw is unstable. Maybe you need a steadier hand or a better stance.
- High bias / Low variance – Your throw is stable, but your aim has a large, consistent offset, perhaps because you are not compensating for gravity.
Model accuracy can be viewed very much like the above. It can have an aim problem or bias and it can have instabilities or variance. Often it is a mixture of both.
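To make the analogy concrete, here is a minimal simulation sketch. It assumes throws land on a single axis with the bullseye at 0, and that each throw is drawn from a normal distribution. The function names, offsets, and spreads are illustrative assumptions, not values taken from a real game. It uses 1,000 throws per player instead of 10 so that the estimates are stable.

```python
import random
import statistics


def throw_darts(aim_offset, spread, n=10, seed=0):
    """Simulate n dart landings on one axis; the bullseye is at 0.0.

    aim_offset is the systematic aiming error (the source of bias) and
    spread is the standard deviation of the throw (the source of variance).
    """
    rng = random.Random(seed)
    return [rng.gauss(aim_offset, spread) for _ in range(n)]


def summarize(throws):
    """Return (bias, variance) of the throws relative to the bullseye at 0.0."""
    bias = statistics.fmean(throws)          # average offset from the bullseye
    variance = statistics.pvariance(throws)  # scatter around the thrower's own average
    return bias, variance


# Hypothetical players matching the four cases: (aim_offset, spread).
players = {
    "low bias / low variance": (0.0, 0.1),
    "high bias / high variance": (2.0, 1.0),
    "low bias / high variance": (0.0, 1.0),
    "high bias / low variance": (2.0, 0.1),
}

for name, (aim_offset, spread) in players.items():
    bias, variance = summarize(throw_darts(aim_offset, spread, n=1000))
    print(f"{name:27s} bias={bias:+.2f} variance={variance:.2f}")
```

The bias column reflects where the throws land on average, and the variance column reflects how tightly they cluster, regardless of where that cluster sits.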
Another way to look at this is if the model was a person answering a question:
- High bias person – Very opinionated person, like asking a religious person whether God exists.
- High variance person – Very confused person, like asking a toddler how to make a cake (sand and mud are legitimate ingredients).
Deep learning models are data-hungry: the more data they are trained on, the lower their errors, in both variance and bias. Each sample we collect adds new information that refines the model’s noise filtering as well as its signal detection.
The Brute Force Industry Is Born
In computer science, solving a problem by making many guesses until one happens to succeed is called a brute force solution. It indicates that the implementer relies on smart guessing rather than on a solid theoretical or analytical understanding of the problem, which is why brute force solutions are usually discouraged in academia and schools. AI is challenging this scientific approach heavily. While “it just works” was once unacceptable reasoning in a research paper, it has become quite common these days, and in recent years the entire AI industry has gone into “it works” mode. The academic crisis arising from this trend is outside the scope of this book, since the market does accept “it works” as long as the product delivers value, assuming it works.
The main problem with this approach is that when it does not work, there is little to no explanation why. “Add more data, use a bigger model” is becoming the popular response, because adding more data decreases variance error and allows the use of a bigger model, which decreases bias error.
All AI companies experience what Google Health has experienced, eventually realizing that however many data examples they have, it never seems to be enough to cover all the edge cases. Bringing a single AI skill into the real world with acceptable accuracy takes 3-5 years for simple to medium cases. But there is a bigger problem: most companies will not understand this until they fail at least once, which translates into a few years of organizational learning before reaching successful AI adoption.
Overfitting and bias issues are part of every AI product that goes to market, and they are solved only upon encountering real-world data, the data your models will meet in your customer’s environment. Here we enter the “AI launch paradox”: you want to sell your product, but you need the customer’s data for it to work with minimal errors. It is a chicken-and-egg problem.
The “solution” today is simple. A customer sees a demo similar to their case and accepts a model calibration period. Since AI is in its early days, many customers agree to this phased approach because there are often no alternatives. However, you should expect this to change as more mature solutions enter the market.
This learn-on-the-go approach is practical, but it can be a major problem for the industry. It can even “make the news”, for example when facial recognition apps perform far better on white people or on males, or differ along any other semantic property in a way that makes the product either immoral or illegal.
As of today, the common solution to overfitting and bias is to collect more data and use bigger models. We will cover other solutions in later chapters.
Can we expect these errors to go away as science progresses? No. These bugs derive from our definitions and from reality, and both change with time and context. AI progress is expected to let us work with less data and less labeling. However, the above problems are here to stay, hopefully with lower intensity.
The more important part is that bias and variance are critical ingredients to AI and a core component of it:
- Bias makes our memories useful. Without it, our experience has no meaning.
- Variance is what allows us to give attention to the important parts. Without it, we can’t attain a quick focus on the relevant areas of the environment.
Bias and variance are not necessarily bad for a model. But static bias and variance are bad for a model.
Unfortunately, neural networks are static, and the only way to change their bias or variance on a given data sample is with new training. Since our model bias and variance are static, we need to choose a balance point where both are good enough for us. This bias-variance balancing is often referred to as the bias-variance tradeoff.
Dynamic Bias and Variance
Given our task of cat analytics, describe the following image with a single word.
[Image: a small brown bird and a Siamese cat]
Anyone without the context of our goal will likely describe this image as a “bird” since this picture has a clear, centered, frontal view of a bird. This is dynamic attention to details or the ability to separate the noise according to context.
Can you guess what animal is in the following image?
You probably guessed it is a cat, and you are correct. By now your brain is heavily biased towards cat recognition, and even a very weak signal triggers the correct, biased conclusion. You know we are talking about cats in this book, so you correctly think to yourself, “yeah, here is another cat”.
Your bias comes from the knowledge base (history) you have already built in your brain (what you have read so far), and that bias helps you reach the biased and correct prediction. However, someone next to you who just looks at this picture without the historical context would find it very hard to agree that this is a cat, as the signal is very weak. Any neural network that identifies the above as a cat is heavily overfitted (and likely to identify any large black area as a cat), which means neural networks have no way to answer correctly all the time:
They can be either wrong, or strongly overfitted. This balance is often called the bias-variance tradeoff.
This is how you should think of neural networks. They are like that person next to you who has no knowledge base, which is a fancy way of saying they lack the cat context of this book.
This is an important concept: real intelligence has dynamic bias and attention built into it, driven by the given context. Therefore, bias and attention are not bad; they are powerful capabilities (and weaknesses) of our brain. Another significant difference between us and neural networks is that our bias and attention constantly change according to the context (new data) that we are exposed to in real time.
Rule #7: Dynamic bias and attention are part of AI
So, what does the above leave us with?
- Today’s narrow AI does not have out-of-the-box context, or knowledge base.
- Context is critical for creating dynamic bias (smart decisions) and attention (noise filtering); without it, the network will underfit or overfit.
- Dynamic bias and attention are part of our intelligence and are applied in the decision layer using our memory (rather than in the perception layer).
- We must compromise on model results by choosing the right balance between variance and bias.
Our memory is much more sophisticated than just a space we use to store facts. Adding memory to your base models is not trivial, yet adding memory will bring some flexibility to the variance and bias trade-off, resulting in a more accurate model.
When trying to add memory to your AI decisions, you are likely to go back to the era of manual feature engineering, with hand-coded rules. If you are lucky enough to have a lot of data or a simulator, you could add this type of dynamic capability using memory-aware networks (such as LSTMs). The memory you add should probably sit in the decision layer, on top of the basic neural network.
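One simple way to picture memory on the decision layer is a hand-coded rule that re-weights the perception model’s output by a prior taken from the current context. The sketch below is only an illustration under assumptions: the function name `apply_context`, the probabilities, and the use of Bayes’ rule in odds form are our choices here, not a method this chapter prescribes.

```python
def apply_context(model_prob, context_prior, training_prior=0.5):
    """Re-weight a classifier probability by a context prior (Bayes' rule in odds form).

    model_prob: probability of "cat" from the perception model alone.
    context_prior: how likely "cat" is given what the decision layer remembers.
    training_prior: share of "cat" examples the model was trained on (assumed).
    """
    for p in (model_prob, context_prior, training_prior):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must be strictly between 0 and 1")
    # Strip the prior baked in at training time to get the evidence from the image.
    evidence = (model_prob / (1 - model_prob)) / (training_prior / (1 - training_prior))
    # Combine the image evidence with the prior supplied by the current context.
    odds = evidence * context_prior / (1 - context_prior)
    return odds / (1 + odds)


weak_signal = 0.3  # hypothetical model output for a dark, ambiguous image
print(apply_context(weak_signal, context_prior=0.5))  # neutral context
print(apply_context(weak_signal, context_prior=0.9))  # context: a book about cats
print(apply_context(weak_signal, context_prior=0.1))  # context where cats are rare
```

The same weak signal leads to different decisions as the context changes, while the underlying network stays untouched. That is the flexibility the memory layer is meant to add.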
Bias-Variance Modeling Exercise
Let us gain some intuition for how models behave with regard to bias and variance. In this exercise, you pretend to be the AI model.
Take a look and try to describe the following images:
Take a pen and paper and describe each image twice:
- High bias description – use exactly 1 word.
- High variance description – use exactly 10 words.
What did you end up with?
When you used a single word, you likely ended up with the word “cat”, which has very little ability to capture the meaningful details of each image.
When you used 10 words, you likely ended up describing parts of the background that are essentially meaningless.
How would it look if you used 1000 words?
You can think of the AI model size as the word count of your description. In the AI world, this is measured by the number of parameters your model has.
Let us overview what we just did as if we were a neural network:
- We collected a training set of 3 examples.
- We modeled the data twice using two models: a 1-parameter (1-word) model and a 10-parameter (10-word) model.
- Using a low number of parameters (words), we got a highly biased model.
- Using a high number of parameters (words), we got a highly variant model.
While this scenario looks silly to us as humans, this is how modeling works, and a significant part of ML work is searching for the model size that best balances the bias and variance errors while modeling the data. It is a balance between a model that is too simple and one that is overly complex.
Finding the optimal model complexity is essentially balancing the bias-variance tradeoff we have just experienced with our image textual description exercise.
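The same exercise can be sketched numerically. The example below is a toy illustration under stated assumptions: the “signal” is a hypothetical sine curve with added noise, and the “model” is a piecewise-constant fit that learns one average per bin, so the number of bins plays the role of the number of words. None of these choices come from a real product; they only make the effect of model size visible.

```python
import math
import random


def make_data(n, rng, noise=0.5):
    """Draw n noisy samples of a hypothetical signal, sin(2*pi*x), on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_bins(xs, ys, n_bins):
    """Fit a piecewise-constant model: one parameter (a mean) per bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # A bin that saw no training data falls back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def mse(params, xs, ys):
    """Mean squared error of the binned model on (xs, ys)."""
    n_bins = len(params)
    total = 0.0
    for x, y in zip(xs, ys):
        pred = params[min(int(x * n_bins), n_bins - 1)]
        total += (pred - y) ** 2
    return total / len(xs)


def average_errors(n_bins, n_train=30, trials=50, seed=0):
    """Average train and test error over many small training sets."""
    rng = random.Random(seed)
    test_x, test_y = make_data(1000, rng)
    train_err = test_err = 0.0
    for _ in range(trials):
        xs, ys = make_data(n_train, rng)
        params = fit_bins(xs, ys, n_bins)
        train_err += mse(params, xs, ys)
        test_err += mse(params, test_x, test_y)
    return train_err / trials, test_err / trials


for n_bins in (1, 5, 30):
    train_err, test_err = average_errors(n_bins)
    print(f"{n_bins:2d} parameters: train error={train_err:.2f}, test error={test_err:.2f}")
```

With 30 training samples, the 1-parameter model misses the shape of the signal on both train and test data (bias), while the 30-parameter model fits the training noise almost perfectly and pays for it on the test data (variance). The middle size is the compromise.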
The Bias-Variance Tradeoff
So far, we have mentioned two very important concepts in machine learning: bias (which shows up as underfitting) and variance (which shows up as overfitting).
The machine learning expert will try to balance these two types of errors by adjusting the model size (or complexity) to the amount of available data and the problem definition:
- Small model ➜ high bias
- Large model ➜ high variance
A common mistake today is taking a large state-of-the-art open-source model and training it on a small dataset, which leads to a highly variant (overfitted) model.
Model size and complexity should grow as you collect and label more data. Using a large model (as most open-source models are) on a small dataset (as most initial datasets are) is usually a recipe for overfitting.
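A back-of-the-envelope sketch shows why model size should track dataset size. It assumes a toy model in which every parameter is simply the average of the noisy labels that fall in its share of the data, with samples spread evenly across parameters. Under that assumption, the variance of each learned parameter is the noise variance divided by the number of samples per parameter. The function name and the sizes are hypothetical.

```python
def parameter_variance(n_samples, n_params, noise_std=1.0):
    """Variance of one learned average in a toy model.

    Assumes each parameter is the mean of the noisy labels in its share of
    the data, and that samples spread evenly across parameters.
    """
    if n_samples <= 0 or n_params <= 0:
        raise ValueError("n_samples and n_params must be positive")
    samples_per_param = n_samples / n_params
    return noise_std ** 2 / samples_per_param


# Hypothetical sizes: a small initial dataset versus a larger one.
for n_samples in (100, 10_000):
    for n_params in (10, 100):
        var = parameter_variance(n_samples, n_params)
        print(f"samples={n_samples:6d} params={n_params:4d} "
              f"samples/param={n_samples / n_params:7.1f} variance={var:.4f}")
```

In this toy setting, a tenfold larger model needs tenfold more data to keep the same variance per parameter. Real neural networks do not follow this rule exactly, but the direction of the effect is the same.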
Minimizing the Overall Error
Balancing bias and variance is equivalent to minimizing the total error of our model. We want both low bias error and low variance error. Our simplified error model is as follows:
Total Error = Bias error + Variance error + Other errors
The other errors and their solutions:
- Data management errors ➜ processes and tools
- Data collection errors ➜ clear collection methodology
- Modeling errors ➜ ML experts
- Random errors (irreducible) ➜ accept that some errors are here to stay
The accurate formula for the error is as follows:
Err(Signal) = Bias² + Variance + Irreducible error
Further reading on bias-variance decomposition is recommended here.
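The decomposition can be checked with a small simulation. The sketch below is an illustration under assumptions, not a general proof: the “model” is a hypothetical predictor that estimates a constant true value `mu` as `shrink` times the mean of a small noisy sample. A `shrink` of 1.0 gives an unbiased but noisier predictor, while 0.5 gives a more rigid one with higher bias and lower variance. All names and values are invented for the example.

```python
import random
import statistics


def decompose(shrink, mu=2.0, sigma=1.0, n=5, trials=20000, seed=0):
    """Estimate bias^2, variance, and total error of a shrunken-mean predictor."""
    rng = random.Random(seed)
    estimates = []
    squared_errors = []
    for _ in range(trials):
        # Each trial is a fresh, small training set drawn around the true value mu.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimate = shrink * statistics.fmean(sample)
        estimates.append(estimate)
        # Score the prediction against a new noisy observation.
        new_y = rng.gauss(mu, sigma)
        squared_errors.append((new_y - estimate) ** 2)
    bias_sq = (statistics.fmean(estimates) - mu) ** 2
    variance = statistics.pvariance(estimates)
    total = statistics.fmean(squared_errors)
    return bias_sq, variance, total


irreducible = 1.0  # sigma ** 2, the noise in the new observation
for shrink in (1.0, 0.5):
    bias_sq, variance, total = decompose(shrink)
    print(f"shrink={shrink}: bias^2={bias_sq:.2f} variance={variance:.2f} "
          f"irreducible={irreducible:.2f} "
          f"sum={bias_sq + variance + irreducible:.2f} measured total={total:.2f}")
```

For both settings, the measured total error should land close to the sum of the three terms, with the balance between bias and variance shifting as the predictor becomes more rigid.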
Bias-Variance Bugs Example
Look at these cat images and try to identify a clear bias bug and a clear variance bug in relation to a simple cat detection model:
Bias bug: all the cats are lying down, so we have a bias against other positions, such as standing cats.
Variance bug: the background always contains a sofa, so the neural network will pick up sofa features as part of cat features.
If a future customer someday asks us to analyze indoor cat behavior around the house, the sofa background will suddenly become meaningful (for now we don’t care about the sofa; it is noise). The variance issue might then become a bias issue, just by changing our model definitions and goals.
As you can see, the best way to handle your data bugs is to “debug” them: become extremely familiar with your data so that you can identify the source or cause of each error. Once an error is identified, adjust your modeling architecture to handle it. We will cover this adjustment process in more detail later.
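Part of getting familiar with your data can be automated. The sketch below assumes each image comes with a few hand-labeled attributes and flags any attribute where a single value dominates the dataset. Such an attribute points either to missing coverage, such as the poses we never see, or to a background the model may latch onto. The attribute names, the records, and the 90% threshold are hypothetical.

```python
from collections import Counter


def dominant_values(records, threshold=0.9):
    """Return {attribute: (value, share)} for attributes dominated by one value."""
    flagged = {}
    attributes = {key for record in records for key in record}
    for attr in sorted(attributes):
        counts = Counter(record.get(attr) for record in records)
        value, count = counts.most_common(1)[0]
        share = count / len(records)
        if share >= threshold:
            flagged[attr] = (value, share)
    return flagged


# Hypothetical metadata for a small cat dataset.
dataset = [
    {"pose": "lying", "background": "sofa", "fur": "black"},
    {"pose": "lying", "background": "sofa", "fur": "white"},
    {"pose": "lying", "background": "sofa", "fur": "ginger"},
    {"pose": "lying", "background": "sofa", "fur": "black"},
]

for attr, (value, share) in dominant_values(dataset).items():
    print(f"{attr}: '{value}' appears in {share:.0%} of the images")
```

A check like this does not say whether a flagged attribute is a bias bug or a variance bug. That still depends on your model definitions and goals.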
Since we are debugging bad information transferred into our models, it is time to dive deeper into the process of information flow and aggregation and the issues it may bring.