We start with a very simple model of our cat expert skill, in order to detect if cats exist in the image. Our bot is like a person who knows a single word: “cat”. We expect our bot to say this word whenever it “sees” an image containing a cat.
As we already know, we need to collect data and label it to train our model. We collect images of cats from the internet, label them, and then do some training. A few hours later, our first model is ready for testing. We get amazing results, correctly classifying 99% of the examples we have collected. We go out to the street to test our cat detector, and it fails miserably. We look at the data to find the mistakes and see the following:
Our training set contains mainly clear images of cats like these, while the real-world images look quite different:
Congratulations, we have just overfitted our first model.
Can you see why our model fails?
Neural networks have poor generalization capabilities, which is a fancy way of saying they need many examples of the same thing in different shapes, positions, sizes, colors, and backgrounds in order to learn how to properly detect a cat, or in more professional terms, to “generalize” the subject’s features into a model with a low error rate.
Usually, high accuracy (99%) early in the development of a neural network is a strong sign of overfitting. Overfitting is a situation in which a network performs very well on examples it has seen but very poorly on new examples it has never seen, like the issue we saw above.
The only way to make sure we are not overfitting is to add more examples and verify that our model’s accuracy does not drop on the new samples.
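To make the warning sign concrete, here is a minimal sketch of the effect, assuming scikit-learn is available; the data is a synthetic stand-in for our cat images, not the real detector. A model that memorizes its training set looks perfect on seen data and much worse on fresh samples:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for our labeled cat images: 5 features, noisy labels.
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# An unconstrained tree can memorize every training example, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)   # near 1.00: looks "amazing"

# Fresh samples from the same process expose the gap.
X_new = rng.normal(size=(200, 5))
y_new = (X_new[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
new_acc = model.score(X_new, y_new)         # noticeably lower

print(f"seen data: {train_acc:.2f}, unseen data: {new_acc:.2f}")
```

The 99% we celebrated was the first number; the street test was the second.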
Rule #6: The only solution to overfitting is more data
You might be thinking that test sets and validation sets are the way to prevent overfitting. This is indeed correct, but only in the local scope: these data divisions will not prevent global overfitting of your model, since at any given time most of your data tends to come from the same distribution (customer, location, time, …). Just change the distribution (location, customer, sensor, …) and the new distribution will “expose” the overfitting to the old one.
Initial projects should pay extra attention to data collection variance. Highly variant data greatly reduces overfitting, yet it tends to be much more expensive, since it involves collecting data from many sources, locations, and sensors.
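A small sketch of this distribution effect (hypothetical data, with scikit-learn assumed to be installed): a validation split drawn from the same source looks great, while data from a new distribution, where a spurious cue disappears, exposes the overfitting:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000

# Source A (our first customer): a spurious feature, e.g. "collar visible",
# happens to track the label almost perfectly in this distribution.
signal = rng.normal(size=n)
spurious = signal + 0.1 * rng.normal(size=n)
y = (signal > 0).astype(int)
X_a = np.column_stack([signal + rng.normal(size=n), spurious])

X_tr, X_val, y_tr, y_val = train_test_split(X_a, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
acc_same = model.score(X_val, y_val)        # local validation: looks great

# Source B (a new customer): same true signal, but the spurious cue is gone.
signal_b = rng.normal(size=n)
y_b = (signal_b > 0).astype(int)
X_b = np.column_stack([signal_b + rng.normal(size=n), rng.normal(size=n)])
acc_shift = model.score(X_b, y_b)           # the overfitting is exposed

print(f"same-distribution accuracy: {acc_same:.2f}")
print(f"new-distribution accuracy:  {acc_shift:.2f}")
```

The model was never wrong by its own validation set; it was wrong about the world.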
Before fixing our errors, we need to understand them better; it is very hard to fix what you don’t understand.
The biggest change deep learning has brought to the machine learning industry is automatic feature extraction from a signal. The features generated during the training process capture what makes something a cat, and any error we see is an issue with how these properties are detected or classified. It’s time to dive deeper into these properties, or features, as professionals often call them.
Let us go back to the cartoon cat, and put it next to a real cat image:
Our brain can detect the cartoon drawing as a cat because we are able to extract the meaningful characteristics of a cat and correlate them between the two images. We can imagine our brain’s correlation as something like this:
Inside our brain, we store the characteristics of a cat, its entity descriptor: the pointed ears, whiskers, nose structure, body structure, and many more properties. We do not try to remember every cat we have seen during our lifetime, just the symbolic representation of a cat. At any given point our brain takes the entity descriptor of a cat and correlates it with the image we see, even if the image shows only a few parts or just a vague outline, like in these examples:
The classic (pre-neural-network) computer vision approach to a cat detection algorithm involved feature definition by domain experts and then feature coding by an algorithms developer. Machine learning experts tried to detect each feature (as defined by the domain expert), and when enough features were detected, a classification algorithm decided whether it was indeed a cat. This process was an attempt to manually code the entity descriptor. A classic computer vision model with 70% accuracy was considered very good and took a few years of effort; neural networks tend to reach 85% early on, often within weeks.
With neural networks, the process of manually coding an algorithm to detect a cat’s whiskers is gone. As we will see, it is still critical to define the features of the subject (cat) with the domain expert (a cat expert), but the long, expensive task of coding the recognition of many features by top computer scientists is no longer needed. We just collect more cat examples, and our neural network will learn these features automatically during the training process, creating a digital approximation of the entity descriptor.
This automatic feature learning is at the heart of AI democratization, and one might argue that the first jobs AI eliminated were those of computer vision feature experts.
The neural network is an approximation, which means it has errors. These errors mean our network is either giving the wrong attention to the background, capturing meaningless features (noise), or ignoring important features of our subject (signal), like whiskers. Since the process is automatic, we do not control what the network “chooses” to look at.
We distinguish between two types of errors:
- When our network ignores important features (partially or fully) we call it bias error.
- When our network pays attention to noise as an important feature, we call it variance error.
These errors are the bugs of the data world in the AI era. Can you imagine a team developing a software product where some of the team members are not familiar with the concept of bugs?
In AI today it is not rare to see business, product, and management team members make the mistake of treating these terms as technical details outside their professional scope. They are critical concepts for anyone working on products with machine learning capabilities as a core value proposition.
Hello Data Bugs – Overfitting & Underfitting
As we have seen, a feature analysis has two error types:
- Bias error, also known as underfitting.
- Variance error, also known as overfitting.
What is the source of these bugs? And how can we fix them?
Bias error source
Bias is the difference between our model’s results and the correct results, where the correct results are subjectively defined by the AI creators (the domain and machine learning experts). To illustrate: if our training data lacks examples of Siamese cats, the model is more likely to miss Siamese cats. This is underfitting.
Variance error source
Variance error comes from the environment around our subject: the noise, or the background. It means our model treats some unrelated environmental features as cat parts, even though these features have nothing to do with cats. To illustrate, imagine that many of the cats in the training set wear a collar. The network is likely to learn that the collar is a cat feature and will identify a picture of just the collar as a cat, even when no cat is present in the image. This is overfitting.
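Both data bugs show up even in the simplest curve-fitting experiment. The sketch below (pure NumPy, with toy data standing in for our cat features) fits polynomials of increasing capacity to a noisy sine wave: the low-degree fit underfits (bias error), while the high-degree fit memorizes the noise and does worse on unseen points (variance error):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=30)   # signal + noise

x_new = np.linspace(-1, 1, 300)                 # unseen points
y_true = np.sin(3 * x_new)                      # noiseless ground truth

def fit_and_error(degree):
    """Fit a polynomial and return (training error, new-data error)."""
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_true) ** 2)
    return train_err, new_err

results = {d: fit_and_error(d) for d in (1, 4, 15)}
for d, (tr, new) in results.items():
    print(f"degree {d:2d}: train error {tr:.3f}, new-data error {new:.3f}")
# Degree 1 underfits: high error everywhere (bias). Degree 15 overfits:
# near-zero train error, worse new-data error (variance).
```

The bias side is fixed with more capacity or better features; the variance side, as Rule #6 says, is fixed with more, and more varied, data.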
Our brain can capture a context-adjusted symbolic representation of things; it is remarkable at identifying meaningful features as well as filtering irrelevant noise. Neural networks, on the other hand, are not, and this is the main reason neural networks are a narrow AI tool.
While this subject is beyond the scope of this book, it still gives us a core assumption to carry forward in our development: neural networks need a lot of examples to recognize narrowly defined things.
With this understanding, we prepare for a much broader data collection and labeling operation. But where do we get the data from?
Real-World Data as the Key
With such high sensitivity to data distribution changes, we understand that it is better to collect and work on real-world data, where the real world is defined by the data our customers will see.
We install 10 cameras in popular cat locations, and a few months later we have a lot of data that we have collected, filtered, cleaned, and labeled. Our model performs well, with 91% accuracy on almost every image it sees. We even invite friends and family to test it and get great feedback. We are ready for launch.
We find our first customer after a successful demo and set a pilot:
- Our customer is OK with just cat counting for phase 1, delaying detection of cat types to phase 2.
- Our customer has 10,000 cameras across the city; the pilot will start with just 100.
Notice that our customer’s pilot data volume is 10X our internal pilot, and the customer’s production is 100X the customer’s pilot.
A few weeks later, after hard work on setup and data engineering, the AI does not work as well as in our pilot. It is OK most of the time but makes silly mistakes and is far from the 91% accuracy we saw (and promised). Some of the pilot mistakes:
More edge cases…
We gradually understand that the overfitting and bias challenges are much bigger than we initially thought.
The above model crash scenario is not rare; on the contrary, it is a typical phase all AI products go through as they scale to more data, locations, scenarios, and sensors, and every company working with deep learning networks struggles with these behaviors when developing AI products. The correct way to develop AI is to create a data flywheel effect from day one, putting in place the infrastructure to collect and learn from our customers and users as we launch. This is tricky: we need the data to get a working solution, yet we need a working solution to get users and acquire the data. When an AI project fails to create a data flywheel effect, it fails, even if it is done by Google.
The Google Health Diabetes Detector Case
Research papers from the last few years have shown that analysis of a retina image (an image of the back of the eye) can detect diabetes early on and prevent blindness and other complications of the disease. This is how a typical retina image looks:
As you can see, the environment (noise) is stable, and the signal (blocked or leaking blood vessels, dot-like artifacts) is well defined for human experts. Human experts (retinal specialists) identify the problem with around 90% accuracy.
In 2016, Google announced an AI model that could replace human experts, allowing massive automatic diabetic retinopathy screening. Google did extensive testing and managed to show accuracy above 90% in the lab, matching the human experts’ performance. With great lab results, it was time for a pilot.
Thailand’s Ministry of Health set a goal to screen 60% of the population, 4.5 million patients. In the current manual process, a nurse captures a patient’s retina image and sends it to a retina expert; results are then given within 10 weeks. Thailand has 200 retinal specialists in the entire country. The value of AI could not be clearer: 10 minutes instead of 10 weeks, and screening an entire country instead of a fraction of it. Google went on a real-world pilot with the Thai government in 2018.
During the pilot, the experience was amazing when the AI worked, but Google learned that nurses scan dozens of patients in an hour, sometimes in poor lighting conditions. There were bad images, bad devices, and sometimes internet connectivity issues. 20% of the patients got a diagnosis error, forcing them to do it all over again, in many cases when there was no indication of the disease at all. On top of that, cloud connectivity slowed down the entire process, leading to more mistakes and frustration. By 2020, the pilot had ended with a significant amount of patient frustration.
Emma Beede, a UX researcher at Google Health: “We have to understand how AI tools are going to work for people in context—especially in health care—before they’re widely deployed.” Google Health is now working on fixing these issues and I am sure it will be successful.
Google launched an AI product without a data feedback loop. Most projects that do so are likely to end with the same conclusion: feedback is critical for creating a data flywheel effect.
Managing continuous learning data pipelines, combined with human experts’ feedback, is a complicated process: so complicated that Google, a global AI leader, had significant challenges deploying very well-defined, narrow models into mass production, mainly because real production data managed to surprise the AI model and workflows.
Another aspect of deep learning model development is that it takes a very long time to collect, analyze, and draw conclusions from the data. While getting release feedback on traditional software takes weeks or months, AI requires years to reach maturity.
Here is Google’s timeline for this project:
Data feedback loops are much slower than traditional software or code feedback loops (bug fixes), which makes product ramp-up slower and requires a higher initial investment from companies who wish to deploy such solutions.
With eliminating bias and variance errors being so critical to our time to market, everyone in the company needs to understand them properly: management, business, operations, researchers, and developers.