In chapter 1 we set the goal for most of the AI out there: to automate the cognitive part of human tasks.
In chapter 2 we saw that AI has sparked the data era with a new paradigm of processing and storage.
In chapter 3 we defined the basic terms of data, information and knowledge and showed how the basic loop of data collection, labeling and modeling mimics our thinking loop.
In chapter 4 we saw that knowledge is learned by communication of information pieces over data where the labeling plays the role of information transmission.
In chapter 5 we saw that a single or few iterations are hardly enough to make successful AI due to data bugs.
In chapter 6 we dived into those data bugs and their meaning.
In this chapter we will close the loop by gaining intuition on what information is, how it is aggregated (e.g., forming a knowledge base) and the information dynamics of a single loop iteration.
From a high-level perspective, all machine learning products are trying to create a data flywheel effect.
This effect allows a given product to dominate its market over the competition with very simple logic: the best product will have most of the users, who will generate most of the data, which will be used to make the product even better.
Here is the AI launch paradox again. It is very similar to the junior paradox, where a recent college graduate looking for a job needs experience to get a job and a job to get experience.
This is the core power behind the AI rush you see today among corporations and nations, even though most AI activities do not generate profit. It is a data collection race towards fully functional solutions: the market today is willing to accept partially functional solutions simply because there is no alternative and competition is low. You can expect that, as the market matures, spinning up data flywheels from scratch will become very hard, since you will have to win over mature solutions backed by data collected over years, use cases and locations.
Luckily for us, we are early in the market with our vision, and our cat analytics app doesn’t have much competition or many alternatives. This means buyers are willing to deploy a solution with limited capabilities and a reasonable error rate, while we, on the other hand, promise that the solution will learn and improve as it is exposed to more of the customer’s data. This improvement process is the distillation of information out of the data, or the ongoing compression of our target signals’ information into a representative training set. A simpler way to put it: the more cat images we collect from different types, scenarios, and environments, the better our AI will be.
We defined information as an observation over the data. Let us dive deeper into the concept of information.
What is Information?
We use the word information easily, almost every day, and we usually do not pay much attention to its meaning, yet in AI products information is everything. But what exactly is it, and can we measure it?
From Wikipedia: The English word “information” apparently derives from the Latin nominative informatio: this noun derives from the verb īnfōrmāre (to inform) in the sense of “to give form to the mind”, “to discipline”, “to instruct”, “to teach”.
Can you quantify what exactly you added to your brain when you learned to read, ride a bicycle, drive a car, throw a ball, or sense insults and humor? It is very hard, yet this is exactly what neural networks are trying to do during training: compress all the information that exists in the examples they are exposed to into a collection of numbers that magically captures the definition of the target signal, a cat in our case, mimicking the cat entity descriptor we have in our brain. Can we measure information like we measure room temperature or car speed? The answer is yes.
Information is measured in bits (or shannons). You can imagine 1 bit of information as the answer to a single random yes/no question, e.g., a coin flip result. Here again, the pure definition of the information unit matches our previous observation: information is an answer to a question, and one unit is an answer to a single yes/no question. Note the naming confusion: information and data are both measured in bits.
Are you familiar with the “20 questions” game? In this game one player chooses a secret word and the other player has 20 yes/no questions to identify it. This is an information communication game: each answer potentially carries a single bit of information, so the asker has a communication budget of up to 20 bits to reveal (decode) the hidden word. Just like there is no point in asking the same question twice in that game, there is no point in having the same identical example in our training set twice. Neural networks are like many small yes/no questions chained together in layers, organized to answer more complex questions, and the training process is essentially calibrating these micro questions to answer the big question using the supplied examples.
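The 20-bit budget can be made concrete with a small sketch. It assumes the secret is one entry in a known, ordered list of candidates and that every question is of the form “is the secret at this position or earlier?”; the names `guess_secret` and `vocabulary` are hypothetical and exist only for this illustration. Under these assumptions each answer halves the remaining candidates, so 20 answers are enough to single out one entry among 2^20 = 1,048,576.

```python
def guess_secret(candidates, is_secret_at_or_before):
    """Find the secret's index with yes/no questions, halving the candidates each time.

    candidates: ordered list of possible secrets.
    is_secret_at_or_before(i): the other player's yes/no answer to
        "is the secret at position i or earlier in the list?"
    Returns (index, number_of_questions_asked).
    """
    low, high = 0, len(candidates) - 1
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if is_secret_at_or_before(mid):
            high = mid          # yes: the secret is in the lower half
        else:
            low = mid + 1       # no: the secret is in the upper half
    return low, questions


# Hypothetical vocabulary of 2**20 "words" (plain numbers, for illustration only).
vocabulary = list(range(2 ** 20))
secret_index = 777_777

found, asked = guess_secret(vocabulary, lambda i: secret_index <= i)
print(found == secret_index, asked)
```

Repeating a question would waste one of the 20 answers without shrinking the candidate list, which is the same reason a duplicated training example adds nothing.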
Can we transform the “what is a cat” question into a list of yes/no questions?
If we tried to break the complex question “what is a cat?” into a collection of simpler yes/no questions, some of them might ask whether the animal has:
- pointed ears
- 4 legs
- a tail
- 2 eyes
Each question we add contributes another (potential) bit of required information, yet it is hard to imagine us creating a full and comprehensive question list. The neural network training process creates the list automatically for us (note that, unlike the list above, the learned list is not explainable to humans).
The complete information set that answers the question “what is a cat?” given image pixel values is unknown, and it is very hard to separate the big question into the millions of small questions that define a cat, yet our brains do that seamlessly.
The biggest breakthrough of AI is that it learns the answer automatically, breaking the big question into smaller ones and creating a magical structure of features (represented by neural network weights) that estimate how cats look. That is the main reason some people compare neural networks to our brain.
Note that the information capacity of a single answer is indeed 1 shannon only if that answer is fully independent of all the others.
We continue with the notion that AI is a communication process, so the formal definition of information becomes very handy. The Shannon entropy of an information source whose outcomes x occur with probability p(x) is defined as follows:
I = −Σ p(x)∗log2(p(x))
If we measure the entropy of a single, non-biased coin flip:
I = −0.5∗log2(0.5) − 0.5∗log2(0.5) = 0.5 + 0.5 = 1 bit (shannon)
In case our coin is fully biased towards heads or tails, the result is 100% certain, which makes it meaningless: it carries no information (0∗log2(0) is taken as 0):
I = −0∗log2(0) − 1∗log2(1) = 0 + 0 = 0 bits (shannons)
Looking at the distribution of a single bit, we see that entropy is maximized when both outcomes of a single true/false experiment are equally likely.
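The coin calculations can be checked with a few lines of code. This is a minimal sketch; the helper name `entropy_bits` and the list of bias values are choices made for this illustration. Outcomes with zero probability are skipped, which matches the convention that 0∗log2(0) counts as 0.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution.

    Outcomes with probability 0 are skipped (0*log2(0) is taken as 0).
    """
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# Entropy of one coin flip for several biases towards heads.
for p_heads in (0.5, 0.7, 0.9, 0.99, 1.0):
    print(p_heads, round(entropy_bits([p_heads, 1 - p_heads]), 3))
```

The fair coin gives exactly 1 bit, the fully biased coin gives 0, and every bias in between gives less than 1 bit.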
If we think of data collection as an entropy source, the above notion can be used to decide which data is the most valuable for your collect-label-train pipelines. You could speed up information collection by creating coverage maps from your model features and, during inference, focusing your data collection efforts on the weakly covered regions.
How Much Information Do You Need to Define a Cat?
If we could make a list of accurate yes/no questions that maps every image in the world to cat / no cat, the number of questions in that list would be the information capacity of a cat. Yet we don’t know the question list, which is why we trained a neural network in the first place. Let’s approximate.
Approximation 1 – Existing neural network size
If we take a common neural network which can detect cats reasonably well, we get a network with a few million parameters.
Approximation 2 – Complete cat description
If we agree that the Wikipedia page on cats gives a good description of cats, we get about 20 million bits of information, or 2 million after compression.
As you can see, we can approximate the “what is a cat” information to be in the range of millions to tens of millions of bits. A single 400×400 color image of a cat contains ~5 million bits of data, and we need at least a few thousand images to train a reasonable model. Why do we need about 1000x more bits of data than bits of information?
New data that does not add new information is worthless and a waste of resources. In professional terms this is called dependent data: data that resembles the data we already have. That is why we could not just take a billion images (data bits) of the neighbor’s cat to train the perfect cat detector. After the first few images our model is familiar with that specific cat’s features, and more images of the same cat will not add new information. More information means images of other cats, in other locations, sizes, colors, positions, and environments. If we don’t add more diverse cat images to our collection, our model will be overfitted to detect our neighbor’s cat. So every example we collect is indeed 5 million bits of data, but the information it adds is very low, and when very similar examples (images) already exist in our knowledge base, it adds practically 0 bits of information.
So here is the truly remarkable magic of deep learning: it learns how to separate the big question into millions (or billions) of small questions, automatically, from the labeled data. While we cannot break the cat question into millions of sub-questions ourselves, we can collect millions of images containing cats and let the neural network figure out the question list during the training process.
Neural network training is essentially the process of creating an automatic description of a cat: finding the numerical definition of the different features defining a cat and the relations between them. One can imagine that each cat example adds another feature, property, rule, or connection to the neural network, and when enough of these pieces are collected the network can identify cats by itself without confusing them with a mouse, a dog, or a car. The training process compresses the cat data examples into a series of numbers. We cannot really tell why these numbers are structured the way they are (neural network decision explainability is a big pain point), yet we can tell they contain the question list which answers the question “what is a cat” all the way down to the pixel level.
This teaching process is very similar to the process we go through in our own education as humans. The big difference is that while traditional teaching has both a human teacher and a human student, in the AI world the student is a machine, and the way lessons are delivered is quite different: there is no classroom, no lessons and no books, just positive and negative examples.
Let us assume we used only images of Siamese cats for our training.
From our model perspective, all cats in the universe are Siamese cats.
Which one of the following is likely to add the biggest amount of new information regarding cats?
What information is added by each one of the above?
- A young Siamese cat
- A drawing of a Siamese cat
- A scribble of a cat
So the largest amount of information is most likely added to our training set by cat #3, the scribble: it has the largest information capacity in the context of our existing data. Notice, however, that example #3 contains features that we will never encounter in real life. As of today neural networks are narrow, which means it is better to avoid non-realistic examples even though they add more information, unless you really know what you are doing. Future networks might be able to utilize these information-rich (non-realistic) examples to generalize better.
The more different (and rare) the features of a new signal are relative to our existing signal features, the more information it adds to our dataset.
Now imagine you are developing a self-driving car: images of an empty road will not improve your car’s perception, while a “deer on the road” image represents a much rarer event and therefore adds much more information.
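The link between rarity and information can be sketched with the standard self-information measure, −log2(p), for an event with probability p. The event frequencies below are hypothetical numbers picked only to illustrate the scale of the difference; they are not measurements.

```python
import math


def surprisal_bits(probability):
    """Self-information of an event in bits: the rarer the event, the larger the value."""
    return math.log2(1 / probability)


# Hypothetical frequencies of road scenes, chosen only for illustration.
event_probability = {
    "empty road": 0.95,
    "pedestrian crossing": 0.01,
    "deer on the road": 0.0001,
}

for event, p in event_probability.items():
    print(f"{event}: {surprisal_bits(p):.2f} bits")
```

With these assumed frequencies, a single deer image carries far more bits than an empty road image, which is why edge cases dominate the value of a collection.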
The above is critical for anyone developing AI systems: the process is not about collecting data, it is about collecting information, which means edge cases. The more data you have, the harder it gets to find new information. This behavior is what makes data collection and labeling a diminishing return process, in which the next leap in our AI accuracy can cost 10x more than our investment so far.
Our task of collecting data is better defined now: we need to collect enough information, meaning enough different and unique images of cats, for our network to properly capture the signal with acceptable accuracy.
The typical availability of the information of a given signal looks as follows:
What we learn from the above is that around 50% of the cases needed for ideal information modeling are not common, and covering them will require a lot of data, time, and budget. This is the information long tail, often referred to as edge case collection. Luckily for us, our customers do not measure us against the ideal model (which is unknown); they measure us against what they see on their own data distribution. This means overfitting and bias can work in our favor this time: they make the accuracy measured by the customer much better than it would be if tested on the overall complete distribution.
Training Set Balance
A balanced training set is a training set which contains an equal number of examples of each case type we are looking for. Looking at our existing collected data, we have the following data distribution:
This is not a surprise since 95% of cats are domestic cats, more commonly known as house cats.
Naturally, there is no balance, which will cause our model to be heavily biased towards house cats. In an ideal world we want our training set to be varied (have all types of cats) and balanced (have enough examples per cat type), but generating such training sets takes years (how many Bengals have you seen lately?), so we must compromise.
The compromise will be based on our business needs: what is good enough to bring value to our customers or users. While self-driving cars and medical AI systems have very little room for compromise, luckily for us, our cat analytics customers want good cat detection in general and have not asked for special alerts on Bengal cats just yet. Cat type detection is something we agreed to defer to phase 2 of the system deployment.
If you have a mature product, you probably already have an abundance of data (from your existing customers/user base), but what happens when you start from scratch? Enter the measured accuracy.
The measured accuracy is the biased accuracy that is measured by the customer: the accuracy as it is measured only on the customer’s data, as it is sampled naturally. Imagine that, from your model’s perspective, only this customer exists in the world (which is correct until the second customer joins).
Overfitting/bias means good accuracy on the data you have already collected. So while your model is not capturing the cat information signal ideally (i.e., generalizing well on cats of all kinds, scenes, and positions), users will see it functioning well most of the time, since it is trained on their data and heavily biased and overfitted towards it. Let’s take an example to demonstrate the measured accuracy dynamics.
Our first customer is monitoring cats in cities; 99% of the time they will capture images of house cats like these:
But occasionally (1%) we get rarer types and scenarios, like these:
If we define two categories, common cat and exotic cat, let us calculate the real accuracy versus the measured one:
Our customer takes 1000 images a day:
- Our accuracy on the common cat is 90%, and if it drops we have a constant daily supply of common data to fix it; after all, it is common.
- Our accuracy on the exotic cats is very low, 10%, and it will take months of data collection to improve it.
But what is the accuracy measured by our customer?
Every day our customer uses our analytics on 990 common cats and 10 more exotic cats:
Measured accuracy = (990∗90% + 10∗10%) / 1000 = (891 + 1) / 1000 = 89.2%
As you can see, the bias of our data and model towards the common cat is what keeps the accuracy measured in the customer’s real world high. Our customer perceives an accuracy of 89.2%, even though our model is only 10% accurate on the less common cat types.
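The same arithmetic as a small sketch, using the numbers from the example (990 common and 10 exotic images per day, 90% and 10% accuracy). The function names are hypothetical, and the second function, a plain average over the two categories, is added here only as an illustration of what a perfectly balanced test set would report.

```python
def measured_accuracy(daily_counts, per_category_accuracy):
    """Accuracy as the customer sees it: per-category accuracy weighted by the customer's own data mix."""
    total = sum(daily_counts.values())
    correct = sum(daily_counts[c] * per_category_accuracy[c] for c in daily_counts)
    return correct / total


def balanced_accuracy(per_category_accuracy):
    """Plain average over categories, i.e. what a perfectly balanced test set would report."""
    return sum(per_category_accuracy.values()) / len(per_category_accuracy)


daily_counts = {"common": 990, "exotic": 10}
accuracy = {"common": 0.90, "exotic": 0.10}

print(f"measured by the customer: {measured_accuracy(daily_counts, accuracy):.1%}")
print(f"on a balanced test set:   {balanced_accuracy(accuracy):.1%}")
```

The gap between the two numbers is the bias working in our favor: the customer’s data mix hides the weak exotic category almost completely.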
We have just seen something important: our data flywheel effect allows us to learn a new customer’s environment fast, and if that specific customer observes good performance, that is acceptable. So learning fast with every new customer, producing good results locally for that customer while we keep building our more generic training sets and models, can be a successful strategy for scaling our solution’s usage.
We can continue our cat development while collecting more data over time, gradually increasing the information coverage (i.e., special cases) we collect from our customers’ data, and still serve acceptable measured (biased) accuracy. A simpler way to say it: if our users are happy, we can deploy the product, keep collecting data and improve.
The above is an important point all AI solutions encounter: the data you need is in your customers’/users’ production environment. Building AI requires all units to work together to create the data flywheel effect:
- Business: Clearly communicate performance improvements over time and handle data usage rights.
- Operations & IT: Data pipelines from production (mass amounts) to research (interesting cases) are critical for information flow analysis.
- Research and development: Adjust the model’s granularity to the data volumes and error types over time as we scale our deployments.
Measuring Information Exercise
We gained some intuition on information capacity. Let us take this intuition one step deeper with an exercise.
Let us take a simple toy example: a 4-pixel black and white image and a simple neural network that processes it. No worries, all that is required is still some common sense.
4-Pixel Information Image
While a 4-pixel black & white image has no use in the real world (it carries very little information), it is a greatly simplified case that lets us dive in and learn how neural networks respond to data examples or the lack of them. A 4-pixel black and white image can be represented using 4 data bits. Whatever we see happening here also happens with larger images, audio tracks, signals and text files.
If we have a 4-pixel, 2×2, black & white image, then our complete data distribution space has the following 16 categories using 4 bits:
Let us mark these 4 pixels with Bottom/Top and Left/Right markers (TL, TR, BL, BR).
Given a random image we can immediately classify it into one of the 16 known categories by asking 4 yes/no questions:
- Is TR black?
- Is TL black?
- Is BR black?
- Is BL black?
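The four questions above can be written as a tiny classifier. This is a sketch under assumptions: the image is represented as a dictionary keyed by pixel marker with 1 meaning black, and the category numbering from 1 to 16 is an arbitrary choice made here, since the original category figure is not reproduced in the text.

```python
from itertools import product

QUESTIONS = ("TR", "TL", "BR", "BL")  # same order as the question list above


def category(image):
    """Map a 2x2 image to one of 16 categories from four yes/no answers.

    image: dict such as {"TL": 1, "TR": 0, "BL": 0, "BR": 1}, where 1 means black.
    The numbering (1..16) is an arbitrary choice made for this sketch.
    """
    index = 0
    for name in QUESTIONS:
        index = index * 2 + (1 if image[name] else 0)  # each answer contributes one bit
    return index + 1


# Enumerate every possible 2x2 black and white image.
all_images = [dict(zip(QUESTIONS, bits)) for bits in product((0, 1), repeat=4)]
print(sorted(category(img) for img in all_images))
```

Every one of the 16 possible images lands in its own category, so four independent yes/no answers, 4 bits, fully describe this tiny world.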
A possible neural network that does this classification might look as follows:
This neural network is not smart, it just connects all possible results from our input pixels to the possible outcome. If we now feed it with data, it will learn how to map the 2×2 image to its category.
Each neuron in the neural network is “in charge” of answering a single question: when the answer is yes, we call it an activation, and the neuron is excited as a result.
As we have seen, the network takes a big question (“what is the image category?”) and disassembles it into a list of much smaller questions (is pixel 1-4 set?). The neurons are organized in layers (as you can see above), and during the training process on the data, each neuron in a layer learns when it is time to say yes. We do not control how each neuron responds; we only control the data we feed to the training process and some parameters of the training itself, which are not relevant for this analysis.
Time for some data and information testing.
It is easy to generate our data synthetically and do some testing. We generate 10000 random samples and train the neural network above, then run the network on 1000 new random examples, expecting it to have learned from the training. Out N marks how much confidence the network has that the example belongs to category N (1 is 100%), and the Category column shows the real category, or the ground truth:
You can see the network learned perfectly in our 4-bit analysis; this simple network can perfectly classify anything in the world of 4-pixel images. The above view is called a confusion matrix and is a basic tool for reviewing model results. A perfect diagonal line is what every machine learning expert would like to see from her model.
But what happens when we exclude specific examples during the training process? We create a hole in our data distribution: we remove all samples of category 4 and retrain on the incomplete set.
After retraining with a hole in our dataset we get the following confusion matrix:
See what happens:
- On the category 4 row, the network is simply going crazy, unable to decide how to classify a data sample of a kind it has never seen: it gives 50% confidence (0.5) to category 4 and 100% to categories 5, 6 and 12.
- The out 4 column is never 1: the network will never decide that an image is a category 4 image.
- The network did not even learn that a single sample must be mapped into a single category, and it assigns multiple categories to the same sample, a clear overfitting behavior.
This erratic behavior, which we see with 4-bit images and a single data hole, exists in all AI products, and the only way to fix a network is by collecting the missing data pieces and retraining the network.
How well the data covers the target signal’s properties is called the signal coverage. Our labeled data was sampled from the world on the assumption (wrong, as we already know) that it represents the world.
This is a common signal coverage/overfitting behavior, and every AI system has the following data domains:
Each data sub-domain is assumed to represent the distribution of its containing data domain. This assumption is wrong until enough data has been collected, a process which usually takes years.
4-Bit Information Signal Coverage
In an ideal world we would collect (and label) only 16 images. More than that means we have redundant information which adds no value (but costs us money to collect, store, label and process), and less means we have data holes (missing information) resulting in errors like the ones we have seen above. The data we collect from the real world will often be both redundant and incomplete, as follows:
When talking about information coverage, the important question is: “do we cover the real world properly?”, where too much data means wasted time & money and too little means our model has a high error rate due to missing information on the target signal.
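Redundancy and holes are easy to count in the 16-category world. The sketch below assumes each collected image has already been mapped to its category number; the `collected` list is hypothetical data invented for this illustration, and `coverage_report` is a name chosen here.

```python
from collections import Counter


def coverage_report(collected_categories, n_categories=16):
    """Summarize how well a collection covers categories 1..n_categories."""
    counts = Counter(collected_categories)
    holes = [c for c in range(1, n_categories + 1) if counts[c] == 0]
    redundant = sum(n - 1 for n in counts.values())  # every copy beyond the first
    return {
        "coverage": (n_categories - len(holes)) / n_categories,
        "holes": holes,
        "redundant_samples": redundant,
    }


# Hypothetical collection of 20 labeled images, given by their category number.
collected = [1, 1, 1, 2, 3, 3, 5, 6, 6, 6, 7, 8, 9, 9, 10, 11, 13, 14, 14, 16]
print(coverage_report(collected))
```

In this made-up collection we paid for 20 images, yet some categories are still missing while others were collected several times: too much data and too little information at once.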
The process in which we collect more data to gain more information is a diminishing return process: the more information we have already collected, the more expensive it becomes to get new information. Eventually new information becomes too expensive to collect, which halts our accuracy improvement / error reduction.
Data Holes & Diminishing Return Example
Let us take the 2×2 image world as an example to demonstrate the diminishing return behavior. Our process in this example has the following properties:
- We collect a single image every day.
- Each image has a 1/16 chance of landing in any given category, since we have 16 equally likely categories.
- Our overall information content is 4 bits.
In an ideal world we would collect data for 16 days, each day adding 1/4 bit of information, reaching 100% signal coverage with optimal data collection efficiency.
How much information do we add on the first day? 1/4 bit, since we have no data, any image will be new and will teach our model something valuable.
But see what happens on day 2:
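There is a 1/16 chance we collect no new information (the same image as the day before adds nothing) and a 15/16 chance we add a new 1/4 bit of information, so the average for day 2 is:
(1/16)∗0 + (15/16)∗(1/4) = 15/64 ≈ 0.23 bits
The average added new information keeps decreasing every day, since on day n the new image must differ from all n−1 images collected before it:
(15/16)^(n−1)∗(1/4) bits
And the overall average information collected after n days is:
4∗(1 − (15/16)^n) bits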
On day 2 we collected less information on average compared to day 1, even though we invested the same effort in collecting and labeling data. As the days move forward, our chances of hitting an “unknown” category keep dropping.
Every day we invest the same effort in getting fresh and updated data, yet the average return (information) keeps going down, so information gets more expensive as we progress. This is the training data diminishing return process, and it demonstrates why going from 99% to 99.9% model accuracy can cost 10 times more.
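The day-by-day averages can be computed directly. This sketch follows the exercise’s setup (16 equally likely categories, one random image per day, 1/4 bit per newly seen category); the function names are chosen for this illustration. The image collected on a given day is new only if none of the earlier days hit the same category, which happens with probability (15/16) raised to the number of earlier days.

```python
N_CATEGORIES = 16
BITS_PER_NEW_CATEGORY = 4 / N_CATEGORIES  # 1/4 bit, as in the exercise


def expected_new_bits(day):
    """Expected information added on a given day (day 1 is the first day)."""
    p_unseen = ((N_CATEGORIES - 1) / N_CATEGORIES) ** (day - 1)
    return BITS_PER_NEW_CATEGORY * p_unseen


def expected_total_bits(days):
    """Expected information collected after the given number of days."""
    return sum(expected_new_bits(d) for d in range(1, days + 1))


for day in (1, 2, 16, 32, 64):
    print(f"day {day:2d}: +{expected_new_bits(day):.3f} bits, "
          f"total {expected_total_bits(day):.2f} of 4 bits")
```

The daily effort is constant while the expected gain shrinks geometrically, so the expected total approaches 4 bits without ever being guaranteed to reach it.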
Rule #8: Cost is driven by data, Value is driven by information
The above exercise assumes we do not do anything smart during our collection process and just collect data at random. That is a naïve assumption. In most cases we have more data available than we can handle, which raises the question: which data should we keep, given that we only have the budget to store, label and train on a fraction of the overall data? If we build a feature coverage map, we can use it to score our data before we invest the resources.
The diminishing return process does not guarantee results, only a probability of reaching 100% coverage after N days. Here is a numeric example to demonstrate the process.
Let’s assume we collected all possible scenarios but one, just like our previous category 4 hole example. How many days of data collection & labeling will it take us to close the hole and reach 100% coverage?
The answer is that if we are unlucky, it may take forever, just like you can flip a coin many times and theoretically never get heads, so we must talk about probability, or chances. The chance of covering our data hole within n days of data collection is:
P(n) = 1 − (15/16)^n
So, with our single last missing piece of information, we need to collect data for roughly 70 days (72, to be exact) to be 99% confident that we have covered our last single image hole. We invest about 70X more for the last piece of information compared to the first one:
- Our first 0.25 bits of information cost 1 day of work.
- Our last 0.25 bits of information cost about 70 days of work, and even then we are only 99% confident: there is still a small chance we have a hole after 70 days of work.
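The hole-closing odds can be checked numerically. This sketch assumes, as in the exercise, one random image per day and a 1/16 chance per day of hitting the single missing category; the function names are chosen for this illustration. With these numbers, 70 days gives just under 99%, and the 99% mark is crossed on day 72.

```python
import math

P_MISS = 15 / 16  # chance that one random image is NOT the missing category


def chance_hole_closed(days):
    """Probability that the missing category shows up at least once in `days` daily samples."""
    return 1 - P_MISS ** days


def days_needed(confidence):
    """Smallest number of days for which chance_hole_closed(days) >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(P_MISS))


for days in (1, 16, 70):
    print(f"{days:2d} days -> {chance_hole_closed(days):.1%}")
print("days for 99%:", days_needed(0.99))
```

Asking for 99.9% instead pushes the requirement past 100 days, which is the long tail of cost in numbers.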
Feature Coverage Map
We need to be able to score the samples of our data with their information capacity before we invest in communicating, storing, labeling, and processing.
We do that by creating a feature vector for each sample. A feature vector is a “summary” of the data example, calculated automatically in an unsupervised way (no need for human touch).
Usually, a feature vector will be a list of numbers. We may or may not know each value’s meaning (usually we don’t, just like with neural networks). We can think of our cat feature vector as the cat feature list we defined before (pointed ears, 4 legs, a tail, 2 eyes).
Using our data feature vectors we can create the information map which is a kind of coverage map that will help us estimate the information value of a new sample before we invest most of the resources to learn from it.
You can imagine this map as a data filter.
The most basic feature vector you could generate is the output of the final layer your neural network produces upon inference. Before any labeled data is available, self-supervision or headless inference on a pretrained network can be used for the initial feature vector.
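One simple way to turn feature vectors into a coverage map is to score each new sample by its distance to the nearest sample already collected. This is only a sketch of that idea, not a prescribed method: the Euclidean distance, the threshold value, the three-number vectors and the function names are all assumptions made for illustration.

```python
import math


def novelty_score(feature_vector, known_vectors):
    """Distance to the nearest known feature vector; larger means less covered."""
    if not known_vectors:
        return math.inf  # nothing collected yet, so everything is new
    return min(math.dist(feature_vector, known) for known in known_vectors)


def should_keep(feature_vector, known_vectors, threshold=0.5):
    """Keep a sample only if it is far enough from everything already collected."""
    return novelty_score(feature_vector, known_vectors) > threshold


# Hypothetical 3-number feature vectors, for illustration only.
coverage_map = [(0.9, 0.1, 0.2), (0.8, 0.2, 0.1)]
near_duplicate = (0.85, 0.15, 0.15)
edge_case = (0.1, 0.9, 0.7)

print(should_keep(near_duplicate, coverage_map))
print(should_keep(edge_case, coverage_map))
```

The near duplicate is filtered out before any storage or labeling cost is paid, while the distant sample passes through as a candidate edge case.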
Since information gets more expensive as our solution evolves, lowering the costs becomes vital to sustaining a viable business. Fortunately, the data flywheel effect can also be used to optimize the information acquisition costs.
What are the costs that are associated with each example in the process?
- Collecting the data
- Running the detector model (often called model inference)
- Communicating the data
- Storing the results
The first step in optimizing the costs is creating a funnel, where data continues to the next phase only if it is valuable (adds information). The feature map we have created now comes in handy, as it can be used to filter the relevant data:
At every step our existing information is used to save some work on the data, examples:
- Discard corrupted examples upon collection
- Send to cloud only anomalies
- Store only low confidence examples and anomalies
- Select only high information examples for labeling
- Pre-label examples and let humans only approve
- Shorten training time by removing redundant data examples
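The funnel idea can be sketched as an ordered list of filters, where a sample moves to the next stage only if the current stage keeps it. Everything specific below is hypothetical: the sample fields, the thresholds and the stage names are invented for this illustration, and only three of the steps listed above are modeled.

```python
def run_funnel(samples, stages):
    """Pass samples through ordered (name, keep) stages; return survivors and drop counts."""
    dropped = {}
    remaining = list(samples)
    for name, keep in stages:
        survivors = [s for s in remaining if keep(s)]
        dropped[name] = len(remaining) - len(survivors)
        remaining = survivors
    return remaining, dropped


# Hypothetical samples: each carries a corruption flag, the model's confidence,
# and a novelty score taken from the coverage map.
samples = [
    {"id": 1, "corrupted": True,  "confidence": 0.99, "novelty": 0.0},
    {"id": 2, "corrupted": False, "confidence": 0.98, "novelty": 0.1},
    {"id": 3, "corrupted": False, "confidence": 0.55, "novelty": 0.2},
    {"id": 4, "corrupted": False, "confidence": 0.40, "novelty": 0.9},
]

# Hypothetical thresholds; real values would be tuned per product.
stages = [
    ("collect: discard corrupted", lambda s: not s["corrupted"]),
    ("store: low confidence or anomaly", lambda s: s["confidence"] < 0.8 or s["novelty"] > 0.5),
    ("label: high information only", lambda s: s["novelty"] > 0.5),
]

to_label, dropped = run_funnel(samples, stages)
print([s["id"] for s in to_label], dropped)
```

Ordering the cheapest checks first means the expensive steps, such as human labeling, only ever see the few samples that survived everything before them.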
Let us summarize the assumptions we carry with us going forward:
- The answer our AI gives is based on patterns it is trained to detect.
- Neural network models are essentially information containers/compressors.
- The ideal dataset is a complete representation of the signal’s information, but it does not exist until we gradually build an approximation of it, in production, on users’ data.
- Data volume matters less than information coverage (data variance): a single edge case adds much more information than many common examples.
- The fastest way to get good information coverage is to use our end users’ data; products based on internally collected / lab data alone are very slow to converge, if they converge at all.
- AI is not a project with completed milestones: you build a learning system fueled by data, and you then try to create a data flywheel effect.
- You will run many iterations on the data flywheel template through the entire life cycle of the product and have a lot of room for cost, accuracy, and speed improvements.
- The AI development process is organizational; it requires collaboration between Business, Operations and Development.
The main goal of the data flywheel effect is to speed up the information communication between humans and machines, surfacing new edge cases as fast as possible.
The purpose of information is communication, and communication requires a language or protocol. It is time to go deeper into the programming language used to encode this information.
Up until today the only people who communicated with machines were programmers, and they communicated the machine instructions using code, that is, software.
The AI language is a new and disruptive way to communicate with or program machines, and it is the true power kicking off the era of machine teaching, where human experts teach machines to create the next digital automation wave without the need to write code.
What is this new programming language being used in this new classroom to transfer information and knowledge from humans into machines?
Time to dive into the machine teaching process and the programming language that enables it – data annotations.