What can you tell about the below picture?
Data is a fact, collected and stored for the purpose of future access, and it remains unchanged.
Here is an example of ancient data:
Let’s go over the properties of data:
- Fact – Data is objective; you cannot disagree with data.
- Collected – The existence of data means someone bothered to collect it. While data is objective fact, the collection process is biased.
- Stored – Data has some storage medium holding it. As you can see above, stone can be a data storage medium as well.
- Future access – The whole purpose of data is to log a property at the present time for use at a future time.
- Unchanged – Data is fact and is expected to have integrity: all future accesses will yield the exact same data.
Most of the data around us today is digital (binary) data: a collection of numbers (bits, 0’s and 1’s) stored on digital media like hard drives or USB sticks. Data is not a random collection of bits. The bits we collect and store have patterns and meaning in them. This is what a completely pattern-free, meaningless, random image looks like (AKA white noise):
Since we invest time and effort in collecting data, it will usually have meaning and future value (and hence patterns):
- The photos we took on our last vacation
- Our heart rate sample
- Our credit card monthly billing list
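One rough way to see the difference between white noise and patterned data is to ask how well each compresses. The sketch below is only an illustration under assumptions: it uses a general-purpose compressor (zlib) as a stand-in for "amount of pattern", and the repetitive heart rate log is made up.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more pattern)."""
    return len(zlib.compress(data, 9)) / len(data)

# Pattern-free bytes, the byte-level analogue of a white noise image.
noise = os.urandom(100_000)

# A made-up, highly repetitive log standing in for data we bothered to collect.
patterned = b"heart rate: 72 bpm\n" * 5_000

noise_ratio = compression_ratio(noise)
patterned_ratio = compression_ratio(patterned)
print(f"noise: {noise_ratio:.3f}, patterned: {patterned_ratio:.3f}")
```

The random bytes barely shrink at all, while the repetitive log collapses to a small fraction of its size. Real data usually sits somewhere between these two extremes.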
Once the data contains patterns and we give these patterns meaning, it becomes information. Can you see the patterns and identify the information relevant to cats in the following image?
Information is a subjective, meaningful interpretation of the data.
I find that the most intuitive way to define information is as an answer to a question. While data is a “fact”, information is more of an “opinion” regarding the data. It is what we define as the important characteristics in the data, and the meaning we choose to give to these characteristics given our interest (context).
More technically speaking, information is any pattern (signal) in our data that we subjectively assign a meaning to:
Given the above answer is information, here are possible questions it might answer:
- What is in the picture?
- Where is the cat?
- What is the most important thing in this image?
Notice that we control both the question and the answer regarding the image. The questions we choose to ask and the answers we choose for them are the primary source of our bias.
So, when we start putting labels, structure, rules, and relations into our data, we are making it both more informative and biased towards our end goal, whatever that goal may be:
- The photos we took on our last vacation, labeled with their location (GPS).
- A table of our heart rate per activity (running, sleeping).
- Our monthly credit card bill, categorized by expense type.
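To make the heart rate case concrete, here is a minimal sketch of turning raw samples into a small piece of information by attaching labels and grouping on them. All values and the `activity_at` mapping are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Raw data: (timestamp in seconds, beats per minute). Values are made up.
raw_samples = [(0, 62), (60, 64), (120, 148), (180, 152), (240, 58)]

# Labels we chose to attach. This choice is where our bias enters:
# we decided that "activity" is the context that matters.
activity_at = {
    0: "sleeping",
    60: "sleeping",
    120: "running",
    180: "running",
    240: "sleeping",
}

# Group the raw samples by their label.
by_activity = defaultdict(list)
for timestamp, bpm in raw_samples:
    by_activity[activity_at[timestamp]].append(bpm)

# The structured result answers a question: "what is my heart rate per activity?"
average_bpm = {activity: mean(values) for activity, values in by_activity.items()}
print(average_bpm)
```

The raw samples did not change. What changed is that we picked a question and organized the data around it.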
The moment we start structuring the data in a context, we are essentially defining the scope of our information. We start investing time and resources in transforming the pure data samples into pieces of information: a collection that is relevant to the questions we would like to answer using the data.
While data is objective (a fact), information is biased (an opinion): biased towards the questions we choose to ask and answer using the data. Having a car in the image adds little information in the context of the “Is it a cat?” question.
Moreover, once we define our question, we define the information space we cover, and our definition’s bias will flow back into our data collection process. If we are interested in cats, then we should probably collect all types and colors of cats, while we can ignore stones, dogs, or birds.
One may quickly relate the relationship between data and information to the laws of thermodynamics, since reducing the entropy (randomness, chaos) of the raw data requires an investment of energy. Going deeper here will help you understand cross-entropy losses as well as sample error distributions.
The above definition of information directly suggests that neural networks are essentially data compression algorithms, where each data sample is compressed into its labels. Shannon’s source coding theorem is useful here in deconstructing how annotations (labels) essentially compress the data, treating it as a highly redundant alphabet.
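For readers who want to see the quantities mentioned here, the sketch below computes Shannon entropy and cross-entropy for small discrete distributions. The four-class setup and the `prediction` values are assumptions chosen for illustration, and `cross_entropy` assumes the prediction is never zero where the true distribution is positive.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy, in bits, between a true distribution p and a prediction q.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty over 4 classes
label = [1.0, 0.0, 0.0, 0.0]        # a one-hot label: no uncertainty left
prediction = [0.7, 0.1, 0.1, 0.1]   # hypothetical model output

print(entropy(uniform))
print(entropy(label))
print(cross_entropy(label, prediction))
```

The uniform distribution carries the most entropy, the one-hot label carries none, and the cross-entropy measures how far the prediction still is from the label. This is the quantity a cross-entropy loss minimizes.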
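One way to get a feel for the compression view is simple arithmetic: compare the number of bits in a raw sample with the number of bits needed to name its label. The numbers below are assumptions, not taken from any specific model: a 224x224 RGB image with 8 bits per channel, and 1,000 equally likely classes.

```python
import math

# Assumed image format: 224 x 224 pixels, 3 color channels, 8 bits per channel.
image_bits = 224 * 224 * 3 * 8

# Assumed label space: one of 1,000 equally likely classes.
label_bits = math.log2(1000)

print(f"image: {image_bits} bits")
print(f"label: {label_bits:.2f} bits")
print(f"ratio: about {image_bits / label_bits:,.0f} to 1")
```

Under these assumptions, more than a million bits of pixels are reduced to roughly ten bits of label. Everything the label does not capture is treated as redundant for the question being asked.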
Knowledge is a skill or theoretical understanding of a subject. If we look at information as a sentence (answering a question), then knowledge would probably be a book: many related sentences, in ordered paragraphs, structured in pages and chapters.
While data is physical evidence and information is a subjective statement regarding this evidence, knowledge is a philosophical concept. The philosophical study of knowledge is called epistemology. Knowledge is so abstract that we don’t have the ability to store it (we store only data). The AI field that tries to define knowledge in a structured format that can be stored is called knowledge representation.
According to Wikipedia:
“Knowledge-representation is a field of artificial intelligence that focuses on designing computer representations that capture information about the world that can be used to solve complex problems.”
So, knowledge is capturing information to solve complex problems. Let’s demonstrate this using a simple example: our knowledge as cat experts. Below is a cat image. What is the knowledge behind your decision that this is a cat?
You have seen cats in the past and generalized them as things which match some of these patterns (and probably more):
Notice how conveniently I used a “cat’s nose and mouth” to describe the cat’s features. We all know that cats have a very distinct nose and mouth, yet it is very hard to explain (represent) these patterns in words. The structure of a cat’s nose is unique enough for most of us to agree that the following is cat makeup, mostly because of the nose marks, whiskers, and dots:
Our brain stores the knowledge of what a cat is in a very powerful way, so powerful that we need only a fraction of the information (signal) to infer that we are indeed seeing a cat. What exactly is the knowledge representation we store in our brain that gives us these remarkable perception skills? As of today, it is unknown to science.
Neural networks are very narrow, which means they have a very weak ability to capture knowledge representations. While they are good at automatically capturing features, they are weak at storing the dependencies, relations, and meaning these features carry.
What makes it so easy for a child to agree that both of these images are cats?
Our brain stores the “what is a cat” answer in a special format, one that is beyond our ability to understand as of today. We will refer to this unknown biological representation as the entity descriptor. In these terms, we say that our brain has generalized the characteristics of a cat into cat entity descriptors so well that we need only very small pieces of cat patterns in a drawing (or features, as data scientists call them) to match the above two entities as the same thing: a cat.
While it looks trivial, this general inference capability found in human children is beyond the capabilities of the strongest AI out there.
What would have happened if we were learning like a neural network? If a child were learning like a neural network, then the simple lesson of hot ➜ danger would need to be taught many times: with fire, matches, gas, hot oil, boiling water, an oven, a campfire, and so on. Luckily for us, nature’s learning is much better than neural networks.
This is a critical fact and the source of today’s AI narrowness. It means neural networks are limited to the data and the labels they have seen in the past and need all the different cases of hot ➜ danger examples to learn the danger of heat. Today’s AI will find it hard to generalize that heat is the source of danger and will need to learn the list of all hot things in the universe.
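The gap between memorizing examples and generalizing a cause can be caricatured in a few lines. This is a deliberately simplified toy, not a description of how real networks behave internally: the `memorizing_model`, the `generalizing_model`, the temperatures, and the 60 °C threshold are all made up for illustration.

```python
# Hypothetical things that were labeled "dangerous" during training.
seen_dangerous = {"fire", "matches", "boiling water"}

def memorizing_model(thing: str) -> bool:
    """Flags danger only for things it has literally seen labeled as dangerous."""
    return thing in seen_dangerous

# Assumed temperatures in degrees Celsius, for illustration only.
temperature_c = {"fire": 600, "boiling water": 100, "hot oil": 180, "ice cube": 0}

def generalizing_model(thing: str, threshold_c: float = 60.0) -> bool:
    """Flags danger from the underlying cause (heat) rather than from the name."""
    return temperature_c[thing] >= threshold_c

# "hot oil" was never in the training examples.
print(memorizing_model("hot oil"))    # the memorizer misses it
print(generalizing_model("hot oil"))  # the heat-based rule catches it
```

The first model needs another labeled example for every new hot thing. The second one only needed to capture the idea of heat once.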
Collecting all these examples and cases is what makes the AI development process so slow, expensive, and error-prone. Imagine how many special edge cases a self-driving car must collect. How many of the self-driving car projects are collecting data containing elephants on the road? Probably none.
In the AI development process, the basic thinking loop we have in our brain is translated into a data collection, labeling, and modeling loop. With each iteration we collect new data, label the new and interesting information, and update our model with this new information. While this thinking loop can run in our brain a few times per second, AI solutions tend to spend several months between iterations; by that measure, our brain learns many orders of magnitude faster.
As we develop our AI application, we essentially run many of these data loops (thousands in the case of self-driving cars), where each loop follows two simple rules:
- More training data ➜ More accurate AI
- Faster iterations ➜ Faster learning
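The collect, label, and model loop can be sketched as a small simulation. Everything here is an assumption made for illustration: a single made-up numeric feature per sample, a "model" that is just the midpoint between the two class means, and the hypothetical helpers `collect`, `train`, and `accuracy`.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def collect(n):
    """Stand-in for collecting and labeling n samples (alternating classes)."""
    samples = []
    for i in range(n):
        is_cat = i % 2 == 0
        # Made-up feature: cats cluster around 0.0, everything else around 3.0.
        feature = rng.gauss(0.0 if is_cat else 3.0, 1.0)
        samples.append((feature, is_cat))
    return samples

def train(dataset):
    """The 'model' is a threshold: the midpoint between the two class means."""
    cats = [x for x, is_cat in dataset if is_cat]
    others = [x for x, is_cat in dataset if not is_cat]
    return (sum(cats) / len(cats) + sum(others) / len(others)) / 2

def accuracy(threshold, dataset):
    """Fraction of samples where 'feature below threshold' matches the cat label."""
    hits = sum((x < threshold) == is_cat for x, is_cat in dataset)
    return hits / len(dataset)

test_set = collect(2000)
dataset, history = [], []
for iteration in range(5):
    dataset += collect(20)   # collect and label new data
    model = train(dataset)   # update the model with everything seen so far
    history.append((len(dataset), accuracy(model, test_set)))

for size, acc in history:
    print(f"{size} training samples -> accuracy {acc:.3f}")
```

In a toy this simple the accuracy saturates quickly and does not have to rise on every single iteration. The point is the shape of the loop: each pass adds data, and the cost of one pass determines how fast we can learn.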
Rule #3: Faster iterations ➜ Faster learning
If we don’t update our models with fresh data representing the updated state of the world (environment), our model’s decisions will degrade over time. This phenomenon is called model drift: a change in the data distribution in the world that was not modeled in our system during its training. A major recent example of such a distribution change is that many face recognition apps had to deal with masked faces due to covid, and often failed to do so properly. Once covid is over, another change in the data distribution is expected.
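A very simple way to notice this kind of change is to compare a feature's statistics at training time with what the model sees in production. The sketch below is one naive option under assumptions: the feature (fraction of the face that is visible), its values, and the alert threshold of 3.0 are all made up, and `drift_score` is a hypothetical helper rather than a standard metric.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """How many training standard deviations the live mean has moved."""
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)

# Made-up feature: fraction of the face visible to a face recognition model.
train_visible = [0.95, 0.97, 0.93, 0.96, 0.94]  # collected before masks
live_visible = [0.55, 0.60, 0.52, 0.58, 0.57]   # masked faces in production

score = drift_score(train_visible, live_visible)
needs_retraining = score > 3.0  # arbitrary alert threshold, an assumption

print(f"drift score: {score:.1f}, retrain: {needs_retraining}")
```

Production systems typically use more robust comparisons over many features, but the idea is the same: when the live data no longer looks like the training data, it is time for another iteration of the loop.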
From Code-Centric to Data-Centric
AI skills are developed by showing the bot examples. The skill is learned automatically during the training process from human-made examples (i.e., supervised learning). We just feed the machine a bunch of images containing cats, along with their labels, and the code to recognize cats is produced automatically, learned from the given examples.
This is the true revolution: no one will code cat detectors in our app. We will just collect, label, and train models, and get the code to recognize cats automatically. The even better part is that this code will recognize cats better than any code ever developed by humans.
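To illustrate what "the code is learned from examples" means, here is a minimal sketch using a classic perceptron. It is a toy stand-in for a real image model: the three hand-made features (has whiskers, has pointy ears, barks) and the five labeled examples are assumptions made for illustration.

```python
# Hypothetical hand-made features: (has_whiskers, pointy_ears, barks).
# Label 1 means "cat", label 0 means "not a cat".
examples = [
    ((1, 1, 0), 1),
    ((1, 0, 0), 1),
    ((1, 0, 1), 0),
    ((0, 1, 1), 0),
    ((0, 0, 0), 0),
]

def predict(weights, bias, features):
    """Returns 1 (cat) if the weighted sum of the features is positive."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train(examples, epochs=10):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights, bias = [0, 0, 0], 0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + error * x for w, x in zip(weights, features)]
            bias += error
    return weights, bias

weights, bias = train(examples)
print(weights, bias)
print([predict(weights, bias, features) for features, _ in examples])
```

Nobody wrote a rule saying that whiskers suggest a cat or that barking rules one out. The weights that encode those rules came out of the labeled examples, which is why changing the examples changes the resulting "code".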
To emphasize the above point let us imagine two teams competing to build cat detectors:
- Team A – A group of algorithms experts from leading institutions in the world. This group is working with Google using its latest algorithm and is well-funded with USD 10 million.
- Team B – A group of high school kids who have passionately collected and labeled tens of millions of cat images over the last 5 years.
Who will produce a better cat detector within a year? The high school team has a much better chance.
Access to data, labels (cat knowledge), and computation power is the differentiator in AI; the algorithms are commodities in most cases.
We are going to make many learning (training) iterations, where each iteration has these basic steps:
Most of the AI buzz in the past few years has been around the model training phase:
- It requires a lot of special hardware and cloud resources.
- It is done by machine learning experts.
- It requires software frameworks like TensorFlow (Google) or PyTorch (Facebook).
- It involves a lot of experimentation where each experiment takes days to weeks.
We will pay very little attention to this process due to the following:
- It is the less creative part of AI skill development for 99.9% of the applications out there.
- We expect high-level automation of this process, referred to as AutoML.
- It is changing rapidly; yesterday’s state-of-the-art is old news today.
For the rest of this book, we will assume that the best possible model was trained using our data, ignoring the training process.