We define 3 types of data:
- Structured data – think of very defined tables
- Semi-structured data – like configuration, XML or JSON files
- Unstructured data – human consumable data such as images, videos, or documents
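As a rough illustration, here is a minimal sketch of how the three types might look in code. The patient table, the configuration, and the image bytes are all made-up sample values, not data from any real system.

```python
import csv
import io
import json

# Structured: a table with a fixed schema; every row has the same columns.
# (Hypothetical sample table.)
table_text = "patient_id,age,diagnosis\n1,34,flu\n2,51,asthma\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: self-describing keys, but fields may nest or be missing.
# (Hypothetical sample configuration.)
config = json.loads(
    '{"model": "cat-detector", "classes": ["cat"], "train": {"epochs": 3}}'
)

# Unstructured: raw bytes meant for human consumption; there are no fields
# a machine can query without first interpreting the content.
image_bytes = bytes([137, 80, 78, 71])  # first four bytes of the PNG signature

print(rows[1]["diagnosis"])       # query by column name
print(config["train"]["epochs"])  # query by key path
print(len(image_bytes))           # size is about all we can ask directly
```

The first two can be queried by name; the third has to be interpreted before any question can be asked of it.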
We will dive into the details of these definitions later, but it is already evident that the structure of the data is critical for AI. We structure the data to build a skill or answer a set of questions.
This data structuring process is often referred to as “unstructured data management”. However, the name is misleading, because what you are actually doing is structuring information to represent knowledge or a skill. We call this structure of different properties and their relations a knowledge base.
The Knowledge Base
Developing AI is essentially collecting different information pieces (data pieces with some meaningful interpretation) into a Knowledge Base (KB). The KB stores the different information pieces, their correlations, and their relationships. Creating a digital (machine and AI-friendly) KB is very hard. Some might argue it’s even impossible, since it’s very hard to define many of the things around us in a formal way that can be stored in modern databases.
Can you define a cat using a list of yes/no questions?
Try doing this exercise in your head. You’ll get an intuition for the challenges of KBs and for why these incomplete systems are hard to develop and maintain.
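The exercise can also be tried in code. The following is only a sketch: the four questions and the three example animals are hypothetical, chosen to show how quickly a yes/no definition breaks down.

```python
# Hypothetical yes/no questions that attempt to define "cat".
CAT_QUESTIONS = ["has_fur", "has_four_legs", "has_whiskers", "has_tail"]


def looks_like_cat(answers: dict) -> bool:
    """Return True only when every question is answered 'yes'."""
    return all(answers.get(question, False) for question in CAT_QUESTIONS)


house_cat = {"has_fur": True, "has_four_legs": True, "has_whiskers": True, "has_tail": True}
small_dog = {"has_fur": True, "has_four_legs": True, "has_whiskers": True, "has_tail": True}
tailless_cat = {"has_fur": True, "has_four_legs": True, "has_whiskers": True, "has_tail": False}

print(looks_like_cat(house_cat))     # True
print(looks_like_cat(small_dog))     # True: a false positive
print(looks_like_cat(tailless_cat))  # False: a false negative
```

Each new question fixes one case and tends to break another, which is the maintenance problem in miniature.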
You might think KBs are a new concept, but we’ve had KBs for thousands of years in the form of books and libraries. We just used human language to represent them and paper to store them.
Let us look at a very simplified flow of information and its aggregation into the medical treatment Knowledge Base:
We’ve been collecting medical information for thousands of years and have already established a medical information flow, constantly updating the human health knowledge base:
When you talk to your family doctor, the doctor represents all human medical knowledge regarding your questions. Naturally, a single human cannot hold all the relevant information. Therefore, after a few questions and exams (which are other forms of questions), your family doctor will either give you the answer or, depending on your answers and results, refer you to the next level of expertise: an expert doctor with more specific knowledge of your condition.
Expert doctors might in turn consult the literature and potentially other experts, sometimes going all the way to a global specialist.
Even for medical information, a domain that is a priority for us, we don’t have a single knowledge base that we maintain. Rather, the knowledge is distributed all around us. Whether a single KB for everything is even possible is an open question.
The above information flow between different knowledge levels exists in every domain. Every human topic has an information flow, knowledge aggregation, storage, and maintenance. AI is about capturing this information in a way that allows us to teach machines using this knowledge.
As of today, this knowledge is accessible only to humans.
AI development is about capturing the complex information we have around us into a structured format in the context of cognitive tasks we wish to automate.
Rule #4: AI model ➜ information container
It is now time to gain a better sense of the information before we decide on how to structure and manage it.
The AI Signal, Noise, and Channel
Once we understand that AI is information captured from our human knowledge base into machines, then we can conclude the next logical piece. Every exchange of information is considered a communication process. Therefore, AI learning is considered a communication process – a process in which several parties exchange information with each other.
Every communication process is defined by three elements:
- Signal: A measurable change that carries some meaning (information). The meaning is what separates the signal from the noise, and it is subjectively defined by the signal observer.
- Noise: Random, unwanted modifications to the signal. Modifications can happen during capture, storage, transmission, processing, or conversion of the signal.
- Channel: The medium over which the information is sent from the sender to the receiver.
Let us map these onto our cat detection task:
- Signal: The change in the image pixels that represents a cat, as marked by the cat annotation.
- Noise: The non-cat parts of the image
- Channel: The image file
Who are the parties in the communication process?
- The sender: The human, marking the cat and sending this message (labeled cat image) to the AI model.
- The receiver: The model which decodes the message and uses it to “learn” a bit more about the cat’s properties.
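The cat detection mapping above can be sketched in a few lines. This is a toy illustration under stated assumptions: the 4x6 grayscale grid, the pixel values, and the `cat_box` annotation format are all hypothetical.

```python
# The channel: a hypothetical 4x6 grayscale "image" stored as nested lists.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0, 0],
    [0, 7, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# The message added by the sender (the human labeler): a bounding box
# around the cat, given in inclusive pixel coordinates (assumed format).
cat_box = {"top": 1, "left": 1, "bottom": 2, "right": 2}


def split_signal_noise(image, box):
    """Return (signal, noise): pixels inside vs. outside the annotation."""
    signal, noise = [], []
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            inside = (
                box["top"] <= r <= box["bottom"]
                and box["left"] <= c <= box["right"]
            )
            (signal if inside else noise).append(pixel)
    return signal, noise


signal, noise = split_signal_noise(image, cat_box)
print(len(signal), len(noise))  # 4 20
```

Without the box, the receiver has no way to tell which of the 24 pixels matter; the annotation is what turns pixels into a signal.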
Rule #5: AI development is a communication process
Labeling As Information Transmission
Labeling is the process in which a human data labeler encodes information on top of the data in the form of annotations. Later on, during the training process, this information is automatically decoded into the neural network. The human encodes the message using their existing knowledge about the world, summarizing (compressing) this knowledge into data annotations.
The knowledge and experience that humans acquire over years of learning (through formal or informal education) is transferred to the AI bot over the labeled data examples. Examples without labels are simply meaningless data points. Once data is labeled, it turns into context-based information units.
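To make the encode/decode idea concrete, here is a minimal sketch that swaps the neural network for a much simpler nearest-centroid classifier. The feature vectors (imagine `[ear_pointiness, snout_length]`) and their labels are hypothetical values invented for illustration.

```python
from collections import defaultdict

# Hypothetical feature vectors with human-provided labels. The label is the
# encoded message; the features alone carry no class information.
labeled = [
    ([0.9, 0.2], "cat"),
    ([0.8, 0.3], "cat"),
    ([0.3, 0.9], "dog"),
    ([0.2, 0.8], "dog"),
]


def train(examples):
    """'Decode' the labels into a model: one mean feature vector per class."""
    sums, counts = {}, defaultdict(int)
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))


model = train(labeled)
print(predict(model, [0.85, 0.25]))  # cat
```

Strip the labels from `labeled` and `train` has nothing to group by: the same numbers become meaningless data points.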
In the process of machine learning development, the holders of this knowledge are often referred to as “domain experts”. Some domain expertise, like identifying a cat, is common, while other expertise, like identifying sick orange leaves or breast cancer cells, is acquired by farmers or doctors throughout a lifetime of professional experience. These are the experts from whom our expert system learns.
The above is the core fulfillment of the fundamental business promise of AI.
Domain experts can transfer their skills to bots, creating labor automation across verticals and markets. Expert knowledge transfer is a communication process over labeled data where each example carries a bit more information regarding the learned skill.
To summarize, neural networks are a great tool for transferring information from humans to machines, yet do not let the “buzz” make you think anything intelligent happens inside a neural network. It’s more like a “monkey see, monkey do” tool. Intelligence has been, and still is, a biological quality, found ONLY in living things.
I still find myself surprised that after close to a decade with these technologies, the core process of AI, domain knowledge transfer from human experts to machines through labels, is ignored by an entire industry that spends most of its time today on cooler algorithms, faster processors, or labeling-free AI. There is no AI free of human labeling, nor is there expected to be anytime soon.
Shannon’s noisy channel model is a useful template for deeper analysis of the labeling process as a communication process. Here is a chart that I find useful in understanding this:
Think of your model as a decoder that is trying to reconstruct a lossy compressed signal encoded by humans as labels. If that sounds correct, then a world of research opportunities opens up, from sampling method analysis to the mapping function and the definition of the source and destination spaces.
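One way to start experimenting with this view is a toy simulation. The sketch below makes strong simplifying assumptions that are not from the text: labels are binary, each labeler flips the true label independently with a fixed probability (a binary symmetric channel), and the decoder is a plain majority vote rather than a trained model. The capacity formula C = 1 - H(p) is the standard result for that channel.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """H(p) in bits; defined as 0 at p = 0 and p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity of a binary symmetric channel: C = 1 - H(p) bits per use."""
    return 1.0 - binary_entropy(flip_prob)


def noisy_label(true_label: int, flip_prob: float, rng: random.Random) -> int:
    """Simulate a labeler who flips the true label with probability flip_prob."""
    return 1 - true_label if rng.random() < flip_prob else true_label


def majority_vote(labels) -> int:
    """Decode by taking the most common label among an odd number of labelers."""
    return int(sum(labels) > len(labels) / 2)


rng = random.Random(0)  # fixed seed so the run is repeatable
truth = [rng.randint(0, 1) for _ in range(1000)]
single = [noisy_label(t, 0.2, rng) for t in truth]
voted = [majority_vote([noisy_label(t, 0.2, rng) for _ in range(5)]) for t in truth]

single_acc = sum(s == t for s, t in zip(single, truth)) / len(truth)
voted_acc = sum(v == t for v, t in zip(voted, truth)) / len(truth)
print(bsc_capacity(0.2), single_acc, voted_acc)
```

Repeating the same message through several labelers is a simple repetition code: it buys accuracy at the cost of more labeling effort per example.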
This book is based on these ideas and their meanings.