Chapter 8

Machine Teaching

It is very popular these days to talk about machine learning while ignoring the teachers in this learning process. It is time to discuss machine learning from its less common perspective: as a teaching process.

In this chapter, we will dive deeper into the programming language used in the machine teaching process: the data annotation language that communicates information from humans (the teachers) to machines (the students). This language enables automation without coding software and is igniting a no-code automation wave.

Machines will always learn from humans the meaning of patterns found in data:

  • Unsupervised learning: the teacher is the data scientist, algorithm developer, or software developer
  • Supervised learning: the teacher is the data labeler


The current AI wave was sparked by supervised learning. This approach, considered analytically and academically inferior, has started showing better results than unsupervised algorithms in many cognitive tasks such as computer vision and translation.

As supervised learning becomes faster and better than unsupervised learning, the entire industry has started scaling towards it. For the first time, machine teachers are not coding rules and algorithms in programming languages but feeding the machine with examples and letting it figure out the rules by itself.

Real-world applications tend to contain a mixture of both approaches. Since there are many other books on the traditional aspects of programming and machine learning, this book focuses on supervised learning, the less-discussed approach that ignited the current AI wave and enables the creation of cognitive and knowledge automation simply from data examples and domain experts. This era has already sparked a new data labeling industry, where millions of workers are employed in labeling, cleaning, and organizing data. Those workers are the machine teachers, and their numbers are expected to keep growing significantly in the coming years.

The challenges, requirements, and pains this new industry faces are all new, and progress in this space sets the pace for global AI progress as a whole. Most AI teams today are blocked by a lack of fresh data and labels; they are often referred to as “data hungry” teams, and they are constantly starved.

ML Point

Synthetic data has become a popular topic in the last year, yet it does not change anything fundamentally. In the synthetic data learning process, the teacher is the human who sets the rules and patterns of the synthetic generator. This person does not label data directly but sets the rules that create labeled data. The fundamental principle is kept: the human is the teacher.

Teaching: A Knowledge Transfer Process

Before talking about machine teaching, it is worth talking about teaching in general, as many of the principles of teaching as we know it from daily life are relevant to machine teaching as well. Furthermore, as we will soon see, human teaching also plays a critical part in the knowledge flow into machines.

Teaching is the world’s largest profession when measured by the number of workers, and our education systems span from elementary school all the way to universities. Our society has established complicated and detailed systems for knowledge transfer in every field relevant to human life, and the average human spends decades acquiring knowledge through these educational systems.

Almost every field of our lives is studied in depth by experts and has some degree of knowledge transfer hierarchy, from top world experts to knowledge end users, while maintaining a knowledge base of some kind. This knowledge transfer process exists in almost every domain of our life: professional, science, religion, law, society, sports, health, culture, art, mystics, athletics, and many others.

Each domain has its own authorities, experts, books, research efforts, progress, and history. This knowledge is evolving through time and updated as we discover new things about the domain, constantly communicated from authoritative individuals and organizations to its end users.

The knowledge about cats is no different in that respect: there is a knowledge flow from global cat experts all the way to pet owners. As of today, knowledge’s final destination is always humans. AI is about continuing this information flow into machines.
Here is a possible view of this information flow path in the context of cats:

Teaching is an information communication process. This core principle does not change when it comes to machine teaching; machine teaching just adds two new points to this information flow path:

  • Domain expert 🠚 Data labeler: Human to Human
  • Data labeler 🠚 AI model: Human to Machine (bot)

AI – The Next Communication Leap

It is time to look at AI from 30,000 feet again. In the first 30,000-foot view, we looked at AI from the industrial and economic value perspective of human labor automation. This time, we will look at the same process from the perspective of scientific and social progress.

The process in which we collect data, label its patterns, and transfer this information into a machine that will automate the pattern detection is a communication process in its purest form. It is a new communication method between humans and machines, and as we will see, every new communication method in human history has created significant disruptions across many aspects of our life: health and quality, social, economic, industrial, educational, and political.

Before AI, the only possible human communication with machines was through code, distributed as software applications. Programmers coded clear, accurate instruction sequences that were later executed by machines, and the languages used in this process are the programming languages. The ability to communicate with machines started about 60 years ago and sparked the most dominant industry around us, the software industry, which is said to be “eating the world” by constantly automating more and more human labor.

AI brings the next wave of labor automation, often referred to as software 2.0. Everybody can now teach a machine skills that were considered impossible not too long ago. The programming is no longer done through code but through positive and negative examples. Let us dive into this new programming language, which is essentially a communication method that allows everybody to code very complex skills into machines without coding.

Communication Basic Principles

What is communication?

Whether it is a call on the phone, a blog post you read online, or a traffic sign you pay attention to when driving, all communications share the same main steps:

  • A goal the sender is trying to achieve.
  • Information which represents that goal.
  • A message that encodes the information for a specific noisy channel.
  • Transmission of the message as a signal over the noisy channel.
  • Decoding of the encoded message by the receiver while filtering the noise.
  • Interpretation of the information by the receiver.


Here is what this generic schema looks like:


Take spoken communication as an example. When we ask someone around us to “wait for a moment”, our spoken sentence matches the above schema as follows:


Whether we are having a conversation with each other or browsing a web page, all our communication follows the above schema.

Generalizing Communication

We as humans mostly communicate using our natural spoken language, which is part of the much greater communication ecosystem found in nature. Here are some broader examples of communication languages you can find in nature:

Figure 41, Poison frog,
Figure 42 Bee Dance, Jüppsche,commonswiki
Figure 43 peacock in Toronto, Benson Kua wikipedia


While animal behavior and communication are remarkable topics (and for that reason I have invested significant academic time studying them), we will skip the general theories of communication among living things and focus on human communication.

We as humans use all our senses to communicate:

Figure 44,


But the most popular communication channel we as humans use is spoken language. While language is not unique to humans, we did evolve our linguistic communication into the most sophisticated system in nature. The most distinguishing quality we have compared to other species’ languages is the ability to have high-resolution, location-agnostic mass communication between millions of individuals.

Researchers have attributed most of the human species’ progress and dominance to progress in communication, which allowed us to scale by grouping individuals and letting them specialize and synchronize to achieve bigger and bolder goals that were otherwise unachievable by smaller groups.

Indeed, each communication leap in our evolution triggered technological and social leaps. Here is a brief overview of these evolutionary leaps:

Speaking: High resolution, Dynamic exchange of information.

Figure 45


Drawing: Time agnostic information exchange, we can send messages generations ahead

Figure 46


Writing: Location agnostic information exchange, we now carry messages far away

Figure 47 Dead Sea Scroll — the World’s Oldest Secrets,Ken and Nyetta


Mass communication: One human can talk to millions of others daily

Figure 48 Peter Small demonstrating Gutenberg printing press reconstruction, DTParker1000, flicker


Long distance and real time: Two humans can talk while being hundreds of miles away

Figure 49 Wallace Study-Telegraph,John Schanlaub from Lafayette,IN, USA


Mass real-time communication: Few humans can communicate with millions of others in real time

Figure 50 Pixabay, TV & Radio


Anyone, anywhere: All humans can talk to each other anytime and anywhere


Figure 51 Nokia 9110 and Nokia 9000, wikimedia, Oldmobil


Social media: Every human can talk to millions of others, any time anywhere.

Figure 52


Throughout our history, every communication leap has changed the world, and AI is no different in this respect. The fact that there is nothing intelligent in machines, and that intelligence is a biological quality, does not mean the change is small. AI is essentially introducing a new communication method, bringing disruption in the form of labor automation. I tend to think of calling this new automation approach “AI” as a very successful marketing move. My take is simple: intelligence is a biological quality that evolved as part of species evolution, allowing better survivability over others competing for the same resources. “What is intelligence?” is still an open question. I will share my best hypothesis on it by the end of this book.

AI – The Next Communication Leap

As we have seen, the communication leaps mankind has gone through in history were followed by social, economic, and cultural leaps (and challenges). AI is humanity’s next communication leap. Up until AI, the only people who could program machines were software developers. The era of mass, no-code programming has started.

Software developers are about 0.35% of the population and have already managed to “eat the world”. What will happen when the other 99.65% get access to machine programming?

Data Annotations – A New Programming Language is Born

In its most basic form, a data annotation is a marking on the data that hints at a semantic property of the data. The annotation essentially bounds the signal (the object) in the data, separating it from the noise (the environment).

In our cat analytics case, the annotation is a bounding box (the pattern) marking the cat (the label) in the image.


If we apply the same principle to text (named entity recognition of cat-like words on the cat page of Wikipedia):


During the training phase, the annotation is used to communicate signals from humans to machines. In this phase the machine learns:


During the inference phase, the annotation is used by the machine to communicate back what it “sees”. In this phase the machine works:


Once we start looking at the AI development process as a communication process, many things start to make sense. We are dealing with a machine teaching process that has the following communication characteristics:

  • Teaching phase: Human expert teaches a bot a skill through examples.
  • Bot: A question answering machine where each one of its answers is an annotation (or group of annotations)
  • Annotation: An information unit containing the answer (or part of it) to a question. It has two parts: pattern and label. The label category is the meaning of the signal, and the pattern coordinates separate that signal from the noise, which is essentially the rest of the data in the example.
  • Training: A process where the machine decodes the human messages (annotations) with the goal of learning the skill from the messages.
  • Inference: A process where the trained bot predicts annotations on new data it has not seen yet, automatically sending the messages (annotations) back to humans and applying what it learned from old data to new data.
  • Data: The noisy channel that enables this communication process.
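
To make these characteristics concrete, here is a minimal sketch of how an annotation could be represented in code. The class and field names (BoundingBox, Annotation, Example, training_messages) are hypothetical names chosen for illustration, not the API of any particular labeling tool, and the bounding-box pattern is only one of many possible pattern types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Pattern: coordinates that separate the signal from the noise (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


@dataclass(frozen=True)
class Annotation:
    """One message: a pattern (where the signal is) plus a label (what it means)."""
    label: str
    pattern: BoundingBox


@dataclass
class Example:
    """A data item (the channel) carrying zero or more annotations (the messages)."""
    data_id: str
    annotations: List[Annotation] = field(default_factory=list)


def training_messages(examples):
    """Training direction, human to machine: yield (data_id, annotation) pairs."""
    for example in examples:
        for annotation in example.annotations:
            yield example.data_id, annotation


# A labeler marks one cat in one image (file name and coordinates are made up).
labeled = Example("img_001.jpg", [Annotation("cat", BoundingBox(40, 30, 220, 190))])
messages = list(training_messages([labeled]))
```

At inference time the same Annotation structure would travel in the opposite direction: the bot produces it and the human reads it.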

The annotations are the messages embedded in the data, which the bot decodes and learns to mimic. Eventually the bot is able to manifest the learned skill by supplying annotations on new data independently of its human teachers.

The annotations are the programming language that allows AI bots and humans to “talk” to each other: an agreed language, understandable by both parties, which allows humans and bots to exchange information.
Time to dig into this language, which enables this new type of communication and allows humans and machines to exchange information and meaning. Do not get confused by the many technicalities around AI. AI development is, in essence, the process of creating a smooth communication channel between humans and machines, and most of its challenges are communication challenges at their core.
So what does the language in which we communicate with the machines look like?

ML Pointer

You might be tempted to think that self-supervision contradicts the above and argue that the future of AI is unsupervised or self-supervised learning. Try to resist that temptation unless your work is defined as pure research.

Machines by themselves can automate pattern recognition, and there is no doubt that self-supervision is a great tool for capturing the patterns in the data and significantly improving the learning process. Yet the meaning of these patterns can only be assigned by humans, and that assignment changes as we humans change our views, society, rules, and behaviors over time. This change is at the core of model drift. Bots cannot tell right from wrong by themselves, and labeling in its purest form is a right/wrong signal given to the machine, teaching it what we as humans currently accept as legitimate output.

The AI Language

Language definition from Wikipedia: “A language is a structured system of communication used by humans. Languages consist of spoken sounds in spoken languages or written elements in written languages. Additionally, a language may consist of other symbolic elements like hand gestures in sign languages.”
As you can see, Wikipedia got most of it right, but in my opinion animals have languages as well. Any bot that learns this fact from Wikipedia learns the wrong fact, in my opinion. Luckily for me, Wikipedia also contains my perception of the world on that matter under a different entry.
Animal language definition from Wikipedia: “Animal languages are forms of non-human animal communication that show similarities to human language. Animals communicate by using a variety of signs such as sounds or movements. Such signing may be considered complex enough to be called a form of language if the inventory of signs is large, the signs are relatively arbitrary, and the animals seem to produce them with a degree of volition (as opposed to relatively automatic conditioned behaviors or unconditioned instincts, usually including facial expressions). In experimental tests, animal communication may also be evidenced through the use of lexigrams (as used by chimpanzees and bonobos).”

Wikipedia contradicts itself. A bot answering the question “Do animals have language?” and using Wikipedia as its reference will find it impossible to get it right, since Wikipedia itself does not agree on the “truth” here.
The above exposes a big issue of our era: how can we tell true from fake information, and how do algorithms (AI) decide whether a piece of information is legitimate? This is a very challenging topic which is outside our scope, but how can we expect a bot to speak the truth when we, as its teachers, don’t agree on it?

We will focus on how language delivers information and leave the truthfulness question of that information to others. In our simple case, we have a good consensus on what a cat is.

Language consists of 3 parts:

  • Signs – Words, sounds, symbols
  • Meaning – How signs are interpreted to “things” in the world.
  • Code – The set of conventions and rules used to communicate meaning using signs.


The vocabulary is the collection of signs defined in each language, and the grammar (or syntax) is the code: the rules by which an entity can assemble complex stories (sentences) from simple signs (words). Sentences allow us to transfer more information using fewer signs. Our language is a natural information compression algorithm, designed for human physiology, environment, and needs.

When it comes to AI, we use the same vocabulary and meaning we already have in our native language to describe the world around us to the machines. However, the code in which we deliver these words or pieces of information is very different: data annotations are sentences describing the data, generated during the labeling process. Think about sign language: it is designed to exchange the same information as spoken language, yet without voice or text. Data annotations are a kind of sign language that AI bots understand.


The scientific study of languages (linguistics) is vast and greatly simplified here. Read further on utterances, morphemes, words, and affixes to learn about the structure of languages in more depth.

The AI Sentences

The sentence is a basic information unit (or message), and the code of most languages defines three core components in each sentence: subject, verb, and object.

Different languages’ grammar typologies define different orders for these components inside a sentence.

Here is a list of the component orderings of different linguistic typologies in different languages, using the sentence “The cat has a collar” as a reference.


Going forward, we choose the Subject-Verb-Object (SVO) linguistic typology, a sentence structure where the subject comes first (“The cat”), the verb second (“has”), and the object third (“a collar”).

People working on AI often think about annotations as technical markings that need to be completed, but annotations are the sentences we use when we teach our bot something.

If we imagine the annotation as a sentence we teach our bot, it has a subject, a verb, and an object:

  • The label of our annotation is the subject of the sentence.
  • The attributes of our annotation are the verb/objects of the sentence.

While it looks very theoretical, the above has practical takeaways. For example, we don’t expect annotations with multiple labels, just like we don’t expect a sentence with multiple subjects.

What are the vocabulary and relationships used in these sentences? The ontology is the word list our model is familiar with.


In some cases, the term taxonomy is used. Taxonomy is the subject part, i.e., the objects we recognize. Ontology is more complex than taxonomy, since it also states the relationships between entities:

The sentence “Flying cat”, for example, is invalid in our world: the attribute (or verb) “flying” cannot be used with the subject “cat”.
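
One way to picture this rule in code is a small validity check, sketched below. The ONTOLOGY dictionary, the validate function, and the "bird" entry are all hypothetical and exist only to illustrate the idea that each subject accepts some attributes and rejects others; a real ontology would be defined by the domain expert.

```python
# Hypothetical ontology: each label (subject) lists the attribute values it accepts.
ONTOLOGY = {
    "cat": {"status": {"standing", "sitting", "running"}, "collar": {"yes", "no"}},
    "bird": {"status": {"standing", "flying"}},  # illustrative entry, not from the text
}


def validate(label, attributes):
    """Return a list of problems; an empty list means the sentence is valid."""
    # One annotation carries exactly one label, like one subject per sentence.
    if not isinstance(label, str):
        return ["an annotation must carry exactly one label"]
    if label not in ONTOLOGY:
        return [f"unknown label: {label!r}"]

    allowed = ONTOLOGY[label]
    problems = []
    for name, value in attributes.items():
        if name not in allowed:
            problems.append(f"{label!r} has no attribute {name!r}")
        elif value not in allowed[name]:
            problems.append(f"{name}.{value} cannot be used with {label!r}")
    return problems


running_cat = validate("cat", {"status": "running", "collar": "yes"})
flying_cat = validate("cat", {"status": "flying"})
```

Here running_cat is an empty list, while flying_cat holds one problem, because "flying" is not among the status values this sketch allows for "cat".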

Ontologies, The Vocabulary

“There are only two hard things in Computer Science: cache invalidation and naming things”, attributed to Phil Karlton

What is ontology? Naming things.

Ontology is a branch of philosophy (metaphysics) and can roughly be described as “the connection of things”. When you are developing AI, you need to pay attention to the philosophical way you describe the world. Naming things was important in software 1.0 and becomes even more important in software 2.0.

While ontology is the professional term for the categories we use, you might still encounter different names for it, like “label maps”, “class list”, “dictionaries”, or “categories”. Ontology means these words are more than just a collection of words: the entities we have chosen have clear descriptions and relationships, allowing us to classify each entity into its proper, well-defined category in the context of our purpose.

Let us see an example of a simple ontology:

To get the philosophical aspect, try to answer: where would you put “things made out of wood” in the above ontology tree? If the developer of the above ontology ever has to answer this question, they will encounter data ambiguity issues. As of today, ontology architecture does not get the attention it deserves from the AI community, and a lot of the pains involved in developing AI start with ontology errors. I expect that to change, and I expect a new AI profession to evolve in the coming years: ontology architect.

Ontologies are the basic words we use when we classify things into categories.

Many domains have already predefined ontologies as part of their scientific progress:

  • Taxonomy – The ontology of living things
  • Nosology – The ontology of diseases
  • Wine tasting descriptors – The ontology of wine types.


Whenever you talk with an expert in something, one of the things you will notice is that the expert is deeply familiar with their domain’s ontology (or jargon): where you see a cat, they see a Chartreux (a blue French cat).
So how do we construct our cat ontology? For now, we have a single word in our vocabulary: “cat”. Since we already know we will need to support more cat types and other characteristics like collar and neutering, we can expect our single-word AI to have more words in its ontology in the future.

How many words should we expect?

The basic vocabulary of Hebrew has about 50,000 words and that of English about 200,000 words, which gives us an estimate of the ontology size used by intelligent beings. Note that the basic vocabulary excludes product names, professional terminology, and slang, which can grow the vocabulary size tenfold for each language.

The ontology we use in our AI is a compressed representation of the world. Since we are using a model that knows a single word, our model compresses the world into cat and no-cat things.

Our cat detector prototype has one word in it: Cat. If we want to recognize a Siamese cat or Persian cat then we have just expanded our vocabulary to 2 words.

But what if we want to detect all types of cats?

How many cat types are there? It depends on who you ask, and the philosophical cat ontology is still being debated by the official cat associations:


See how three different groups of global cat experts find it hard to agree on how many cat types there are (undefined truth again), something we thought to be trivial. This is normally the case when you construct an ontology with deep expertise: you encounter edge cases and ambiguities, and eventually you choose how to describe the world. Your ontology choices will be learned by your model and become the roots of its bias.
Let me emphasize this once again: before you have collected any data or even developed a model, you have already introduced bias into the future result. This bias comes from the way you choose to view and describe the world. It will exist no matter what model you choose to work with, and the only way to fix it is by changing the ontology. The roots of AI bias are in the brains of its creators, born before the first data example was even collected.

To avoid the philosophical debates and stay pragmatic about the customer’s needs, we consult a cat expert. The cat expert teaches us that these 20 types of cats are the most common and that we should be fine supporting them:



The customer approves the 20-cat list we have suggested, implicitly and silently agreeing with us to bias out dozens of cat breeds that we will not support.

Ontologies are an important factor in our data volume planning and accuracy: the larger the ontology, the more expensive, slow, and inaccurate our AI model will be. We will cover this in more detail in the next chapter, on data planning.

Ontology Namespace Notation – Dot Separation

We view ontology as a family tree where each child node has an “is A” relationship with its parent node. The common case of ontology is the label list, where we do not declare relationships between the different entities in that list, just a simple flat list of options to choose from.

Let us take a basic example. The ontology of a detector for 3 types of cats can look like this:


Since ontology has a tree-like structure, we will mark a class (or label) as a dot separated label, where the parts are defining a path on the ontology tree. So, for the above ontology all of the following notations are valid classifications: cat, cat.sphynx, cat.siamese, cat.persian.

This notation is software friendly (simple string parsing, storing, and searching), allows a tree-like description, and allows partial annotation, i.e., I can see it is a cat, I am just not sure which type of cat.
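
As a rough illustration of why the dot notation is software friendly, the sketch below stores the three-type cat ontology as a nested dictionary and checks labels with plain string operations. The names ONTOLOGY_TREE, path_of, is_valid, and is_a are hypothetical helpers written for this example.

```python
# The three-type cat ontology as a nested dictionary (hypothetical representation).
ONTOLOGY_TREE = {"cat": {"sphynx": {}, "siamese": {}, "persian": {}}}


def path_of(label):
    """Split a dot-separated label into its path on the ontology tree."""
    return label.split(".")


def is_valid(label, tree=ONTOLOGY_TREE):
    """True when every part of the label follows an existing branch of the tree."""
    node = tree
    for part in path_of(label):
        if part not in node:
            return False
        node = node[part]
    return True


def is_a(label, ancestor):
    """True when `label` equals `ancestor` or sits somewhere below it."""
    return label == ancestor or label.startswith(ancestor + ".")


valid_labels = [name for name in ("cat", "cat.sphynx", "cat.tiger") if is_valid(name)]
```

A partial annotation such as "cat" and a full one such as "cat.persian" both pass is_valid, and is_a("cat.persian", "cat") holds, so a search for all cats needs only a prefix match.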


While labels are used to define the subject, it is common to use attributes to represent the object and/or verb.

We can now create any label-attribute combination for our sentence, as long as it makes sense. Suppose we add attributes like standing, sitting, running, and collar to our ontology:


Now we can construct complete sentences like these:

  • The running sphynx has a collar: cat.sphynx, status.running, collar.yes
  • The Persian cat is sleeping: cat.persian, status.sleeping


How many different sentences can we represent?

The ontology size is the number of unique sentences we can generate using our ontology. The above ontology can be used to generate 18 different sentences using the different label/attribute combinations.
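
One reading that reproduces the figure of 18 is sketched below. It assumes three cat labels, three status values (standing, sitting, running), and a yes/no collar attribute, with every sentence carrying exactly one value per attribute. These assumptions are ours; an ontology with optional attributes or extra status values would give a different count.

```python
from itertools import product

labels = ["cat.sphynx", "cat.siamese", "cat.persian"]
attributes = {
    "status": ["standing", "sitting", "running"],
    "collar": ["yes", "no"],  # assumption: collar is a yes/no attribute
}

# Every sentence is one label plus one value for each attribute.
sentences = [
    (label, *(f"{name}.{value}" for name, value in zip(attributes, values)))
    for label, values in product(labels, product(*attributes.values()))
]

ontology_size = len(sentences)  # 3 labels x 3 statuses x 2 collar values = 18
```

The count is a product, so adding a single attribute with two values doubles the number of sentences the labelers and the model have to cover.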

The Recipe

We are no longer a cat/no-cat model. Our model now needs to say what cat type it sees, and our data labelers have no idea how to properly classify these 20 types. Enter the recipe.

The recipe is the set of data annotation instructions given by the domain expert to the data labeler, narrowing the expertise down to a simple question: how do I tell if it’s a Persian cat?

The recipe uses human-readable instructions that explain to the data labeler how to label a given data example. The more expert our AI becomes, the more critical our recipe becomes. Here is our AI knowledge flow again, this time in the context of the recipe:


The recipe is the set of translation instructions between human natural language and the AI language. The data labeler is the translator, getting the full complexity of the knowledge as input and outputting very clear annotations according to it:

I often see AI teams making the mistake of underestimating the talent and impact of data labelers. These folks are the developers of the AI era: they get requirements and deliver programs, and investing in the documentation and training of these workers is critical. If you think about this process as AI programming: would you discount the impact of your programmers’ talent on your software quality?

The Recipe is a Prerequisite to Data Quality

The recipe is the instructions list our data labelers are using to label the cat type properly. It contains enough examples, rules, and explanations to allow our data labelers to learn how to label a cat type as if they were cat experts themselves. A lot of human learning and education takes place behind the scenes of AI development, where domain experts constantly refine the instructions and try to prevent mistakes and ambiguities.

AI narrowness plays an important role in the recipe as well. It takes years of learning to become an expert in any field (cats included), but most people will be able to tell the difference between Siamese and Sphynx cats within an hour or two of training. Narrow AI is narrow information transfer from top field experts all the way to machine learning models. The narrowness is what keeps errors small along this communication flow, both in the human-to-human communication (recipe) and the human-to-machine communication (annotations).

A good recipe allows clean and fast training of data labelers, producing accurate, high-quality data.

Recipe – Significant Cost Reduction Opportunity

Domain experts are hard to find, have very limited time, and are usually very expensive. AI development needs a constant supply of labeled data, produced by dozens to thousands of people. A clear and accurate recipe enables fast and accurate knowledge transfer between humans with different levels of expertise and cost, eventually leading to much faster labeling while taking the cost of labeled data down. If a complex annotation that requires expertise can be broken into two smaller annotations, one of which does not require expertise, break the single task into two.

The recipe exposes the need for a human training system as part of our machine learning tools. This training system has the following properties:

  1. Training mechanism – teach the humans the skill.
  2. Qualification – test which labelers are qualified to do the job and understand the instructions.
  3. Scoring – ongoing scoring of each human labeler’s results.
  4. Example Referencing – while labeling, have a library of references, “what if/how-to” examples to be gathered and available to the labeler to refresh.
  5. Feedback – the ability to raise issues/questions on bad labeling or corner cases by experts to data labelers and vice versa.
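
The qualification, scoring, and feedback properties above can be sketched with a small amount of code. The sketch assumes the domain expert has labeled a small "gold" set of examples and that a labeler qualifies when their agreement with it passes a threshold. The function names, the gold-set approach, and the 0.9 threshold are illustrative assumptions, not a prescribed process.

```python
def score(labeler_answers, gold_answers):
    """Fraction of gold examples the labeler labeled exactly as the expert did."""
    if not gold_answers:
        raise ValueError("the gold set must not be empty")
    correct = sum(
        1
        for item_id, gold_label in gold_answers.items()
        if labeler_answers.get(item_id) == gold_label
    )
    return correct / len(gold_answers)


def disagreements(labeler_answers, gold_answers):
    """Items to send back as feedback: missing or different from the expert label."""
    return sorted(
        item_id
        for item_id, gold_label in gold_answers.items()
        if labeler_answers.get(item_id) != gold_label
    )


def is_qualified(labeler_answers, gold_answers, threshold=0.9):
    """Qualification test; the 0.9 default is an arbitrary illustrative value."""
    return score(labeler_answers, gold_answers) >= threshold


# Made-up example: the expert's labels versus one labeler's answers.
gold = {"img_1": "cat.siamese", "img_2": "cat.sphynx",
        "img_3": "cat.persian", "img_4": "cat.sphynx"}
labeler = {"img_1": "cat.siamese", "img_2": "cat.sphynx",
           "img_3": "cat.siamese", "img_4": "cat.sphynx"}

labeler_score = score(labeler, gold)
to_review = disagreements(labeler, gold)
```

In this made-up case the labeler agrees with the expert on three of four items, so the score is 0.75 and img_3 goes back as feedback, which is also a hint that the recipe may need a clearer Persian versus Siamese example.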

Ontology Evolves and Changes With Time

Expect the recipe and ontologies to change and evolve over time, just like software has releases and versions. Early in a model’s life, the recipe’s instruction definitions change frequently as we add more data and improve the modeling for better results. Once the labeling instructions (and therefore the model architecture) stabilize, we can start scaling. Scaling a non-stable recipe is a painful and expensive process. You will know your recipe is stable when you land a new customer or location, the recipe and ontology do not change, and you reach your previous accuracy after some iterations of collecting, labeling, and training on data with the existing recipe (no updates).

A stable recipe and ontology will likely hold as long as the product requirements or environment don’t change, yet eventually they always tend to change to match new customers, features, and environments. We will review how that is managed when discussing ontology forks.

With our labeling workforce ready and recipes properly defined, we are ready for the next scale of our cat analytics.

Next Chapters

Chapter 9

We are preparing to launch our AI app. We have basic models that are functional, we have agreed with the pilot customer on a calibration period that allows our models to adjust to the fresh data, and the data interfaces (APIs) with customers have been defined. In this chapter we will dive into the preparation and planning needed for launching and scaling our app deployment.