This is the first part of our “Computer Vision Use Case Breakdown” series. In this part, I’ll discuss the different methods for sharing knowledge with a computer vision model. Be sure to stay tuned for our second post, “Uncovering AI Tactics For Solving Real-Life Problems,” where I’ll present examples of questions people want machines to answer in real-life scenarios, more complex scenarios you can tackle using deep learning, and how to break down complex cases into simpler, more efficient tasks.
Building effective AI models is essentially teaching computers to learn like a human. Traditionally, computer vision has forced engineers to hand-select which features to look for in an image in order to teach the computer to identify a certain object, and to define a corresponding set of features for each class of objects. As the number of classes increases, this method quickly becomes complex: color, edges, and contrast are just a few of the aspects to consider, and each comes with many parameters that have to be manually fine-tuned by the engineer.
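To make this concrete, here is a minimal sketch of the traditional, hand-engineered approach, using HOG edge histograms and a linear SVM (the random arrays simply stand in for real grayscale images):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Stand-in data: 20 random 64x64 "grayscale images" from two classes.
rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))
labels = rng.integers(0, 2, size=20)

# The engineer decides what to look at: HOG edge-orientation histograms.
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

# The classifier only learns weights over those hand-picked features.
clf = LinearSVC().fit(features, labels)
```

Every new class means another round of choosing and tuning descriptors by hand.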
On the other hand, deep learning makes use of the concept of “end-to-end learning”: the algorithm is simply instructed to learn what to look for in each specific class, and it automatically determines the most prominent and descriptive features for each class by studying sample images. The difference in results is huge: with traditional methods, 70% accuracy was a good result, while 85% accuracy is quite common with deep learning.
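By contrast, here is a minimal end-to-end sketch (using PyTorch, with dummy tensors standing in for real images) in which the network learns its own features directly from raw pixels:

```python
import torch
import torch.nn as nn

# End-to-end: the convolutional layers learn their own features from raw
# pixels; nothing about color, edges, or contrast is specified by hand.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),  # final layer maps the learned patterns to class scores
)

images = torch.randn(8, 3, 64, 64)   # a dummy batch of RGB images
labels = torch.randint(0, 2, (8,))   # dummy class labels

loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()  # gradients adjust the features and the classifier together
```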
I’m going to address, at a high level, how to teach computers to learn like a human, which I’ll cover over a number of posts. I’ll uncover example use cases, discuss why variance matters so much, and address complexities in the AI industry and how to break them down into simpler tasks that will ultimately make it easy for you to teach your machines to learn like humans. So let’s get started!
Baby Steps: Teaching The Model
When we build an AI model, we are actually building a communication channel between human knowledge and a machine. We assume that the machine knows nothing (literally nothing), and your only way to teach it is by feeding it examples. The machine’s only goal is to learn the mission you give it (a neural network’s final layers make the decision based on the patterns the network has learned from those examples).
Let’s take a simple example – teaching a toddler (who is already smarter than an untrained neural network). If you need to teach them what a bird is, you will avoid showing pictures where the bird is not the predominant element in the image. If the bird is in the picture alongside other elements, in most cases you would point at the bird to “focus” the toddler on the relevant object you want them to learn. With that said, the toddler still needs to understand that a bird can be in the sky, on the ground, or partially hidden by a tree.
Let’s take another, slightly more complex example. Imagine someone is trying to teach you a new concept called “padada.” You have no idea what it means, and all you have are pictures containing 1,000 other elements you do not know. You might be able to pick it up by looking for similarities between images and assuming that whatever traits they share translate to “padada,” but it will be extremely difficult to tell whether you actually learned what “padada” is.
If your instructor were to point at the relevant part of the image and tell you that it is “padada,” you would naturally grasp the concept faster.
With the above examples in mind, we need to focus on a few basic steps:
- Define the question correctly and make sure it is “simple” enough to understand (see examples below)
- Feed enough samples to represent the “real world,” along with as many edge cases as possible, in order to maximize variance (see the augmentation sketch after this list)
- Make sure the answers you feed capture the required information without changing the question (e.g., in many cases classification can’t answer detection questions, since saying an image contains a “bird” doesn’t say where it is or how many there are)
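As a small illustration of the variance point in the second step, an augmentation pipeline like the torchvision sketch below can stretch a limited set of samples across more positions, crops, and lighting conditions; it supplements, but does not replace, genuinely diverse data:

```python
from torchvision import transforms

# Random flips, crops, and lighting changes simulate birds at different
# positions, scales, and exposures from a single labeled photo.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
])

# tensor = augment(bird_image)  # bird_image: a hypothetical PIL.Image sample
```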
The examples below are the 4 most common ways to convey the information in an image to the model.
Classification
Label of the entire image = bird

Bounding Box
The boxed area is “bird” (by default, the rest is noise that we need to learn to ignore):

Polygon
The green area is “bird” (by default, the rest is noise we need to learn to ignore):

Semantic Segmentation
Green pixels are of the class “bird”
Purple pixels are of the class “flower”
Orange pixels are of the class “background”
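To make the four formats concrete, here is a rough sketch of how the same “bird” answer might be encoded at each level of detail; the field names and values are illustrative, not any particular tool’s schema:

```python
import numpy as np

# Classification: one label for the whole image.
classification = {"image": "img_001.jpg", "label": "bird"}

# Bounding box: label plus a rectangle in pixel coordinates.
bounding_box = {
    "image": "img_001.jpg",
    "label": "bird",
    "box": [120, 45, 260, 190],  # x_min, y_min, x_max, y_max
}

# Polygon: label plus the vertices outlining the object.
polygon = {
    "image": "img_001.jpg",
    "label": "bird",
    "points": [(130, 60), (250, 55), (255, 180), (125, 175)],
}

# Semantic segmentation: one class id per pixel (0=background, 1=bird, 2=flower).
segmentation_mask = np.zeros((480, 640), dtype=np.uint8)
segmentation_mask[45:190, 120:260] = 1  # pixels belonging to the bird
```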

The above methods progress from providing the least amount of information to the model to the most. The first couple of examples feed the machine a lot of noise, while the latter examples minimize the noise and hone in on the primary information that needs to be learned.
The SNR (Signal-to-Noise Ratio) is the common measure that compares the level of the desired signal to everything else in the label. In practice, it means: “how much of the information in the sample I got is actually important for me to understand?”
There are 2 main considerations regarding SNR values we try to optimize:
- The ratio between the actual signal and the size of the label (a rough way to compute this is sketched after this list):
- Lower than 0.5% = Unreliable; can’t be used for your given industry
- 0.5%–1% = Can work, but only with huge amounts of data and compute
- 1%–10% = Should work in most cases
- More than 10% = Easy for the model to learn
- Differentiation of the label from similar patterns in the data:
- We want to avoid marking labels that can confuse the model. For example, if we draw a box around a 12-pixel car that is nothing more than black dots, we might accidentally teach the model that 12 black pixels are a car
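As a rough back-of-the-envelope check for the first consideration, the sketch below approximates the signal ratio as labeled area over image area (one reasonable reading of the rule of thumb above, not a formal SNR definition) and buckets it against those thresholds:

```python
def signal_ratio(box, image_shape):
    """Rough proxy for the signal ratio: labeled box area over total image area."""
    x_min, y_min, x_max, y_max = box
    height, width = image_shape
    return (x_max - x_min) * (y_max - y_min) / (height * width)

ratio = signal_ratio(box=[120, 45, 260, 190], image_shape=(480, 640))

# Bucket against the rules of thumb above.
if ratio < 0.005:
    verdict = "unreliable"
elif ratio < 0.01:
    verdict = "needs huge amounts of data and compute"
elif ratio < 0.10:
    verdict = "should work in most cases"
else:
    verdict = "easy for the model to learn"

print(f"signal ratio = {ratio:.2%} -> {verdict}")  # ~6.6% -> should work in most cases
```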
So why shouldn’t we always just get the most information?
Simply put, because it has its costs, and usually the more information you provide, the more you need to invest (time = money). The bird classification in the example above can take 1–4 seconds to complete, while the semantic segmentation can take hours if your images are complex (multiply that by thousands of images).
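To see why this matters at dataset scale, here is a quick back-of-the-envelope calculation; the per-image times are illustrative assumptions based on the ranges above:

```python
num_images = 10_000  # hypothetical dataset size

classification_hours = 4 * num_images / 3600  # upper end of 1-4 seconds per image
segmentation_hours = 1 * num_images           # assuming ~1 hour per complex image

print(f"classification: ~{classification_hours:.0f} labeling hours")  # ~11 hours
print(f"segmentation:   ~{segmentation_hours:.0f} labeling hours")    # ~10,000 hours
```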
Therefore, in order to keep our mission cost-effective, we need to find the right trade-off and the right tool (classification, detection, detection plus classification, or pixel-level semantic segmentation). In many cases, using polygons will not improve the results over a simple bounding-box mission, but it will be far more expensive.
Summary
Deep learning teaches a computer to solve a riddle. In order to make the learning process effective, we need to feed a variety of examples while enhancing the value coming from each sample.
It is important to keep in mind that the computer only knows what you teach it. Therefore, you need to validate and optimize the signal in order to capture the variance in your data.
Stay tuned for our next blog post, where I will feature samples from multiple domains across varying industries. We will also discuss how to validate your use case and what factors you should be considering for it.