We recently sponsored the Computer Vision Festival, an awesome event where stakeholders in the AI and Machine Learning communities could come together, share, network and learn from one another.
As part of the event, AI industry expert and our very own CEO, Eran Shlomo, shared his thoughts on a new approach to guaranteeing successful machine learning projects. Check out the full session here, or read on for some of our highlights and how they apply across multiple industries today.
The Changing Face of Machine Learning
Traditionally, if you wanted to generate an algorithm or a form of cognitive automation, you would first need a domain expert. This domain expert would sit with machine learning experts, and together they would analyze the important features, before coding this into something machine friendly.
Today, in the era of the neural network, the whole process is changing for the better. We now take examples of the data, label them, and then train a machine learning model, which learns the important features for itself. The benefits and opportunities of this shift are huge. It becomes more of a ‘cookie cutter’ process, which means it’s easier to repeat, faster, more accurate, cheaper, and easier to scale.
The results are varied though; machine learning is still in its infancy. So, how can you get the most out of the new paradigm?
Keeping Accuracy Levels High through Narrow Neural Networks
Neural networks can do a lot, but only if they remain narrow enough. This is directly related to ontology size, meaning the size of the class list for any given model. Accuracy is of the utmost importance, and as the class list grows, the model becomes less able to tell the classes apart reliably. This is where you will need to split the model, to keep the accuracy levels at their peak.
Let’s look at a real-world example from the automotive industry. Imagine that you want to create a machine learning model for reading traffic signs, for autonomous vehicles. If you ask a single model both to identify specific traffic signs and to parse the words written on them, you’re asking it to handle too much information.
Instead, by splitting the model, you can have one for the text, and one for the sign itself, and your accuracy levels will be a lot higher. Of course, you may need to split the model again, for example by language of the text, or by location of the traffic signs. That’s fine, any machine learning initiative will be an amalgamation of a number of models, layered together.
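To make the idea of layering narrow models concrete, here is a minimal sketch in Python. It is an illustration only: `read_traffic_sign`, the toy models, and the image dictionary are hypothetical names invented for this example, not part of any real product or library.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for trained narrow models. Each takes an "image"
# (here just a dict of made-up fields) and returns a prediction.
Model = Callable[[dict], str]


def read_traffic_sign(image: dict,
                      classify_sign: Model,
                      text_readers: Dict[str, Model]) -> dict:
    """Route one image through two narrow models instead of one broad one."""
    sign_type = classify_sign(image)
    # Second split: choose a text model by language. If no reader exists
    # for that language, return empty text rather than guessing.
    reader = text_readers.get(image.get("language", ""), lambda _img: "")
    return {"sign_type": sign_type, "text": reader(image)}


# Toy models so the sketch runs end to end.
def toy_sign_classifier(image: dict) -> str:
    return image["shape_hint"]


def toy_english_reader(image: dict) -> str:
    return image["raw_text"].upper()


result = read_traffic_sign(
    {"shape_hint": "stop", "raw_text": "stop", "language": "en"},
    toy_sign_classifier,
    {"en": toy_english_reader},
)
print(result)
```

Each narrow model keeps a small class list of its own, and adding a new language means adding one more reader rather than retraining everything.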
Focusing on a Lot of Information, Rather than a Lot of Data
One of the main things to remember when you’re training a model is that it’s not about having enough data, it’s about having enough information. In the agriculture industry, for example, you could have a million photos of the same crop, but that wouldn’t help you with any other produce outside of that specific use case. What you need is a million examples of the variance within that type of crop. Context is incredibly important here.
If your dataset isn’t diverse enough, and you have thousands of images of the same type (e.g. 1000 red cars, or 1000 of the same breed of dog), you aren’t providing the model with any new information. This is why edge cases can so easily become the dominant factor when it comes to scaling your machine learning model.
Collecting information for machine learning initiatives follows the law of diminishing returns. As you collect more data, you need to work a lot harder to find more actual information. In every 1000 images, only 1 or 2 will count as edge cases, rare images that provide new details to train the model. That means that if you’re not careful, a new dataset of 1000 images will add very little to the improvement of your model, which is why it’s so important to choose images that add information to the training set.
This is one of the main reasons why scaling the labelling process is both difficult and expensive, and why companies need to find a way to select content that has a high level of information, rather than simply adding mass amounts of data. Let’s say your company has a budget for 10,000 images: how can you get the most out of that budget, choosing the images that will help your model to learn?
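One common way to spend a fixed labelling budget on information rather than volume is uncertainty sampling: label the images the current model is least sure about. The sketch below is an assumption on our part about how this could look, not a description of any specific product feature. The image ids and probabilities are made up.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labelling(predictions, budget):
    """Return the ids of the `budget` images the model is least certain about."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [image_id for image_id, _probs in ranked[:budget]]


# (image id, model's predicted class probabilities) -- illustrative values only.
predictions = [
    ("red_car_001", [0.98, 0.01, 0.01]),   # confident: little new information
    ("red_car_002", [0.97, 0.02, 0.01]),   # confident: little new information
    ("muddy_truck", [0.40, 0.35, 0.25]),   # uncertain: likely an edge case
    ("night_scene", [0.55, 0.44, 0.01]),   # uncertain between two classes
]

chosen = select_for_labelling(predictions, budget=2)
print(chosen)
```

With a budget of two, the near-duplicate red cars are skipped and the two ambiguous images are sent for labelling.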
As you collaborate with more people to get more edge cases, you’re also adding new challenges, such as quality assurance, category balancing, and reconciling the subjective opinions of different labellers. Let’s go back to the agricultural industry for a moment. One person might say that a particular fruit or vegetable is ripe in a specific image, while another person might say the produce is underripe. A lot of human labelling will come down to opinion. This isn’t the only problem. In many cases, domain experts will be rare, so how do we teach labellers? How do we make sure that all of these experts are aligned? How can we communicate new instructions, many of which change every day, to thousands of people? These are the real-world problems that we’re dealing with as we improve our training models.
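A simple way to surface this kind of disagreement is to collect several labels per image, take the majority vote, and send low-agreement items to a domain expert. The following is a minimal sketch under that assumption. The 0.75 threshold and the item names are arbitrary choices for illustration.

```python
from collections import Counter


def consensus(labels):
    """Return (majority_label, share_of_annotators_who_chose_it)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


def needs_expert_review(annotations, min_agreement=0.75):
    """List the items whose annotators agree less than `min_agreement`."""
    return [
        item for item, labels in annotations.items()
        if consensus(labels)[1] < min_agreement
    ]


# Four labellers per image -- illustrative data only.
annotations = {
    "tomato_17": ["ripe", "ripe", "ripe", "ripe"],
    "tomato_18": ["ripe", "underripe", "ripe", "underripe"],
    "tomato_19": ["underripe", "underripe", "underripe", "ripe"],
}

flagged = needs_expert_review(annotations)
print(flagged)
```

Only the evenly split image is flagged, so scarce expert time goes to the cases where opinion genuinely diverges.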
The sheer labor involved in manual labelling processes is also prohibitive. Think about a problem such as schema migration, using the automotive industry as an example. Let’s say you want to use a million images of cars to train a new model, and you need to mark whether the wing mirrors are visible. You’ll need to go back and label each one again, with this new schema.
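To illustrate what a schema migration involves on the data side, here is a small sketch, assuming labels are stored as simple records. The field name `wing_mirrors_visible` and the record layout are hypothetical. The new field is added as "not yet labelled" and every affected image lands in a relabelling queue.

```python
def migrate(record, new_fields):
    """Copy a label record, adding each new field as None (= not yet labelled)."""
    migrated = dict(record)
    for field in new_fields:
        migrated.setdefault(field, None)
    return migrated


def relabel_queue(records, field):
    """Ids of all images that still need a value for `field`."""
    return [r["image_id"] for r in records if r.get(field) is None]


old_records = [
    {"image_id": "car_001", "make": "sedan"},
    {"image_id": "car_002", "make": "suv"},
]

new_records = [migrate(r, ["wing_mirrors_visible"]) for r in old_records]
queue = relabel_queue(new_records, "wing_mirrors_visible")
print(queue)
```

Every existing image ends up in the queue, which is exactly why a schema change across a million images is so costly when done by hand.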
In the real world, machine learning models need to be able to continuously evolve and adapt. Think about Retail AI for example. The shelf content of a supermarket is a dynamic reality, changing by 20%-30% every quarter, with updated packaging considerations, new products, seasonal produce, and more. Drifts, biases, and edge cases are a guarantee.
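One rough way to watch for this kind of drift is to track what share of items seen in production belong to classes the model was never trained on. The sketch below is a simplified assumption of how such a check might look. The product names are invented.

```python
def unseen_share(training_classes, production_labels):
    """Fraction of production items whose class was absent from training."""
    if not production_labels:
        return 0.0
    known = set(training_classes)
    unseen = [label for label in production_labels if label not in known]
    return len(unseen) / len(production_labels)


training_classes = ["cola_330ml", "oat_milk_1l", "rye_bread"]

# Ten shelf items observed this quarter -- illustrative data only.
production_labels = [
    "cola_330ml", "cola_330ml", "oat_milk_1l", "rye_bread", "rye_bread",
    "oat_milk_1l", "cola_330ml",
    "cola_330ml_new_pack", "pumpkin_soup", "pumpkin_soup",
]

share = unseen_share(training_classes, production_labels)
print(f"{share:.0%} of shelf items are new to the model")
```

A rising share is a signal that the model's class list and training data need refreshing before accuracy degrades further.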
Moving the Model from the Lab to the Real World
Ignoring the realities of the neural network era can lead to grave mistakes, bottlenecks, and failures for companies who are looking to leverage machine learning. Let’s look at one of the most high-profile examples, Google Labs.
In 2016, Google Labs AI announced that they had a medical AI model that could perform at human expertise level in identifying diabetic eye disease from retinal scans. The company announced that their model showed 90% accuracy, and started a pilot in Thailand, a country that had around 200 specialists for roughly 5 million diabetes patients. They showed that AI would be able to diagnose in 10 minutes what a human would take between 2 and 10 weeks to identify. It was the perfect use case.
Or, was it?
First off, it’s important to recognize how long it took Google to move from the lab to the pilot, and from the pilot to the results. Both of these steps took 2 years to complete, an eternity in today’s fast-paced technological landscape. Secondly, the results themselves were shocking. The pilot was a failure for a number of reasons: low lighting, low quality images, unexpected variance, poor connectivity, and the rejection of 1/5 of the images. Google Labs had been unable to move the model from the lab to the real world. What appeared as 90% accuracy in the lab was far from it in real life, when they were testing the model on actual cases.
Many companies make similar mistakes, which can be described as treating a machine learning model as if it’s a piece of software. Deep learning models just don’t behave like software, and without continuous learning, they are likely to fail.
Unlike a software flow, a training model flow allows organizations to get a machine learning initiative ready for the real world, handling issues such as edge cases, labelling processes, precision, recall, accuracy, and an alignment with your specific business needs.
Dataloop is an industry-leading example of today’s intelligent data management platforms. These platforms reduce the time spent on data preparation, add the important step of validation, and offer the ability to continually test, train and optimize, using real-world data from production.
Continuous Learning: The Missing Puzzle Piece
At Dataloop, we offer an all-in-one data management platform for everything from labelling to data pipelines and beyond. We add automation where it can reduce error rates, encourage repeatability, and speed up your ROI, and we include the ‘human in the loop’ where expert validation needs to be intertwined for business growth and success.
With continuous validation from your production environment, your organization gets a machine learning model that is constantly learning from its own behavior, and behaves the same in the wild as it does in the lab.
Want to see how it works for yourself? Let’s schedule a demo.