
The “Data Loop”

I recently came across an article in Forbes that really struck a chord with me. It boldly stated that we’re going to be hit with a “wave of billion-dollar computer vision startups.” This immediately brought me back to a discussion I had several years ago with my colleagues at university, where I argued that the first step toward automating human work was taken in the Industrial Revolution.

Powering the Fourth Industrial Revolution

We’re currently going through the Fourth Industrial Revolution. Mankind has invented new ways to automate processes, enabling people to focus on tasks that are more complex and less technical while accelerating the path to higher-velocity outcomes. Replicating human capabilities enables us to create solutions for tasks that involve:

  • Physical labor (engine and automatic assembly lines)
  • Cognition (the computer)
  • Senses such as audio and vision (deep learning)
A solution that uses human cognition, vision, and mechanical automation

My views are shared by Eran Shlomo, our CEO at Dataloop. He states that artificial intelligence is powering the Fourth Industrial Revolution and that companies across verticals are competing to capitalize on its promise. In fact, a 2020 survey of C-suite executives at leading firms revealed that more than 90 percent are investing in AI, yet less than 15 percent “have deployed AI capabilities into widespread production.” While 15 percent is a respectable number, it’s important to note that many of those who claim to be in “widespread” production haven’t actually been able to scale their activity.

The Road To Production Is Bumpy 

The challenges that humanity had to overcome to enable computer vision with deep learning are behind us. Now there is a brand-new set of challenges that every company has to tackle on its journey to production.


The phrase “garbage in, garbage out” is common in data science. A company that wants its model to produce good results needs its ground truth to be accurate. High-quality data improves the odds of reaching top model accuracy while saving significant labeling resources. Models must be constantly fed with accurate data to maintain and increase their confidence level.

Different use cases suffer different consequences from mistakes, even within the same segment. The cost of error when selecting only ripe fruit is small (an unripe fruit reaches the customer), while failing to spot a disease in plants can have catastrophic results (the entire field needs to be discarded). Both situations are critical to the business and can incur significant loss. When it comes to medicine or autonomous vehicles, however, it can literally be the difference between life and death.

When the cost of error is so high, it can have a critical impact on your business. This further reinforces the need to strive for the highest accuracy possible.


Creating the first iteration of your solution is hard, but it’s only the starting point to a much greater goal –  scaling up. When you scale up you tackle three main issues:

  1. Sources in various environments: The more widespread your coverage is, the more data sources you’ll have. That means increasing physical items (such as cameras) which you’ll need to manage while getting higher variance in your data because of the different environments.
  2. Data: Scaling up means that your data grows with the number of deployments. If you had 3 installations and now you have 3,000, your data size increases 1,000x per time period. What used to be an annotation task for 5 people becomes an annotation task for 5,000 people, and every time a human has to touch the data, it translates to higher expenses for the company. For that reason, “data pipelines” become critical: they reduce the amount of data handled by humans, letting people verify only low-confidence model predictions.
  3. Model iterations: The model now handles data with much higher variance. In most cases, that results in a lower success rate and requires iterations of model improvement (or active learning) that incorporate the new data. To improve your results, you need a good visualization of your data so you can look for weak spots, such as insufficient variance in the background or new environments with new or unexpected elements.
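The low-confidence routing mentioned above can be sketched in a few lines. This is only an illustrative sketch: the threshold value and the prediction dictionary format are assumptions, not a real API.

```python
# Hypothetical sketch: route only low-confidence predictions to humans.
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; tune per use case


def route_predictions(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split model predictions into auto-accepted and human-review queues."""
    auto_accepted, needs_review = [], []
    for item in predictions:
        if item["confidence"] >= threshold:
            auto_accepted.append(item)   # no human needs to touch this
        else:
            needs_review.append(item)    # open an annotation task for it
    return auto_accepted, needs_review


preds = [
    {"id": 1, "label": "ripe", "confidence": 0.97},
    {"id": 2, "label": "ripe", "confidence": 0.62},
    {"id": 3, "label": "unripe", "confidence": 0.91},
]
auto, review = route_predictions(preds)
```

With a 0.9 threshold, only the second prediction would reach a human annotator; the other two flow straight through the pipeline.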


Going to production is usually an important step for companies aiming to scale (typically by getting more customers).  You invest money in your product and you start earning money, but it doesn’t stop there. You’re still going to have growing costs such as: 

  1. Production of data sources: You need to produce even bigger numbers than before.
  2. Data storage: An autonomous vehicle creates about 4 TB of data per day. Multiply that by the number of cars you have in production, and you can see how quickly the numbers escalate. 95% of companies face this issue in their scaling-up phase (they need to use production data to keep improving the model).
  3. Annotators: The annotators in your company (external or internal) are crucial. The work of translating data into an agreed language (model ontology) is a key part of any computer vision project. By failing to build smart data pipelines, you risk accumulating huge costs when scaling up, because the data keeps growing by huge multipliers.
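The storage figures above make the point quickly when you do the arithmetic. This back-of-the-envelope sketch uses only the 4 TB/day figure from the text; the fleet size is an assumed example.

```python
# Back-of-the-envelope storage growth, using the 4 TB/car/day figure above.
TB_PER_CAR_PER_DAY = 4


def daily_storage_tb(num_cars):
    """Raw data produced per day by a fleet, in terabytes."""
    return num_cars * TB_PER_CAR_PER_DAY


# An assumed fleet of 100 cars already produces 400 TB per day,
# i.e. 146 PB per year before any filtering or compression.
per_day = daily_storage_tb(100)
per_year_pb = per_day * 365 / 1000
```

Even a modest fleet lands in petabyte-per-year territory, which is why most companies cannot afford to store, let alone annotate, everything their production deployments generate.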
The difference between a pre-production and a production workflow

Create Your Own Dataloop

There are four critical tool sets you’ll need in order to move to production. Most companies have at least one of the four, but only a small percentage have all of them. The farther you are in your journey, the more likely you’ll need the full set.

  1. Annotation tool — This is the first step in creating a computer vision model. It doesn’t matter if you do classification, detection, or segmentation; if you create a computer vision model, you’ll need to annotate your data.
  2. Workforce management — You can use internal, external, or hybrid data annotation teams. Either way, you need to assign tasks, run QA processes, and get analytics of your annotation team’s performance.
  3. Data operations — The more your business is dependent on data, the more you need to understand it. These days you must be able to sort, filter, create versions, and visualize your data in order to get its real benefits.
  4. Data Pipelines — This last step is usually the hardest. It means taking all the previous steps and combining them with your deep learning models using code-based triggers. You pre-run your model over the data, filter out everything but the low-confidence results, open a task with them for human annotators, and after QA send the corrected results back to the model as new training data. It might sound simple, but even simple pipelines can get very complex. Video anonymization, for example, requires uploading the data, splitting the video into frames, face detection, face anonymization, and reassembling the video.
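The human-in-the-loop flow described in step 4 can be sketched end to end. Every function name here is a hypothetical placeholder (not the Dataloop SDK or any real library); the point is the shape of the loop: model first, humans only where confidence is low, corrected results fed back.

```python
# Illustrative human-in-the-loop pipeline; all names are hypothetical.

def human_annotate(items):
    """Placeholder for the manual annotation + QA step."""
    return [{"id": i["id"], "label": "corrected", "confidence": 1.0}
            for i in items]


def dummy_model(item):
    """Toy stand-in for model inference."""
    return {"id": item["id"], "label": "car", "confidence": item["score"]}


def run_pipeline(raw_items, model, threshold=0.9):
    """Pre-run the model, auto-accept confident results,
    and send only the rest to human annotation and QA."""
    accepted, for_humans = [], []
    for item in raw_items:
        pred = model(item)                  # model inference first
        if pred["confidence"] >= threshold:
            accepted.append(pred)           # no human touches this item
        else:
            for_humans.append(item)         # queue for an annotation task
    corrected = human_annotate(for_humans)  # manual annotation + QA
    return accepted + corrected             # feeds the next training round


results = run_pipeline(
    [{"id": 1, "score": 0.95}, {"id": 2, "score": 0.4}],
    dummy_model,
)
```

In this toy run, only one of the two items reaches a human; the design choice is that human effort scales with model uncertainty rather than with raw data volume.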

A Real Life Example

One of Dataloop’s customers is a leading provider of automotive technology. Up until now, I’ve outlined the challenges; here’s the part where we get a glimpse of the solution. The three main areas we focused on were:

  1. Training annotators
    1. Automatic qualification exams
    2. Automatic work assignment and redistribution 
    3. Annotator scoring based on results
  2. Data workflow automation
    1. Initial model inference 
    2. Data anonymization (face and license plate)
    3. Detect frame relevance to reduce manual tasks
    4. Frames to video
  3. Reporting and quality control
    1. Consensus of annotators
    2. Compare annotators to model results with automatic issues (IoU, auto validation, smart QA)
    3. Daily performance and quality analytics 
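The IoU comparison mentioned under quality control is a standard metric for scoring annotator boxes against model predictions. A minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
# Minimal IoU (intersection over union) for axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    """Overlap score in [0, 1]; 1.0 means identical boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero if the boxes don't overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A QA rule might then auto-validate an annotation when its IoU against the model prediction (or against a consensus of annotators) exceeds some threshold, and open an issue otherwise.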

By using Dataloop, our customer reduced the amount of data reaching human annotators by around 90%. There was also an average increase of 1,200% in processed data items, all while maintaining high-quality data. This ultimately translates to a lot of money saved, both in the short and the long term.

Automation allows for more scalable and accurate data preparation workflows. Auto-annotations boost the manual annotation process, reducing human work to minimal editing.

Customer pipeline at a high level

Summed Up

The “human-in-the-loop” approach is the way to scale up and improve your model without going bankrupt along the way. By planning ahead and automating your data flow processes, you ensure you keep getting high-quality data and keep improving your model as you scale.

Join the Fourth Industrial Revolution, and create your own Data Loop.
