
3 Ways to Accelerate Your Workflow Using Data Pipelines

A few months ago, I stumbled across this image (below) on social media and I could just imagine the long and tiring conversations between the business and development teams.

Business teams often exaggerate what AI can do for their organizations, expecting it to work instantly. But preparing messy, unstructured datasets is a time-consuming process. It’s no wonder that 87% of data science projects never make it into production. While that statistic is alarming, there are a few approaches that can reduce your data processing costs, one of them being custom data automation pipelines.

Dataloop’s Pipelines process data in a pipeline architecture: a series of nodes in which each node’s output is the input of the next, and each node can be triggered by different events. The pipeline moves data between nodes such as labeling tasks, FaaS (functions as a service), code snippets, ML models, and more.

Dataloop’s pipelines can:

  • Facilitate any production pipeline
  • Pre-process data and label it
  • Automate operations using applications and models
  • Post-process data and train models of any type or scale at the highest performance and availability standards

In this blog post, I’ll go over 3 pipelines that can help you spend more of your time developing high-performance computer vision applications.

Pipeline #1: Microtasks

A complex dataset results in a complex task, and complex tasks lead to human mistakes. One way to simplify this is to divide the work into micro-tasks, which lets annotators focus on specific labels, complete annotation tasks quickly, and send the data on to the next task in line (or to multiple tasks in parallel). This comes in handy when you have many objects with different labels, annotators with different specialties, or extra QA rounds happening at different times and places.
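To make the idea concrete, here is a minimal plain-Python sketch (not the Dataloop SDK) of fanning one complex item out into per-label micro-tasks, each assigned to its own specialist. The function, label names, and expert IDs are all hypothetical:

```python
# Illustrative sketch: split one complex annotation job into micro-tasks,
# one per label, so each annotator only handles their specialty.

def split_into_microtasks(item_id, labels, experts_by_label):
    """Create one micro-task per label, each assigned to its specialist."""
    tasks = []
    for label in labels:
        tasks.append({
            "item_id": item_id,          # every task points at the same image
            "label": label,              # the annotator only sees this class
            "assignee": experts_by_label[label],
        })
    return tasks

tasks = split_into_microtasks(
    "cells_001.png",
    labels=["pink_cell", "blue_cell", "green_area"],
    experts_by_label={
        "pink_cell": "expert_a",
        "blue_cell": "expert_b",
        "green_area": "expert_c",
    },
)
# Each expert's annotations can later be merged back onto the same item.
```

In a real Dataloop pipeline, the task-creation and merge steps are pipeline nodes rather than local function calls.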

Let’s say you need to analyze the image of these cells:

Here, you have 3 types of cells.

Three medical experts might need to annotate the same image separately and provide their own input.

The first expert annotates the first cell type (pink).

The second expert annotates the second cell type (blue).

The third expert annotates the last area (green).

So, instead of sending each task separately and then manually combining the results into the same image, you can automate the process with a customized pipeline.

For this micro-task process, we’ve created a template that can be easily implemented in any project. You can learn how to do this in our documentation here.

Pipeline #2: Trimming

The original data is rarely in the exact form needed for annotation, especially when dealing with video tasks. It could be, for example, hours and hours of CCTV footage that no human could annotate in one piece. So, most teams split the videos into smaller videos or frames. This improves an annotator’s productivity, gives visibility into task progress, and lets multiple annotators work on the same video.

Our most requested pipeline is the custom video trimming that automatically trims a large video into smaller videos, sends them to tasks, and stitches the annotations from all the trimmed short videos into a single one.

Let’s go over the 5 components.

1. Videos Upload Dataset

The first component is the dataset in which all original videos are uploaded.

You can add a trigger that will automatically begin the pipeline for every uploaded video in the dataset, simplifying your process.

2. Video Trimming by Frame

The second component of the pipeline is the FaaS function which splits the large video by frames or seconds. More about FaaS here.
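A minimal sketch of the trimming logic such a function might implement, given a video’s total frame count and a desired clip length (the function name and parameters are illustrative, not the actual FaaS code):

```python
# Hedged sketch of frame-based trimming: compute the (start, end) frame
# ranges for each clip a large video should be split into.

def trim_segments(total_frames, frames_per_clip):
    segments = []
    start = 0
    while start < total_frames:
        end = min(start + frames_per_clip, total_frames)
        segments.append((start, end))  # end frame is exclusive
        start = end
    return segments

# A 100-frame video split into 30-frame clips:
print(trim_segments(100, 30))  # [(0, 30), (30, 60), (60, 90), (90, 100)]
```

Note the last clip is simply shorter; splitting by seconds works the same way with timestamps instead of frame indices.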

3. Trimming Results Report

The third component of the pipeline generates a report that assists with monitoring.

The report compares the expected FPS with the actual FPS and determines whether the trim was successful: the pipeline continues only if the FPS of the trimmed videos matches the FPS of the original video.
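The check behind that report can be sketched as a simple comparison (illustrative code, with a small tolerance for floating-point FPS values):

```python
# Illustrative validation gate: continue the pipeline only when every
# trimmed clip kept the original video's FPS (within a small tolerance).

def trim_successful(original_fps, clip_fps_list, tolerance=0.01):
    return all(abs(fps - original_fps) <= tolerance for fps in clip_fps_list)

assert trim_successful(25.0, [25.0, 25.0, 25.0])      # proceed to tasks
assert not trim_successful(25.0, [25.0, 24.0, 25.0])  # halt and re-trim
```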

4. Video Task

The fourth component of the pipeline sends the created videos automatically as a task to all or selected annotators in the project. Once the annotators mark the item as completed it will move onto the next component.

5. Video Annotation Stitching

The fifth and final component of the pipeline exports the annotations from the trimmed videos and stitches them into a single Dataloop JSON annotation file, as if they came from one large video. This JSON file can later be converted to any format your model expects.
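The core of the stitching step is shifting each clip’s annotation frames back by the clip’s start offset so they line up on the original video’s timeline. A hedged sketch, using simplified annotation dicts as stand-ins for Dataloop’s JSON format:

```python
# Illustrative stitching: re-offset each trimmed clip's annotations onto
# the original video's frame numbering, then merge into one list.

def stitch_annotations(clips):
    """clips: list of (start_frame, annotations) per trimmed video."""
    stitched = []
    for start_frame, annotations in clips:
        for ann in annotations:
            stitched.append({**ann, "frame": ann["frame"] + start_frame})
    return sorted(stitched, key=lambda a: a["frame"])

clips = [
    (0,  [{"label": "car", "frame": 5}]),
    (30, [{"label": "car", "frame": 2}]),  # frame 2 of the second clip
]
print(stitch_annotations(clips))  # frames 5 and 32 on the original timeline
```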

Using this process, you can quickly export annotations without manually splitting each video and stitching the annotations yourself. It also makes tasks easier to track and helps annotators achieve better results by focusing on smaller chunks of video at a time.

Pipeline #3: Auto-Annotation

Let’s imagine you’re developing your model and it isn’t reaching the desired confidence level; you’ll then need to train it with more annotated data. Annotating another batch of data from scratch is really time-consuming. This is where my favorite pipeline, auto-annotation, comes in handy and helps optimize the process.

Let’s take a look at the components:

1. Auto Annotation Dataset

The first component is the dataset in which all of the original data is uploaded.

You can add a trigger that will automatically begin the pipeline for every item uploaded to the dataset, simplifying your process.

2. Update Metadata

The second component of the pipeline is a function that updates an item’s metadata so you can easily track the data across the pipeline. This step is optional: you can assign any metadata you want, or skip it entirely.
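As a sketch of what that function does, here is the metadata update in plain Python. In Dataloop, user-defined fields conventionally live under a "user" key in the item’s metadata; the field name below is a hypothetical example:

```python
# Illustrative metadata tagging for pipeline tracking (the real step runs
# as a FaaS node updating the item through the Dataloop SDK).

def tag_pipeline_stage(item_metadata, stage):
    user_meta = item_metadata.setdefault("user", {})  # custom fields go here
    user_meta["pipeline_stage"] = stage               # e.g. tracking marker
    return item_metadata

meta = tag_pipeline_stage({}, "auto_annotation_started")
print(meta)  # {'user': {'pipeline_stage': 'auto_annotation_started'}}
```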

3. Detect Function

The third component of the pipeline is where your model annotates the data.

We use a FaaS component to execute the function. More about FaaS here.

In this example, we detect human faces in images using a Caffe model.

More explanation about the full code can be found on the following documentation page.


4. Validate Result

The fourth component of the pipeline automatically sends the tagged data as a task to the project’s annotators. Each annotator labels the model’s annotations as either “good” or “bad” to validate its results.


5. Add Metadata

The fifth component of the pipeline is a function that updates the item’s metadata to mark that it has finished the pipeline.

5.5 Filter

The filter component of the pipeline makes sure that only the relevant items move to the final dataset. We create the filter using DQL, the Dataloop Query Language; more about DQL here.

Here, it filters for annotated items that carry a “bad” label: these are the items relevant for retraining the model.
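The semantics of that DQL filter can be mirrored in plain Python (the actual node uses a DQL JSON query; the dict shapes below are simplified stand-ins):

```python
# Plain-Python stand-in for the DQL filter: keep only items whose
# annotations include a "bad" label, since those feed the retraining loop.

def needs_retraining(item):
    return any(ann["label"] == "bad" for ann in item.get("annotations", []))

items = [
    {"id": "img1", "annotations": [{"label": "good"}]},
    {"id": "img2", "annotations": [{"label": "bad"}]},
]
flagged = [item["id"] for item in items if needs_retraining(item)]
print(flagged)  # ['img2']
```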


Confidence level filtering option

Another filter that can be added is annotation confidence level filtering. Added before the human task (step 4), it makes it easier for annotators to focus on relevant data. In this flow, items that don’t pass the confidence filter go to a dataset for corrupted data, letting you easily identify which data produced bad values from the model.
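A hedged sketch of that routing decision, with an illustrative threshold value:

```python
# Illustrative confidence filter: route low-confidence detections to a
# "corrupted" dataset instead of the annotators' task queue.

def route_by_confidence(items, threshold=0.5):
    to_task, corrupted = [], []
    for item in items:
        if item["confidence"] >= threshold:
            to_task.append(item)    # worth a human's validation time
        else:
            corrupted.append(item)  # inspect separately for bad inputs
    return to_task, corrupted

to_task, corrupted = route_by_confidence(
    [{"id": "a", "confidence": 0.9}, {"id": "b", "confidence": 0.2}]
)
```

The right threshold depends on your model; too high and annotators see little data, too low and they waste time on noise.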

6. “Bad” Labeled Data

The sixth and final component of the pipeline is the dataset that collects all of the “bad” labeled data in one place. You can then look for a connection between the faulty annotations. For example, they could all come from a single device, so a quick filter would be by custom user metadata such as device type.

In this example, you’ll get items with a JPG extension whose device type is iOS.
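That combined filter, sketched in plain Python over simplified item dicts (the metadata field names are illustrative):

```python
# Illustrative filter matching the example: JPG items whose custom user
# metadata says the capture device was iOS.

def from_suspect_device(item, extension=".jpg", device_type="ios"):
    name_matches = item["filename"].lower().endswith(extension)
    device = item.get("metadata", {}).get("user", {}).get("device_type", "")
    return name_matches and device.lower() == device_type

item = {"filename": "IMG_0042.JPG",
        "metadata": {"user": {"device_type": "iOS"}}}
print(from_suspect_device(item))  # True
```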


By reusing existing elements and creating a loop, you can deliver quality annotated data to your model. After implementing this pipeline, all you’ll need to do is add new data to your selected dataset; the pipeline will automatically pick it up, create the task, and feed the relevant results back to the model.

Summed Up

Using data pipelines for your computer vision project can reduce a lot of the friction involved.

It can automate, clean, and customize your process to help you focus on improving your model. When you replace manual work with automation, your data costs go down. Isn’t that what AI is all about?

If you’d like to discuss how we can help you build a customized pipeline for your project, set up a time that works for you.
