3 Annotation Tips to Reduce QA

3 Annotation Tips To Improve Data QA Processes

As an annotator, you’re expected to follow a data labeling workflow, which may differ from company to company. I’d like to share some insights and lessons I’ve gathered during my time as a Data Operations Specialist at Dataloop, closely observing and guiding data labeling vendors on how to use our platform.

A common workflow for a data labeling project owner is to first define the labeling instructions (recipe & ontology), and then assign annotators from the team to label the data according to the given instructions.

The next natural step is for the annotation team to check data quality by reviewing and correcting erroneous annotations. Finally, the data is submitted back to the project owner for their review.

Common Annotation Workflow

Define labeling instructions –> Annotate the data –> QA the data –> Submit the data

This workflow creates a common problem: the QA step typically consumes up to 40% of the annotation time. It’s almost as if you’re annotating the data a second time, which is inefficient and wasteful.

So how can you decrease the duration of the QA step? First, you’ll need to change the workflow.

Changing the above workflow will simplify the instructions and ensure your annotators have clear annotation instructions from the get go.

Basically, annotating an object is simple. The challenge is in how to relay clearly to the annotator what object to annotate.

Let’s review an annotation workflow example

In the below image, the labeling instructions for the project are: classify flowers in the dataset you’re labeling.

This image contains flowers, but they aren’t real. This presents a conflict for the annotators – should they or shouldn’t they label the flowers?

I can share that the original purpose of this project was to annotate real flowers but not to annotate synthetic flowers. Therefore, labeling the above image as “flowers” would be incorrect.

If an annotator’s dataset includes a large share of synthetic flowers but the instructions don’t say which flowers to include in the classification (synthetic or real), the annotator won’t know which items to exclude. This creates a bottleneck during the QA process, because irrelevant items end up labeled as a result of the unclear instructions.

All this could be avoided if the instructions are accurate and detailed.

I recommend the following instructions for improving the annotation flow and reducing time spent on QA:

Annotate [classify] real flowers. Do not annotate [classify] synthetic flowers.

This leads me to a number of lessons I’ve learned about reducing the time spent on QA.

Lessons learned to reduce annotation QA

Lesson #1: Improve the Instruction Flow

Precise instructions decrease ambiguity, which leads to less QA time. They also achieve two further goals:

  • Generating quality data
  • Improving labeling time and efficiency

Lesson #2: Consider Demographic & Geo Differences

Location and cultural gaps have a direct influence on annotations.
Because annotators are located all over the world, many of them in remote locations with their own cultural nuances, they may understand specific project guidelines differently.

It’s only natural that annotators face language gaps if they are not native to the project’s origin.

Furthermore, you can’t expect a person to read 30 pages of instructions and stay attentive to every detail.

This human limitation could be easily solved by including a few training steps that will ultimately improve the QA process.

Training Steps

For effective training, I recommend using the following workflow:

  1. The project manager creates a new dataset with a small sample of data and annotates it properly according to the given instructions. This dataset serves as the ground truth for the training project.
  2. Once the data has been properly annotated, the annotation team can use Dataloop’s golden training tool. The tool enables the annotation manager to duplicate the training dataset for each annotation team member.
  3. The annotators complete their training task according to the instructions provided.
  4. Once all annotators have finished their tasks, the golden tool automatically scores each of them by comparing their completed tasks to the ground truth.
  5. The annotation manager then reviews the data manually to understand what scores were given and why, and how to correct the instructions so the same mistakes don’t recur. At this point you can determine whether the team is ready to move forward or whether more training is required.
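To make the scoring step concrete, here is a minimal sketch of how a classification training task could be compared to the ground truth. It is not Dataloop’s golden training tool or its API; the function names, item IDs, and labels are all made up for illustration, and the score is assumed to be simple agreement (the share of items labeled the same as the ground truth).

```python
from collections import Counter

def score_annotator(ground_truth, submitted):
    """Fraction of ground-truth items the annotator labeled identically.

    Both arguments are dicts mapping a (hypothetical) item ID to a label.
    Items the annotator skipped count as wrong.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1 for item_id, label in ground_truth.items()
        if submitted.get(item_id) == label
    )
    return correct / len(ground_truth)

def confusions(ground_truth, submitted):
    """Count (expected, given) label pairs where the annotator disagreed."""
    return Counter(
        (label, submitted.get(item_id))
        for item_id, label in ground_truth.items()
        if submitted.get(item_id) != label
    )

# Illustrative data only: four images from the flower example.
ground_truth = {
    "img1": "flower",
    "img2": "not_flower",  # synthetic flower, must not be classified
    "img3": "flower",
    "img4": "not_flower",
}
annotator_a = {
    "img1": "flower",
    "img2": "flower",      # synthetic flower labeled by mistake
    "img3": "flower",
    "img4": "not_flower",
}

score = score_annotator(ground_truth, annotator_a)
mistakes = confusions(ground_truth, annotator_a)
print(f"score: {score:.2f}")
print(mistakes)
```

The confusion counts are what the annotation manager would look at in the manual review: if many annotators label synthetic flowers as flowers, the problem is more likely in the instructions than in the annotators.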

Lesson #3: Live QA

Although training helps a lot, some edge cases that aren’t covered in the instructions will still arise in the annotation process. This is where live QA comes in.

Live QA

The goal of live QA is to catch errors immediately after they are made, thus “nipping them in the bud.” Rather than reviewing and fixing errors after the annotation process is completed, you fix them at the beginning of the annotation process so they are not repeated.
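One way to picture live QA is as a rule that routes annotations to a reviewer while the work is still in progress. The sketch below is only an assumption about how such a rule might look: it reviews every one of an annotator’s first few items, then spot-checks at a fixed interval. The function name and the `warmup` and `every_nth` values are hypothetical, not something prescribed by a particular tool.

```python
def needs_live_review(item_index, warmup=20, every_nth=10):
    """Decide whether an annotator's item goes to a reviewer right away.

    item_index is the 0-based position of the item in the annotator's work.
    The first `warmup` items are always reviewed, so mistakes surface early;
    after that, one item in every `every_nth` is spot-checked.
    """
    if item_index < warmup:
        return True
    return (item_index - warmup) % every_nth == 0

# Which of an annotator's first 50 items would be reviewed?
reviewed = [i for i in range(50) if needs_live_review(i)]
print(len(reviewed), "of 50 items reviewed")
```

Front-loading the review this way matches the idea of fixing errors at the beginning of the annotation process: the reviewer sees far fewer items than in a full second pass, but sees them while a misunderstanding can still be corrected.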

Summed Up

Once you implement these workflow steps in your annotation projects, you’ll feel the difference right away. You’ll not only increase your productivity and throughput but also reduce costs and, above all, improve your data quality!

Recommended Annotation Workflow

Simplified instructions –> Training –> Annotation –> Live QA –> Submit the data

Bonus tip –
If you’re ready to implement this optimized workflow, be sure to check out “All You Need to Know About Labeling Instructions.”
