The Ultimate Guide to Data Recipes

If you’ve ever cooked before, you know how important it is to follow a detailed recipe with clearly set-out instructions for preparing a particular dish.

However, today we’re talking about a different type of recipe: the data recipe. Dataloop’s recipe builder gives data scientists the flexibility to create the ontology they need to support a data-centric approach to managing data annotation tasks. Let’s take a closer look at what this means, starting with the basics.

What is a Data Recipe?

A recipe is composed of two main components: an ontology and instructions (the data recipe). Let’s start with the data recipe.

The data recipe is the human-readable set of instructions that explains to the data annotator how to annotate the data. The more expert your AI becomes, the more critical your recipe becomes.

A recipe has 3 main roles:

  1. To create a frictionless labeling instruction medium.
  2. To enable automatic UI simplifications and adjustments.
  3. To enable automatic quality checking.

In the AI knowledge flow, the recipe translates instructions between human natural language and AI language, and the data annotator is the translator.

The recipe is a prerequisite to data quality.  

The recipe contains an instructions list that data annotators use to label the selected item properly. It includes enough examples, rules, and explanations to allow data annotators to label data as if they were experts in the items themselves.

AI narrowness plays an important role in the recipe as well. Narrow AI means a narrow transfer of information from top field experts all the way down to machine learning models. That narrowness is what makes small errors tolerable in this communication flow, both in the human-to-human communication (the recipe) and the human-to-machine communication (the annotations).

A good recipe allows a smooth and fast training process for data annotators, resulting in accurate, high-quality data.

The Process of Teaching Humans to Scale  

Let’s take agronomy as an example. Domain experts such as agronomists are hard to find, have very limited time, and are usually very expensive. Meanwhile, AI development needs a constant supply of labeled data, produced by dozens to thousands of people. Introducing a recipe with a clear and accurate set of instructions enables fast and accurate knowledge transfer between humans with different levels of expertise. This leads to much faster labeling and lowers the cost of labeled data.

Since domain experts are expensive, every AI activity needs its domain expert to distill their expertise into a simple list of instructions. This lets you scale the labeling process with many more humans, keeping the top talent in charge of edge cases, quality monitoring, and complex cases.

Teams developing AI with large amounts of labeled data need to teach annotators the required labeling skills before they actually start labeling. This is not always possible, which is where the 5-minute benchmark test becomes relevant:

The “5-minute” Test:

“If a labeling task cannot be taught by an expert to a non-expert within 5 minutes of face-to-face training (which covers 80% of the samples), then scaling the task to non-experts will be difficult (slow & expensive).”  
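This heuristic can be expressed as a simple check. The sketch below is illustrative only, using the two thresholds quoted above (5 minutes of training, 80% sample coverage); the function name and signature are hypothetical.

```python
def passes_five_minute_test(training_minutes: float, sample_coverage: float) -> bool:
    """Heuristic from the rule above: a labeling task scales to non-experts
    if an expert can teach it in at most 5 minutes of face-to-face training,
    and that training covers at least 80% of the samples."""
    return training_minutes <= 5 and sample_coverage >= 0.80

# A task taught in 4 minutes that covers 85% of samples is scalable.
print(passes_five_minute_test(4, 0.85))   # True
# A task needing 10 minutes of training is too complex to scale cheaply.
print(passes_five_minute_test(10, 0.95))  # False
```

If a task fails the check, the next section’s advice applies: split it into smaller, well-defined subtasks until each one passes.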

If the labeling task is more complicated than that, it is recommended to break it into several subtasks, each one well defined and easy to communicate.

Creating a human annotator training process requires us to:  

  1. Implement a training mechanism: teach humans the required skills.
  2. Run a qualification process: test which annotators are qualified to do the job.
  3. Implement scoring: ongoing scoring of each annotator’s results.
  4. Provide references: while labeling, keep a library of “what if/how-to” examples gathered and available for annotators to consult.
  5. Give feedback: let experts raise issues or questions about poor labeling or edge cases with data annotators, and vice versa.
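The five steps above can be sketched as a minimal training-program skeleton. This is a generic illustration, not Dataloop’s API; every class, field, and threshold here is a hypothetical assumption.

```python
from dataclasses import dataclass, field
from statistics import mean

QUALIFICATION_THRESHOLD = 0.9  # hypothetical pass mark on a benchmark task

@dataclass
class Annotator:
    name: str
    qualified: bool = False
    scores: list = field(default_factory=list)

@dataclass
class LabelingProgram:
    # Step 4: "what if / how-to" reference examples available while labeling.
    reference_library: dict = field(default_factory=dict)
    # Step 5: two-way feedback channel between experts and annotators.
    feedback_queue: list = field(default_factory=list)

    def qualify(self, annotator: Annotator, benchmark_score: float) -> bool:
        # Step 2: only annotators who pass the benchmark label production data.
        annotator.qualified = benchmark_score >= QUALIFICATION_THRESHOLD
        return annotator.qualified

    def record_score(self, annotator: Annotator, score: float) -> None:
        # Step 3: ongoing scoring of results per annotator.
        annotator.scores.append(score)

    def lookup_reference(self, case: str) -> str:
        # Step 4: consult the reference library; unknown cases escalate.
        return self.reference_library.get(case, "escalate to an expert")

    def raise_issue(self, message: str) -> None:
        # Step 5: log an issue on poor labeling or an edge case.
        self.feedback_queue.append(message)

program = LabelingProgram(
    reference_library={"occluded object": "label only the visible part"}
)
ann = Annotator("alice")          # Step 1 (training itself) happens off-line
program.qualify(ann, benchmark_score=0.93)
program.record_score(ann, 0.88)
print(ann.qualified, round(mean(ann.scores), 2))
```

In a real platform these steps would be backed by task queues and review workflows; the point is only that each of the five roles maps to a concrete mechanism.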

It is important to note that during the initial model development stage, the recipe instructions often change and update frequently. This happens as we continue to add more data and see the modeling results. Once the labeling instructions (and, as a result, the model architecture) stabilize, we can start scaling.

With this process in place, our labeling workforce is ready with recipes that are properly defined.

Ontology

The ontology is the part of the recipe that contains the labels and attributes. It is the building block of your model and helps define the object detection your trained model provides; in its basic form it is a label map, but it comes with more powerful capabilities. Labels (like classes) are the signs you use to classify your annotations.

A well-defined label hierarchy enables annotators to accurately classify annotations based on logical structures. Dataloop provides advanced tools for label-based searches and filters at the item and annotation level.

Attributes allow additional independent degrees of freedom while building a world definition.

While it’s your job to define attributes according to your business needs, at Dataloop we typically look at the definition as the “answer sentence” to the following model question:

<subject (label)> <verb (attribute)> <adjective/noun (attribute/label)>
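To make the pattern concrete, here is a minimal sketch of an ontology with a label hierarchy and independent attributes. This is not Dataloop’s SDK; the classes and example values are hypothetical, chosen to answer a sentence like “the car (label) is moving (attribute) and red (attribute)”.

```python
from dataclasses import dataclass, field

@dataclass
class Label:
    name: str
    # Hierarchy: child labels nest under a parent, e.g. car/truck under vehicle.
    children: list = field(default_factory=list)

@dataclass
class Ontology:
    labels: list = field(default_factory=list)
    # Attributes are independent degrees of freedom, separate from labels.
    attributes: dict = field(default_factory=dict)

ontology = Ontology(
    labels=[Label("vehicle", children=[Label("car"), Label("truck")])],
    attributes={
        "state": ["moving", "parked"],   # <verb (attribute)>
        "color": ["red", "blue", "white"],  # <adjective (attribute)>
    },
)

# An annotation then combines one label with attribute values:
#   <subject (label)> <verb (attribute)> <adjective (attribute)>
annotation = {
    "label": "vehicle.car",
    "attributes": {"state": "moving", "color": "red"},
}
print(annotation["label"], annotation["attributes"])
```

Because attributes are independent of the label tree, adding a new color or state never requires restructuring the hierarchy, which is what keeps the "degrees of freedom" separate.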

Summed Up

With Dataloop, it’s easy to manage your data recipe versions and ontologies, letting you follow a clear, concise, and well-rounded set of instructions. A good recipe gives you structure and allows your data annotators to train and produce accurate, high-quality data. It also limits human error while reducing the average annotation time.

If you’d like to learn more about Dataloop and speak to one of our representatives, you can set up a 1:1 personalized demo here. 
