Chapter 9

AI Project Management

We are preparing to launch our AI app. We have basic, functional models; we have agreed with the pilot customer on a calibration period that allows our models to adjust to the fresh data; and the data interfaces (APIs) with customers have been defined. In this chapter we will dive into the preparation and planning needed for launching and scaling our app deployment.

We now understand that the pace of our AI development is set by the speed of our data processing, done by both humans and machines. It is time to plan our AI launch: the budget, resources, timeline, and deliverables.

Technically, we plan a data feedback loop with our customer and our data labelers, with the purpose of kick-starting the data flywheel effect.

Waterfall or Agile?

Software 2.0 (a term coined by Andrej Karpathy, Director of AI at Tesla) tries to capture the big transformation required of organizations when moving from the development of traditional software-based systems to data-driven systems.

Here is a toolchain comparison between the two:

Figure 57 SW2.0 stack, investor’s deck


But tools are only part of the story. The program management of AI, meaning the methodologies and processes required for productizing Software 2.0 apps, is also very different.

After 60 years, the software industry still struggles with estimating product cost and time to market. On top of that, AI introduces a new, third factor to plan: product accuracy (will it work?). We expect our software products to work 100% of the time; non-guaranteed functionality is a new challenge for delivering products, a challenge that every AI product must deal with once it faces users.

Most of the software industry has moved its development methodology from Waterfall to Agile, which means fast iterations of short development cycles (Agile) instead of a single long plan and development effort spanning years (Waterfall).


Projects developed using Agile methodologies achieved about 2X higher success rates and much lower failure rates:

Figure 58 Agile vs Waterfall project success rates, Standish Group Chaos Studies


After 60 years of Software 1.0, more than half of the projects still fail to meet deadlines and budget. Software 2.0, or AI model development, is not even close to that level of predictability, which one might argue is not very impressive in itself. Most AI today is developed more like Waterfall than Agile. This will change as the AI industry and its tools mature (Agile requires more sophisticated management flows and tools to support it).

AI projects are born in Waterfall-like planning and transition to Agile planning once the data flows and modeling architecture become stable and the project is no longer in experimentation mode.

Just as Agile moved the software world from a sequential process to a process driven by user feedback, our cat analytics planning should transition to Agile AI: fast iterations of data cycles, resulting in a fast reduction of modeling bugs (bias and variance). AI project management based on the Waterfall paradigm does not work, although all AI projects are born in Waterfall planning.

Let us review how the different development phases change when moving from traditional software to AI, or SW 2.0, products:

  • Requirements
  • Design
  • Implementation
  • Verification & Maintenance



The cost estimation of cloud expenses and of the data collection itself is not in the scope of this book and is relatively easy to estimate in most cases. Still, these should not be ignored in your plans, as they add significant cost and time factors.




According to the Project Management Institute, 49% of traditional development projects experienced scope creep, that is, the scope of the project was not clearly defined and controlled.

In traditional software product development, the requirements phase is where the product manager will usually talk to customers, users, business units and experts to set the requirements document for the application. This document will describe the value proposition, functionality, interfaces, and usage of the application. This is an important phase where all parties agree on the product landing zone:

  • Constraints (e.g., analysis time, hardware to be used)
  • High level technologies to be used.
  • Delivered value to the users.
  • Budget
  • Scope that matches the expected delivery timelines.
  • Feasibility, i.e., the product can be built and can deliver the promised value.


The additional challenges AI systems bring to this process:

  • Technologies are rapidly evolving; whatever you decide today is likely to be irrelevant 6 months from now.
  • Users assume everything just works (100%), while AI tends to work partially (poorly early on, and almost never 100%). Partial functionality and a good user experience don’t mix well.
  • AI development requires a lot of experimentation, which is very hard to plan. In addition, once a stable algorithm is found, scaling it requires a lot of data, and there are no tools or methodologies to properly estimate the operational cost of that data.
  • AI time to market is very long today, so long that no management will accept a 5-year plan, while reality shows these timelines are typical. The ROI of AI systems is negative in most cases as of today; AI is an expensive game to play.
  • Feasibility estimations have low confidence. In many cases there is an assumption that AI will just work, while it might take a few generations of the product to reach satisfying quality.


If we take our cat surveillance system as an example, its future path must be clearly defined early on:

  • How do we identify neutered cats? How critical is this capability to our future success?
  • What happens if we fail to detect the different cat types?
  • What will the lighting and weather conditions be (should it work at night, in rain, fog, or snow)?


After talking to our first customer, we learn:

  • System should work in all weather conditions.
  • Detecting cat types is a great feature, but it’s ok for version 2 of the product.
  • The neutered cat feature is very important to the customer, who is willing to pay a premium price for this analytics feature, but it is not critical for initial deployment.
  • According to a cat expert, these are the characteristics of neutered cats:
    • May have clipped ears
    • Tend to have less belly hair
    • May have scars
    • Males will have their testicles missing


Our AI expert tells us that detecting neutered cats will be challenging and is expected to have stability issues, since the strength of the mentioned signals is very weak. We decide to drop the feature from our plan due to low return on investment.

The Super Product Disaster

During the traditional software product requirements phase there is a tendency to define the super product, a product that contains every feature and technology we can think of. In traditional software development the result is a mess of features, timelines and technologies that leads to the planning issues we have seen previously. In AI products this approach is deadly for the product launch. When building an AI product, minimizing the requirements scope is the most important thing you should do, just as we postponed cat type detection to later versions and dropped neutered cat detection from the list, even though customers are willing to pay a premium for it.

Defining our application MVP (minimum viable product) is critical and will have a lot of impact on our chances of success. This was true in SW 1.0 and is even more important for SW 2.0 applications. A fast MVP will allow you to collect data and kick-start the data flywheel.

Cat Analytics MVP

It seems like a big challenge, which we wish to split into several smaller challenges. Talking to our VP of sales, we learn that just detecting cats and counting them over time is enough to get initial customers (and data); the rest of the product capabilities can be released later and priced separately.

We also agree that 85% initial accuracy is good enough for the pilot.

The above is both typical and very important. An AI product needs its development and sales teams to communicate closely. Identifying the “must haves” of your solution is critical, and the SW 1.0 approach of an endless requirements list on day 0 will put the entire product at high risk of failure.

Remember, sales needs a product to close a deal, the product needs the customer to get data, and data is needed for developing the product. The only way to break the cycle is through joint work of the teams, working together on minimizing the scope and enabling an easier startup of the data flywheel effect.

AI task type

The first thing we define is the task we expect our bot to do. Since we are dealing with computer vision, here are the common task types:

Figure 59 cat annotations,

Classification – Classify the entire image with a label (does the image contain a cat?)

Bounding box – Mark the area of every cat object with a bounding box.

Pixel mask – Each pixel is labeled as part of a cat or not.

How do we choose?

There are two basic guiding rules:

  • The more accurate the marking of the signal, the better the noise filtering. Pixel segmentation has the ultimate precision: zero noise pixels are added to the annotation.
  • The more accurate the marking, the more expensive the data, as well as the compute needed to train and use the model.

We decide to go with a bounding box. It presents a good balance between signal capturing and labeling cost, and will allow us to detect and count cats properly.


There are other tasks available in computer vision; the above are the more common examples.

The same applies to any signal analysis and extraction task, such as the common tasks in text processing (Natural Language Processing): sentiment analysis, entity recognition, translation, and semantic search.

Defining Accuracy

Notice how easily I have used the term accuracy so far, as if it were well defined and agreed upon. It is not. Even the selection of the model scoring method is a process in itself, usually adjusted to the use case (IUX, Intelligence User Experience). How we define our accuracy (i.e., what is considered good) is critical for our success and impacts both the internal communication inside our company and our external communication with our customers:

  • Our developers need to agree on a goal for the improvements and models.
  • Our program manager needs a way to track progress toward the goal.
  • Our sales team needs to agree on a success metric with our customer.
  • Our support team needs to understand the severity of issues and communicate what the expected behavior is.

As mentioned before, setting a good accuracy measurement is a task in itself. We will cover only its fundamentals; after all, our MVP is simple.

Let us start with the two most basic components of our accuracy metric: precision and recall.

Given an image containing cats, we define the following cat counts:

True-Positives, Correct predictions – The cats (signals) which we identified correctly.

True-Negatives, correct ignores – The background (noise) which we filtered correctly (not a cat).

False-Positives, Wrong predictions – Bad detections, for example when we see a chair as a cat.

False-Negatives, Wrong ignores – The missed signals of our model, i.e., the cats we missed in the image.

With the above standard definitions, we can better define what accuracy means.

Precision – correctness

Precision is defined as follows:
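Using the counts above, with TP for True-Positives and FP for False-Positives (shorthand introduced here), the standard definition of precision can be written as:

```latex
\text{Precision} = \frac{TP}{TP + FP}
```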


Precision tends to be what most people call accuracy, but it is not enough. Let us look at an example:

Let us assume our model detects 10% of all the cats and is 99% correct when it decides something is a cat, and that we have images of 1000 cats.


So, precision is 99%, yet we miss most of the cats (900) since our model detects only 10% of them (100). Precision only measures the correctness of our detections, not their completeness. That is why we must look at at least one more accuracy factor, one that shows us the completeness of our system: recall.

Figure 60 Precision & Recall, Walber, Wikipedia


Recall – completeness

Recall is defined as follows:
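With TP for True-Positives and FN for False-Negatives (shorthand introduced here), the standard definition of recall can be written as:

```latex
\text{Recall} = \frac{TP}{TP + FN}
```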


How does our recall look in the previous example?


As you can see, recall exposes the incompleteness problem of our model.
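To make the example concrete, here is a minimal sketch using the standard precision and recall definitions. The counts are our assumption, chosen to match the example: 100 cats found out of 1000, 900 missed, and a single wrong detection so that precision lands at about 99%. The function names are illustrative only.

```python
def precision(tp: int, fp: int) -> float:
    """Correctness: the share of our detections that are real cats."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Completeness: the share of real cats that we detected."""
    return tp / (tp + fn)


# Assumed counts matching the example in the text.
total_cats = 1000
true_positives = 100                            # we find 10% of the cats
false_negatives = total_cats - true_positives   # the 900 cats we miss
false_positives = 1                             # about 1 wrong call per 100 correct ones

p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)
print(f"precision = {p:.1%}, recall = {r:.1%}")
# precision = 99.0%, recall = 10.0%
```

A model can look excellent on one number and poor on the other, which is why both are tracked.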

Our modeling has to balance precision and recall. This balance reflects our business needs for correctness versus completeness of our results at any given point, and it translates into the bias-variance tradeoff that machine learning experts optimize for.

What is the Correct Prediction?

As with accuracy, deciding when a single inference output is correct is not as trivial as it might seem and has its own nuances.

The definition of a good detection depends on our goals and needs. Which one of the following would you consider a correct detection of the cat?

Figure 61 White walking cat, Annotations: Eran Shlomo


Selecting our evaluation metric – Jaccard index (IoU)

Given a task, there is usually an “off the shelf” evaluation metric to use, yet eventually the evaluation should reflect the user experience of the product and is expected to evolve as we gain a better understanding of usage. We select the Jaccard index with an IoU threshold of 50%.

The Jaccard index, also known as IoU (Intersection over Union), is a popular metric for measuring object detection correctness in computer vision tasks.

IoU is defined per object as follows:
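In standard notation, taking A as the predicted region and B as the human-annotated region of the same object (symbols chosen here for illustration), the Jaccard index is:

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
```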


Which is a fancy math definition for this:

Figure 62 IoU, Wikimedia, Adrian Rosebrock


Here are some IoU examples:

Figure 63 IoU examples, Wikimedia, Adrian Rosebrock


With the following definitions we move on:

  • The IoU threshold for a good cat detection is 50%, which is solid for counting cats, our MVP.
  • 50% IoU is good for counting but might be an issue when classifying cat types; we will handle this later if issues pop up.
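A minimal sketch of how the 50% IoU rule can be checked for two bounding boxes. The box format (x_min, y_min, x_max, y_max), the function names and the sample boxes are assumptions made for illustration, not part of any specific library.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

IOU_THRESHOLD = 0.5  # the 50% threshold chosen for our MVP


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_good_detection(predicted: Box, labeled: Box,
                      threshold: float = IOU_THRESHOLD) -> bool:
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(predicted, labeled) >= threshold


labeled_cat = (100, 100, 200, 200)
slightly_off = (120, 100, 220, 200)   # shifted 20 pixels to the right
far_off = (150, 150, 250, 250)        # overlaps only one corner

print(is_good_detection(slightly_off, labeled_cat))  # True
print(is_good_detection(far_off, labeled_cat))       # False
```

Raising the threshold makes the “correct” label stricter, which is the knob we may need to revisit once cat type classification enters the plan.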


ML Pointers

  • Read further on statistical hypothesis testing and p-values for better prediction quality definitions, especially if you are working on medical applications.
  • In many cases it is easier to communicate a single metric to all the teams involved, so creating a single metric from the different internal ones has value. Read more about scores as single-metric examples.



70% accuracy used to be considered pretty good with traditional computer vision, but the average user is very sensitive to mistakes. 85% accuracy is a good benchmark for initial achievable and acceptable quality in many cases, but in others, like self-driving cars or cancer screening, such performance would be a disaster. Luckily for us, since competition is still low, the customer approves 85% for launch if we improve to 95% within a few months.


Our First Data Pipeline

When we design our system, we want to keep it minimal but also allow room for future extensions.

A significant part of our design phase is setting the data pipeline architecture: how the data flows from our users to our models, how it is then structured according to the different skills and the ontological relations between them, and how we connect the human experts (data labelers) to create a continuous learning machine.

We come up with the following data pipeline design:


Notice the data loop. This is the basic continuous learning data loop, allowing our model to improve as it learns about the world through labeled data, bringing our data flywheel to life.


As you can see, it is a pretty straightforward design, allowing us to continuously improve our perception layer model through a basic data loop. This is our Semi-Artificial skill; all cognitive skills are semi-artificial, although in most cases the human labeling part is not visible to the end users.

Technology Stack

Once we have agreed on the high-level architecture of our skill, we need to choose the technology stack we will use. When choosing a technology stack, we consider the following factors:

  • Constraints – Cloud or edge, available hardware, speed, power consumption (battery), internet connectivity, security, data availability, privacy, and regulation (laws).

  • Expertise – The tools our experts are familiar with and already in use.

Our constraints are pretty convenient for our system: internet connectivity is available, the customer is comfortable with a cloud service (API, Application Programming Interface), and there is no sensitivity to speed or power consumption. A simple cloud-hosted cat surveillance API is chosen.

The next decision we make is about our neural network framework. Two frameworks lead the market today: PyTorch and TensorFlow.

While we hear all around that PyTorch is lighter and easier to work with, we still select TensorFlow because of the rich community and the open-source code available; it is hard to resist something that works out of the box. It is worth mentioning that with the proper data pipelines and recipes we will be able to change this decision later, so the decision is reversible and the risk is low. We feel comfortable taking it.

Finally, we need to select or develop our neural network architecture (how the neurons are connected). Since we are very early on, we just take an existing network and tweak it a bit. After some conversations with experts, the YOLO architecture is taken as a base; we will modify it as needed as time goes by.

A summary of tech stack options:



  1. Modeling environments, frameworks and hardware are changing very rapidly, and this trend is expected to stay for the coming years as the industry and academia keep inventing better, cheaper, and more accurate frameworks, architectures, and hardware.
  2. We selected two different hardware processors: CPU for inference and GPU for training. As of today this is a common optimization for cutting processing costs, yet with rapid industry changes this optimization is probably short-lived.
  3. It is always a good idea to keep your AI architecture independent of specific tech stack choices. While it is not trivial, it will leave room for future drop-and-replace of the above elements as the industry evolves.

Verification and Maintenance

We already have verification built into our flow, with the humans constantly scoring our model performance. Since we want to move fast, we decide it is enough for now to use the accuracy measured by the customer as our single quality metric and alert criterion: if the customer sees 95% precision on their data, we are good to go. If not, we will manually dive into the details and fix. We keep things clean and simple as we start.

ML Pointer

Read further on data drift and model drift.

Task Difficulty Properties

AI development is a process of separating the meaningful things in our data using annotations, which represent the signal we extract from the noise. The noise will usually be the environment data, the parts of the example we don’t care about. It is critical to understand the properties of our target signal for proper planning: resources, cost, timeline, and deliverables.

I have tried for years to find a solid theory for making such a plan and could never find one. I choose not to ignore this part (as many other ML books do), and in this chapter I will give you some empirical tools for such a plan. Keep in mind that, as with all AI, reaching the conclusion through an explainable analytical formula is quite challenging; this is a tool based on our experience that will fit most cases.

The task difficulty is determined by 3 main properties:

  • Signal resolution – How many different target types we expect and their differences
  • Signal strength – How much of each data example is a signal, meaningful pattern
  • Noise level – The amount and variety of unrelated patterns the environment data may contain.

If we add our AI accuracy targets, we get to this simple formula to estimate the needed investment in data collection and labeling:

As a reminder, we described the AI development process as a communication channel between humans and machines, where humans encode messages (sentences) in the form of annotations and the AI model decodes the information carried in these messages during training, gradually generalizing the pieces of information into a smart machine, a machine that imitates the intelligence of its human teachers. The basic principle is simple: the more similar the signal is to the noise, the more data we need to properly separate them.

Before defining the annotation information properties, let us drill down to the messages we deliver over the data: the sentence, the annotation, the signal.

The signal is the answer (sentence) to the question we are asking. In the AI (deep learning) world the signal is the annotation. The annotation is what separates the information from the noise; this is how we let the bot “understand” which important patterns we are looking for out of the whole example.

The annotation (answer) will usually consist of two parts:

  • Label and attributes (the “what” part) – Signal Classification: the meaning.
  • Coordinates (the “where” part) – Signal Pattern: Location, position, shape, size.

For non-vision tasks we can generalize the coordinates and meaning in a much broader sense. In text/NLP/signal problems the coordinates can be character or word offsets, and in the audio world they might be track number, time offset and segment duration.
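The two-part structure of an annotation can be sketched as a small data structure. This is an illustrative assumption, not the format of any particular labeling tool; the class name, field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Annotation:
    """One answer given by a human labeler (hypothetical structure)."""
    label: str                          # the "what": signal classification
    coordinates: Tuple[float, ...]      # the "where": location, shape, size
    attributes: Dict[str, str] = field(default_factory=dict)  # extra "what" details


# Vision: a bounding box as (x_min, y_min, x_max, y_max) in pixels.
cat_box = Annotation(label="cat", coordinates=(120, 80, 265, 294),
                     attributes={"pose": "standing"})

# Text: character offsets (start, end) of the labeled span.
cat_name = Annotation(label="cat_name", coordinates=(10, 15))

# Audio: (track number, time offset in seconds, segment duration in seconds).
meow = Annotation(label="meow", coordinates=(1, 12.5, 0.8))

for annotation in (cat_box, cat_name, meow):
    print(annotation.label, annotation.coordinates)
```

Only the meaning of the coordinates changes between domains; the split into a “what” part and a “where” part stays the same.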

The neural network has two goals in detecting our signal:

  • Noise separation – Separate the signal from the noise, identify the cat in the image.
  • Signal classification – Classify specific signals out of many similar ones – What type of cat do we see?

Our data volumes will be impacted by both. Let us see challenging examples of signal separation and classification:

Where is the cat? Weak signal in noisy environment

Figure 53 Whataloadarubbish . . . by Milton Creek, cc-by-sa/2.0 – © Stefan Czapski, Edited by Eran Shlomo


What happens to our network when we classify this entire image as a cat? It will bring significant variance to our model. The problem is not with the classification (the image indeed contains a cat) but with the coordinates. We must create better isolation of the cat for our model using a bounding box or a segmentation mask.


Hard classification: Minor signal differences

Figure 54 Black panther, Jaguar, Black cat,


What happens to our network when we train it to classify a black jaguar as not a cat? A neural network that tries to identify simple cat/no cat images will either identify the jaguar as a cat (bias error) or be very poor at detecting cats (variance error), since we teach it to ignore cat features in the jaguar example. The problem is not with the network; it is with our ontology definition. A simple cat/no cat ontology will not work well for a system that needs to take care of both (a zoo analytics system?). We don’t expect jaguars in our system, so our ontology is still fine.


Signal Resolution - Ontology

So, we are looking for the needle in the haystack. But there are many different types and sizes of potential needles. Finding a big needle is much easier than finding a small one, and once we have found it, we are usually also interested in its label (category, cat type).

Figure 55 needles, Shutterstock


We are now defining a new concept, signal resolution: the number of categories we have for the label. The ontology, sometimes referred to as the label map, is the description of all the potential signal types and their relations. We will dive into this topic later, when discussing ontologies in more detail.

In simple terms, the more options we have for the cat (i.e., cat types we would like to detect), the higher the signal resolution. The higher the signal resolution, the harder it is to classify the signal (“the what”, the label) of a given target and the more data we need.

Signal Strength – Target Size

The size of the target we look for is the signal strength. The larger the signal, the easier it is for our neural network to find it. Here is an example with both weak and strong cat signals:

Figure 56 Phillip Perry / Cat crossing the road – Olliphant Street, W10 / CC BY-SA 2.0. Kitten,


The weaker the signal strength, the harder it is to find the location (the where, the coordinates) of a given target and the more data we need.


The Noise – Environment

The signal environment also matters. In other words, the haystack plays a significant role in the task of finding the needle.

The more diverse and complex the signal environment (background), the bigger the noise (non-relevant information) and the more data our network needs to learn how to separate the signal from the background.

There are two important factors of our noise:

  • Noise variance – How noisy is our environment?
  • SNR, Signal to Noise Ratio – How large is our signal relative to the environment? Smaller needles are harder to find in a big haystack.

Very low SNR and high variance example:

ML Pointer

The data volume question is essentially an SNR (signal-to-noise ratio) filtering problem, which suggests that information theory could be useful for understanding how data volumes behave.

The smaller the signal (needle) and the bigger the potential noise (haystack), the more data you need. In other words, for a given accuracy requirement, the needed data volume goes up as SNR goes down.

Data Plan

We spent some time defining success metrics, technology stack and requirements with the different teams: Management, Business, R&D and Operations.

Once we have the fundamentals in place, we can continue to our data budget plan, basically answering the question: how much labeled data is needed for 95% accuracy?

If you search for this question on the internet, you will find it seems impossible to answer: every time this question pops up you will see many vague answers and unclear estimations. No worries, we will get to an answer soon and, moreover, to the tools for updating your plans as the behavior of your own data is revealed.

While the answer is indeed extremely case-specific, there is still a way (and a requirement) to estimate volume, cost, and time for our plan. With sufficient guard bands, the plan should have reasonable confidence.

So how many data examples are needed for launching a model?

Well, that is the wrong question to ask. Do not think of the learning process as a batch process, in which we collect some number of examples, label, train and deploy a model. The learning process is a stream process: an endless stream of data, fed into our data pipelines, generating results, and learning from mistakes.

But we must start somewhere… All models start as a batch process (the startup batch), and the lucky ones that go into production will transition to the stream process. The transition to a stream process should take place upon reaching a stable ontology, a topic we will cover in more detail in the next chapter.

Rule #9: Machine learning starts with batch data processing and transitions to stream processing

So, we start with the batch process and the question remains, how much data should we plan for?

Here is a simple formula that works for most cases:


Let us go over this formula in detail.

Signal Resolution Estimation

The first meaningful parameter for our plan is the target detail, or the signal resolution. We will use our ontology size (number of categories and properties) as our signal resolution estimator; in simple terms, how many types of things we would like to recognize. Since we want to recognize 20 types of cats, our signal resolution is 20. If we wanted to identify 5 types of cats, our resolution would be 5.

ML Pointer

If you are adding additional properties to assist with modeling, like “partial cat”, “occluded cat”, “standing cat”, you are increasing the signal resolution. The resolution is set by the number of labels you feed the model with, not the ones you report out as results.

Signal Strength Factor Estimation

The second parameter is our signal strength, which is affected by two parameters of its own:

  • SNR – Signal-to-noise ratio, the size of the cat in the image
  • Noise variance – How dynamic our environment is


We can estimate both.


For the SNR, consider an image of 800×534 pixels with a bounding box of 145×214 pixels over our cat.
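As a minimal sketch, assuming SNR is taken as the bounding box area divided by the image area (consistent with describing SNR as the size of the cat in the image), the calculation for these numbers looks like this. The function and constant names are ours, and the 1% floor is the guideline discussed in this section.

```python
MIN_SUPPORTED_SNR = 0.01  # the 1% guideline for a single neural network


def snr(box_w: float, box_h: float, image_w: float, image_h: float) -> float:
    """Assumed SNR estimate: share of the image area covered by the target box."""
    return (box_w * box_h) / (image_w * image_h)


cat_snr = snr(box_w=145, box_h=214, image_w=800, image_h=534)
print(f"SNR = {cat_snr:.1%}")                      # SNR = 7.3%
print("supported:", cat_snr >= MIN_SUPPORTED_SNR)  # supported: True
```

The same function can be run over a sample of customer images to find the smallest cats we can promise to detect.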



Figure 64 “cat” by , annotated



The lower the SNR, the harder the object is to detect. SNR factor reference:



This means our SNR factor is 3000 and we should communicate to customers the minimal detection size we support.

You should avoid trying to resolve problems with SNR lower than 1% in a single neural network, although there are cases where the noise is very stable and it will still work reasonably well.

SNR of 1% looks like this:


We have communicated the SNR to our customer (we will not detect small or distant cats below 1% size). Let us continue with scoring the noise.

ML Pointer

SNR should be calculated based on the actual data size fed into the model. YOLO, our selected modeling architecture, accepts 416×416 images as inputs, so SNR is calculated on the resized image.
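How much the resize matters depends on how the image is brought to 416×416, which is a detail of your preprocessing and not something fixed by this chapter. The sketch below compares two common options under the assumption that SNR is the box-to-image area ratio: stretching each axis independently, and a uniform scale followed by padding (often called letterboxing). All function names are illustrative.

```python
def area_ratio(box_w: float, box_h: float, image_w: float, image_h: float) -> float:
    """Share of the image area covered by the bounding box."""
    return (box_w * box_h) / (image_w * image_h)


def snr_after_stretch(box_w, box_h, image_w, image_h, target=416):
    """Resize by stretching each axis independently to target x target."""
    scale_x = target / image_w
    scale_y = target / image_h
    return area_ratio(box_w * scale_x, box_h * scale_y, target, target)


def snr_after_letterbox(box_w, box_h, image_w, image_h, target=416):
    """Resize with one uniform scale, then pad the rest of the target canvas."""
    scale = min(target / image_w, target / image_h)
    return area_ratio(box_w * scale, box_h * scale, target, target)


original = area_ratio(145, 214, 800, 534)
stretched = snr_after_stretch(145, 214, 800, 534)
letterboxed = snr_after_letterbox(145, 214, 800, 534)

print(f"original:    {original:.1%}")     # original:    7.3%
print(f"stretched:   {stretched:.1%}")    # stretched:   7.3%
print(f"letterboxed: {letterboxed:.1%}")  # letterboxed: 4.8%
```

Stretching keeps the area ratio unchanged, while padding lowers it because the padded border counts as extra background, so the minimal detection size we promise should be checked against the pipeline actually in use.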

Noise Variance Scoring

Now for the noise variance:

Score the noise variance of your environment from 1 to 3, where 1 is very stable and 3 is very dynamic. Here are examples of parameters that will cause higher variance:


  • Day – Night
  • Weather conditions
  • Noise that looks like a real object
  • Moving cameras/Objects

We score our noise as 2.

Calculating our startup batch size

Let us recall our startup batch size estimation formula:


We need to plan for approximately 120,000 labeled data examples (highly variant cat images).
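One reading that reproduces this figure from values quoted in this chapter (a signal resolution of 20 cat types, an SNR factor of 3000 and a noise score of 2) is a plain product of the three factors. The sketch below encodes that reading; treat the multiplicative form as our assumption rather than a confirmed statement of the formula, and the function name as hypothetical.

```python
def startup_batch_size(signal_resolution: int,
                       snr_factor: int,
                       noise_variance_score: int) -> int:
    """ASSUMED form: plain product of the three planning factors."""
    return signal_resolution * snr_factor * noise_variance_score


batch = startup_batch_size(signal_resolution=20,    # 20 cat types
                           snr_factor=3000,         # from the SNR factor reference
                           noise_variance_score=2)  # our noise score
print(batch)  # 120000
```

Under this reading, halving the ontology to 10 cat types would halve the startup batch, which is one more argument for minimizing scope.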

And how much will this cost?

Data Collection Cost

Data costs vary widely from case to case. In our case we assume we will just scrape the web for data and add our pilot customer’s data on top of it, so it is pretty much free. In many other cases data collection might be an expensive process in itself.

Let us go over some examples for common data sourcing strategies, where each one comes with different costs associated with it.

External Data Sourcing

There is a growing market of data acquisition services, either raw data or already labeled data. Where will you find opportunities to buy data?

  • There are crowdsourcing data collection efforts starting all over the world, even from known cloud providers. This can be attractive when the data you are after is “everyday” data, which internet users can simply collect for you.
  • Data from strategic partners. In many cases partners may have the data you need, and you can set up sharing agreements, for example agreements with hospitals for medical data.
  • Data collection service providers. Just as with labeling services, there are service providers who specialize in data collection.

Staged Scenes

In some cases, mainly in human/object analysis products (retail, e-commerce, social media), it is common to set up a staged scene and collect data from it. These efforts include everything you can imagine from a movie shoot: scripts, players, set and props, where the end goal is the data from the staged scenes. Another example of staged collection is retail products, where each product in the store is photographed for the purpose of machine learning.

Sensors (Devices) Deployment

IoT is a very big trend in the technology world, and if we had to summarize its most hyped value in one word, it would be data.

Deploying sensors can be very expensive and usually precedes any AI activity. Smart factories, for example, require huge investments in sensors, sometimes over years, to enable a data collection infrastructure.

Even in our case the data is not free. The customer (city hall) has deployed and operates a smart city system, with thousands of cameras capturing the city streets. While such a system is very expensive, we get access to this data for free as part of an agreement with our customer.

Data Collection Team

In many cases, commonly in agriculture and self-driving cars, you will have data collection teams: teams who essentially go on data collection tours in the field, with a careful collection plan covering various sensors, settings, locations and parameters.

Usually, these efforts are a combination of model testing and data collection. So the next time you see one of these cars overloaded with sensors, this is what it is doing: driving around the city, testing the car, and collecting valuable error data.

Past Data

Often, our AI capability will be built on top of an existing product, which means we have past data collected by that product which we want to use to enhance it. While it may seem that this data is free, most of the time it requires significant investment in organizing it and making it machine learning friendly (data cleansing). It is very rare for previously collected data to be compatible with new AI needs out of the box.

Labeling Cost

The Naive Estimation

It is time to estimate the budget for our labeled data. Labeling automation is usually ignored in the early stages, while working in batch mode, so we will estimate a fully manual workflow. Automation can reduce costs by 90%, but it requires developer and machine learning resources of its own. I would always recommend thinking about automation from day 2 and having your ML team practice reducing data costs early on.

A common mistake is to evaluate the expected cost like this:

  • It takes 3 seconds to label an image,
  • 120,000 images will take 360,000 seconds:

360,000 seconds / 3,600 seconds per hour = 100 hours

If we take a cost-effective labeling workforce, where the minimum wage is $0.25-1.5/hour, we get what looks like a conservative estimate at the top of that range:

100 hours × $1.5/hour = $150

That could not be further from reality. Let’s take a deeper look at the labeling cost structure.

Labeling Cost Structure

This is how a typical labeling cost structure is distributed:


On a typical project, this is how labeling, management, and QA costs are distributed:


So, whatever our workforce cost is, we should multiply it by 2.5 to get the overall cost.

The workforce cost has 3 major components to it:

  • Expert time (workforce)
  • Infrastructure
  • Tooling

The expert time cost

The expert time cost is derived from:

  • Expertise
  • Geography
  • Experience

The required expertise has a strong impact on cost: as you can imagine, a doctor labeling cancer cells is priced differently from an agronomist labeling plant diseases, which is different again from a general workforce that is able to label cats.

While labeling a single FDA-ready MRI scan example can cost up to $2000, our use case allows a less expensive labeling workforce.

The low end of the labeling workforce is around $2/hour and is very sensitive to global economic trends, currency crises, and other factors. The more stable low-end prices are around $3/hour.

Workforce training and quality matter. Working with a less trained and less educated workforce usually shows up as higher management and communication costs, eventually leading to lost time and data quality issues. Cheap datasets are cheap for a reason. Although it is common practice, paying top dollar for machine learning experts and then handing them poor-quality data makes little sense. Data quality is a much more meaningful factor in your AI results than an expert with one degree or another. Invest in ramping up a quality labeling workforce and process.

My usual recommendation is to create a hierarchy of workforces, with the skilled workforce validating and monitoring the less skilled one. This setup results in an average cost of $6-8/hour.

Tools and cloud infrastructure will likely add another $1/hour.

So, labeling itself will have a final cost of around $8/hour.

And the time spent by a labeler on each image?

  • Review the image – 15 seconds
  • Mark the objects – 12.5 seconds/object
  • We estimate 2 objects per image on average

So, a single image takes more like 40 seconds (15 + 2 × 12.5) rather than 3, which brings our estimated time for 120,000 images to around 1,333 hours.

This makes our labeling cost about 1,333 × $8 = $10,666. Multiply it by 2.5 to add the quality and management overhead, and the total cost of our labeled data comes to roughly $26,600.
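The gap between the naive and the more realistic estimate can be reproduced with a short calculation. The sketch below is only an illustration: the function name `labeling_budget` and its parameters are invented for this example, while the inputs (120,000 images, 3 seconds versus 15 seconds of review plus 12.5 seconds for each of 2 objects, $1.5/hour versus $8/hour, and the 2.5 overhead multiplier) are the figures used in this section.

```python
def labeling_budget(
    n_images: int,
    review_s: float,
    mark_s_per_object: float,
    objects_per_image: float,
    hourly_rate: float,
    overhead_multiplier: float,
) -> dict:
    """Estimate labeling hours and cost from per-image timing assumptions."""
    seconds_per_image = review_s + mark_s_per_object * objects_per_image
    hours = n_images * seconds_per_image / 3600
    labeling_cost = hours * hourly_rate
    return {
        "seconds_per_image": seconds_per_image,
        "hours": hours,
        "labeling_cost": labeling_cost,
        # Management and QA overhead applied on top of the workforce cost.
        "total_cost": labeling_cost * overhead_multiplier,
    }


# Naive estimate: 3 seconds per image, $1.5/hour, no overhead.
naive = labeling_budget(120_000, 3, 0, 0, 1.5, 1.0)

# More realistic estimate: review time plus marking time, $8/hour, 2.5x overhead.
realistic = labeling_budget(120_000, 15, 12.5, 2, 8.0, 2.5)

for name, estimate in (("naive", naive), ("realistic", realistic)):
    print(
        f"{name}: {estimate['seconds_per_image']:.1f} s/image, "
        f"{estimate['hours']:.0f} hours, total ${estimate['total_cost']:,.0f}"
    )
```

Without rounding the hours along the way, the realistic total comes out at about $26,667, in line with the figure quoted in the text. Changing any single input, such as the number of objects per image, shows how sensitive the budget is to that assumption.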

The final number for our budget is $26,600, which is far from our initial naive estimate of $150. While this may sound simple, many machine learning developers who are exposed to the cost of high-quality labeled data for the first time are shocked, which often results in an initial push for cheaper, lower-quality data and processes. The path to a lower total cost of ownership for labeled data goes in the opposite direction: a higher initial investment in a talented labeling workforce, capable of operating with automated labeling pipelines and tools. That investment only makes sense if there is an actual business case behind the project.

SW 2.0 is much more expensive…

Yes, AI project costs are much higher than software project costs: data collection, model training compute, and labeled data are cost factors traditional software did not have.

Usually, early projects develop all capabilities in house: tooling, workflow management, and quality, ending up paying externally only the $4-5/hour workforce budget. If you are doing that, you are indeed paying much less externally, but the overall cost is probably 3-5X higher and you lose twice: internal overload and costs are much greater, and your time to market increases. Using in-house toolchains makes sense in the early days, while the project is trying to find its product-market fit. Once the project proves the business case, it should switch to the best toolchain the market has to offer and focus internal resources on core capabilities and winning the business.

I have seen many AI projects follow the cheap path, spending 2-3 years before acknowledging the real costs. Just developing an average scalable software system to run this process costs around $1-5 million, depending on the system complexity, and yearly maintenance is around $0.5-2 million. While some projects succeed in developing their own toolchains and solutions, the other 99% get one step closer to failure by attempting to go all in house.

87% of AI projects fail. The above is a significant factor in that, right after having no value or business model, and the two usually come together.

Labeling efficiency variance among human workers

Data labelers should be viewed as data developers. Looking at data labeling as data development work will help you reuse most of your SW 1.0 instincts and experience with SW 2.0.

From my experience:

  • Workforces have different speeds as teams, probably due to better management, skill, and process. Excellent teams work 3X faster than lower-performing teams. This means a high-performing team at $8/hour is more cost effective than a low-performing team at $3/hour.
  • High-performing teams deliver better quality, which makes the output both cheaper and better.
  • Team familiarity with the software tools matters. A team working with software it is already familiar with is about 2X faster.
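The claim that an $8/hour team can be cheaper than a $3/hour team follows from comparing cost per labeled unit rather than cost per hour. Here is a minimal sketch of that comparison. The baseline throughput of 90 images per hour is an assumption (it corresponds to 40 seconds per image), and the function name is illustrative.

```python
def cost_per_unit(hourly_rate: float, units_per_hour: float) -> float:
    """Cost of one labeled unit for a team with the given rate and throughput."""
    return hourly_rate / units_per_hour


BASELINE_UNITS_PER_HOUR = 90.0  # assumption: 40 seconds per image
SPEEDUP = 3.0                   # excellent teams are about 3X faster

low_team = cost_per_unit(3.0, BASELINE_UNITS_PER_HOUR)
high_team = cost_per_unit(8.0, SPEEDUP * BASELINE_UNITS_PER_HOUR)

# Speedup at which the $8/hour team costs the same per unit as the $3/hour team.
break_even_speedup = 8.0 / 3.0

print(f"low-performing team:  ${low_team:.4f} per image")
print(f"high-performing team: ${high_team:.4f} per image")
print(f"break-even speedup:   {break_even_speedup:.2f}X")
```

The result does not depend on the baseline throughput: any speedup above roughly 2.67X makes the more expensive team cheaper per unit, before even counting the quality difference.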


On the individual level, we see that 10% of the team is 3-5X faster. In most teams (even low-performing ones), just like in software development, there is a clear indication of talented data developers. Identify them and invest in them as you would any other talent in your org.

Here is a typical view of the performance of data labelers, or data developers:


Notice the nonproductive zone. It is meaningful and is the source of many of the data quality challenges you will experience in the future. Just as you invest in talented data developers, you should also make sure only competent data developers are allowed to touch your data.

Labeling Workforce Challenges

Data labeling is hard work. As of today the toolchain is in its early days, the work tends to be monotonous, and in some cases, like content moderation, it even presents a mental health risk. Add the market pressure for minimal pay and the fact that talent is not recognized or rewarded, and you get a workplace where people leave and are hired at a high rate. This is true for the vast majority of these service providers (though we see providers who successfully adopt different approaches). At Dataloop we spend quite some time and effort on improving this process for both the workers and the data consumers, with strong attention to the hiring dynamics:

  • The average worker will stay in the job for 8-11 months
  • The average labeling service provider loses 10-15% of its employees every month
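As a rough consistency check on these two figures, the sketch below assumes that every worker leaves with the same probability each month. That constant-rate model is a simplification not stated in the text, and the function names are illustrative. Under it, the average tenure is one divided by the monthly attrition rate.

```python
def expected_tenure_months(monthly_attrition: float) -> float:
    """Mean tenure, in months, under a constant monthly attrition rate."""
    return 1.0 / monthly_attrition


def fraction_remaining(monthly_attrition: float, months: int) -> float:
    """Share of today's team still employed after the given number of months."""
    return (1.0 - monthly_attrition) ** months


for rate in (0.10, 0.15):
    tenure = expected_tenure_months(rate)
    remaining = fraction_remaining(rate, 12)
    print(
        f"{rate:.0%} monthly attrition: average tenure {tenure:.1f} months, "
        f"{remaining:.0%} of the team left after a year"
    )
```

With 10-15% monthly attrition the model gives an average tenure of roughly 7 to 10 months, in the same range as the 8-11 months above, and well under a third of the original team remains after a year.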

That level of attrition is extreme and is the source of many pains on the path to high-quality data. If the organizational health of your labeling workforce is weak, it will take its toll on your time, cost, and quality. Let us go over the symptoms of a high-attrition labeling provider:

Quality Drop

Often a workforce change behind the scenes results in a sudden or gradual quality drop. In many cases the process looks stable, yet over time data quality drops because of the high worker replacement rate, which constantly adds new workers who are not yet fully aligned with the data recipe. This drop is the most damaging because it is silent: there is no indication it is happening. That is why a significant QA budget is critical to maintaining consistency and quality over time.

Short “Team Memory”

Another effect of the high replacement rate of data developers is the feeling of the recipe developer (usually the AI lead) that the team keeps forgetting things that were already agreed on and previously delivered correctly. It feels like the team learns and then forgets, which brings frustration into the process. This problem requires a constant training, onboarding, and qualification process per developer and task.

Worker Productivity Drop

On average it takes 3 days for experienced labeling developers to reach maximum throughput on medium-difficulty tasks. That means that, besides the quality issues, the work pace is constantly affected by workforce changes, which effectively decreases the average output per hour per worker because of the added training overhead.

Increasing Labeled Unit Cost

It takes 3 days for a worker to become familiar with the tools involved (regardless of task complexity). Rapid changes of workers and tools can increase labeled data prices by 2X, and this slowdown comes on top of the training overhead for the specific task instructions.

To overcome these issues, you will need:

  • Better paid and better treated workers, who don’t want to quit as soon as they can afford to
  • Quality tools, flows, and analytics
  • A data labeler learning, qualification, and task quality scoring system

Crowdsource vs. Dedicated Team

Crowdsource labeling is a service where many different people around the world can register and perform labeling tasks over the internet, directly or indirectly:

  • Direct – Workers register to earn money, like freelance labeling.
  • Indirect – Internet users are asked to perform labeling tasks to get access to some service, like CAPTCHA or labeling ads.

Crowdsource costs are lower than those of a dedicated workforce for the following reasons:

  • No regulation – no minimum wage or minimum age
  • No infrastructure costs – users use their own devices and offices/homes
  • Smaller management overhead
  • Indirect labeling time is free

Crowdsource, on the other hand, has challenges that do not exist with a dedicated workforce:

  • No communication with the worker
  • Worker efficiency does not improve, as tasks change rapidly
  • Worker replacement frequency is very high

The result of these challenges is that crowdsource data quality problems are much bigger, often 3-5X compared to a dedicated workforce.

When is Crowdsource effective?

  • When data can be seen by the public, like a product picture.
  • When tasks are simple to communicate, define, and measure.
  • When the required expertise is too widely distributed for a central location setup: many languages, geographies, cultures.

From my experience, crowdsource is a great addition to mature data pipelines on simple tasks. Unless you must, never start a fresh model with crowdsource; instead, identify crowdsource opportunities along the way, especially in the mature and simple parts of your recipes and ontologies.

Transition to Stream Processing

We have seen that data collection and labeling is a process of diminishing returns, which means the contribution of every new example to our overall accuracy goes down compared to past examples. As we add more data, we get closer to the model’s saturation accuracy: the accuracy limit we will never pass. The saturation accuracy is set by our modeling architecture, input definitions, ontology, and recipe. Once these are stable (and we want them to be), the limit is set.

The behavior of a given model as we add more data to it can be described as follows:


As you can see, the more progress we make, the more expensive it becomes to improve. Once we have our initial batches ready and our recipe stable, it is critical to transition the mode of work from batch processing to stream processing, where automation starts to play a bigger role.

Two important considerations:

  • The closer we get to saturation, the more the dominant factor becomes collecting edge cases rather than more of the same data.
  • Once in saturation, our model is not fire and forget. Data collection and labeling continue because the data distribution tends to change over time, so a constant focus on edge cases and quality assurance monitoring will be needed.

ML Pointer

Using this model, you can predict your improvement over time, the expected cost of reaching it, and the timeline. Use benchmarks to estimate the saturation accuracy your model will reach.

While there are many ways to model diminishing returns, a simple one was selected here: exponential decay. Other diminishing-return curves can fit just as well.
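One way to write such an exponential curve is accuracy(n) = a_sat - (a_sat - a0) * exp(-n / tau), where n is the number of labeled examples. The sketch below is an illustration under that assumption, not the exact model behind the chapter's figure: the symbols a0 (starting accuracy), a_sat (saturation accuracy), and tau (how quickly the curve flattens), as well as all numeric values, are hypothetical.

```python
import math


def accuracy(n_examples: float, a0: float, a_sat: float, tau: float) -> float:
    """Accuracy after n_examples under an exponential saturation curve."""
    return a_sat - (a_sat - a0) * math.exp(-n_examples / tau)


def examples_needed(target: float, a0: float, a_sat: float, tau: float) -> float:
    """Invert the curve: labeled examples required to reach a target accuracy."""
    if not a0 <= target < a_sat:
        raise ValueError("target must be at least a0 and below a_sat")
    return -tau * math.log((a_sat - target) / (a_sat - a0))


# Hypothetical parameters, for illustration only.
A0, A_SAT, TAU = 0.50, 0.95, 40_000

# Accuracy gained by each successive batch of 20,000 labeled examples.
BATCH = 20_000
gains = [
    accuracy((i + 1) * BATCH, A0, A_SAT, TAU) - accuracy(i * BATCH, A0, A_SAT, TAU)
    for i in range(6)
]

for i, gain in enumerate(gains, start=1):
    print(f"batch {i}: +{gain:.3f} accuracy")

for target in (0.80, 0.90, 0.94):
    n = examples_needed(target, A0, A_SAT, TAU)
    print(f"target {target:.2f}: about {n:,.0f} examples")
```

Each batch costs the same to label but buys less accuracy than the previous one, and the number of examples needed grows sharply as the target approaches the saturation accuracy. Multiplying the example counts by a cost per labeled example turns the curve into a budget and timeline forecast.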