As a Customer Success Manager, one of the most common questions I get asked is, ‘What does it mean to help the customer succeed?’ This can be a bit complex to explain, especially to someone unfamiliar with the challenges in the computer vision ML area.
So let’s start from the beginning: the buzzword ‘machine learning’ is a bit misleading. It’s not as simple as it may sound. Working as a Customer Success Manager at Dataloop AI, I’ve noticed that organizations joining the AI trend and advancing to the next level often encounter a common challenge—making their models robust and performant under diverse conditions. Then the crucial question arises: How can we handle large volumes of data without compromising its quality (and ideally improving it), and without incurring skyrocketing costs?
The answer to this is by dramatically improving the efficiency of the data workflow. And here is where I come into the picture.
By assisting my customers in improving their data workflow and its efficiency, I help them achieve their goals and enhance their ML operations, which is essentially their definition of success. Working with ML engineers and data scientists across a wide range of organizations, from startups and SMBs to enterprises, and across various industries, has shown me that each customer represents a unique world. However, there are a few methods I would recommend adopting, as they can be relevant and efficient for anyone.

Section 1: Smart data selection
Select data carefully, investing your time and effort wisely. Choose the more informative data points to label, prioritizing those that would provide the most valuable information. There are two key points in the data workflow where careful selection can enhance efficiency, focusing effort and cost:
- At the beginning of the process:
Annotation is a time-consuming process. As an ML engineer, by carefully selecting the most relevant data after the collection phase and before annotation, you can concentrate on the valuable and representative samples. This ensures coverage of data variety and anomalies while optimizing resource use. You can use various methods for smart data selection. Here are a few examples:
- Cluster Analysis: Group similar data points into clusters and select representative samples.
- Outlier Detection: Prioritize samples that represent edge cases or challenging scenarios, helping the model handle a wider range of real-world situations.
- Based on Your Model’s Performance: Choose samples where the model is uncertain during predictions, improving confidence in those areas, a common practice in active learning workflows.
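To make the cluster-analysis idea above concrete, here is a minimal sketch of one way to pick diverse, representative samples from a pool of embeddings: greedy farthest-point sampling. The function name, the toy 2D points, and the choice of algorithm are my own illustration, not a specific Dataloop feature; in practice you would run this over real embedding vectors.

```python
import math

def diverse_sample(embeddings, k):
    """Greedy farthest-point sampling: pick k items spread out across
    embedding space, so the labeled set covers the dataset's variety
    instead of clumping in one dense region."""
    if not embeddings or k <= 0:
        return []
    selected = [0]  # start from the first item
    while len(selected) < min(k, len(embeddings)):
        # choose the point farthest from everything selected so far
        best_idx, best_d = None, -1.0
        for i, emb in enumerate(embeddings):
            if i in selected:
                continue
            d = min(math.dist(emb, embeddings[j]) for j in selected)
            if d > best_d:
                best_idx, best_d = i, d
        selected.append(best_idx)
    return selected

# Three tight groups of points; asking for 3 samples should pick
# roughly one from each group rather than three near-duplicates.
points = [(0, 0), (0.1, 0), (10, 10), (10, 10.1), (0, 10), (0.1, 10)]
print(diverse_sample(points, 3))
```

For large datasets you would typically swap this O(n·k) loop for a library clustering routine (e.g. k-means) and select samples nearest each centroid; the principle is the same.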
I’ll illustrate this approach with a workflow I implemented together with one of our defense industry clients. In this process, we use the confidence scores of the model’s predictions to pinpoint valuable data that warrants attention. This identified data is then earmarked for additional model training, to refine the model’s performance.

When the model predicts a specific item with a confidence level falling below the customer-defined threshold, it indicates suboptimal model performance for those items. Consequently, these items are prioritized and routed to human annotation as high-priority tasks. This strategy is founded on the premise that addressing the model’s weakest areas first will optimize efficiency, resulting in a substantial improvement in overall performance. To maximize the benefits of this approach, we execute this process in cycles, forming an active learning loop. By adopting this method, we ensure that your time and resources are invested in the data that truly matters.
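The routing step described above can be sketched in a few lines. This is a simplified stand-in, assuming predictions arrive as item-id-to-confidence pairs; the item names and the 0.8 threshold are illustrative, and in production the threshold is set by the customer per use case.

```python
def route_predictions(predictions, confidence_threshold=0.8):
    """Split model predictions into auto-accepted items and
    low-confidence items routed to human annotation — the core
    decision of a confidence-based active learning loop.
    `predictions` maps item id -> model confidence score."""
    to_human, auto_accepted = [], []
    for item_id, confidence in predictions.items():
        if confidence < confidence_threshold:
            to_human.append(item_id)    # weak spot: prioritize for labeling
        else:
            auto_accepted.append(item_id)
    return to_human, auto_accepted

preds = {"img_001": 0.95, "img_002": 0.42, "img_003": 0.78, "img_004": 0.88}
needs_review, accepted = route_predictions(preds, confidence_threshold=0.8)
print(needs_review)  # items sent to human annotators first
```

Each cycle of the loop retrains on the newly labeled items and re-scores the remaining pool, so the threshold keeps steering effort toward the model’s current weak spots.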
- While sending data for QA or Expert review:
High-quality training data is essential for accurate and unbiased machine learning models. Following the annotation phase, the data often undergoes a second review by another annotator or a domain expert. Despite automation advancements, domain experts’ value has increased, particularly in complex use cases where machines cannot fully replicate the human knowledge (yet), such as medical and agritech.
However, human hours, especially those of experts, remain one of the most expensive resources in the process. Selecting the right data for a second review allows us to achieve several goals simultaneously: enhancing quality, saving time, and dramatically reducing costs.
To achieve this, what practical steps can we take?
Smartly sampling data for QA allows us to pinpoint the areas that require the most attention. Methods for recognizing those areas include prioritizing classes where the model has underperformed in the past and which we would like to strengthen, focusing on areas that demand deep domain expertise for accurate decision-making, and weighting annotation classes by their complexity, e.g., where mistakes have occurred frequently before. For more precise, data-driven answers to these questions, I highly recommend leveraging analytics from similar previous projects, if available. This strategic approach to data sampling will not only enhance the QA process but also allocate resources where they are most needed.
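A simple deterministic version of this sampling policy: review every item from classes flagged as error-prone by past analytics, plus a fixed 1-in-N spot check of everything else. The function, the class names, and the 1-in-N rate are illustrative assumptions, not a prescribed Dataloop workflow.

```python
def sample_for_qa(items, risk_classes, every_nth=20):
    """Select annotated items for a second (QA/expert) review:
    all items whose label belongs to a known error-prone class,
    plus a routine 1-in-N spot check of the remaining items.
    `items` is a list of (item_id, label) pairs."""
    selected = []
    regular_count = 0
    for item_id, label in items:
        if label in risk_classes:
            selected.append(item_id)        # full review for risky classes
        else:
            regular_count += 1
            if regular_count % every_nth == 0:
                selected.append(item_id)    # routine spot check
    return selected

items = [("i1", "car"), ("i2", "truck"), ("i3", "rare_sign"), ("i4", "car")]
print(sample_for_qa(items, {"rare_sign"}, every_nth=2))
```

The spot-check rate and the risk-class set are exactly the knobs you would tune using analytics from previous projects.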
Section 2: Automate your workflow
One of the most common methods to enhance efficiency, accelerate processes, and scale up is to meticulously define the workflow and then automate its manual steps.
Build your own data pipeline:
As engineers, we greatly appreciate automating processes and tasks that involve repetitive manual work. This is precisely the point: you can automate your data pipeline from acquisition to training and save time and effort. By carefully designing and defining its building blocks, you can efficiently move data from one step to the next, creating an automated and customized workflow that seamlessly integrates humans and machines.
What are common parts of the ML workflow we can automate?
- Pre-processing
There are multiple processes that ML engineers perform on the data before sending it to the subsequent steps in the process. Automating this phase and integrating it into your pipeline workflow can save a significant amount of time and money.
Let me share an example that I’m sure will drive the message home:
Filter out unsuitable images:
In this use case, I implemented an automated step within an insurance company’s workflow. This step is responsible for assessing the suitability of each image to continue in the process, based on predefined criteria set by the customer, such as image clarity, sharpness, blurring, etc. The simple script we integrated filters out data that doesn’t meet the criteria, sending the remaining items for annotation by humans. Sounds simple? Simple, yet effective.
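As a minimal sketch of such a suitability gate, here is a pure-Python blur check based on the variance of the image Laplacian: sharp images have strong edges and score high, blurry ones score low. The real step used the customer’s own criteria (clarity, sharpness, blur, etc.), so treat the function names, the nested-list image format, and the threshold here as illustrative assumptions; in production you would use an image library over real pixel data.

```python
def laplacian_variance(gray):
    """Variance of the discrete Laplacian of a grayscale image,
    given as nested lists of pixel intensities. Higher = sharper."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def filter_unsuitable(images, threshold=50.0):
    """Split (name, pixels) pairs into images suitable for annotation
    and images auto-discarded for being too blurry."""
    suitable, discarded = [], []
    for name, gray in images:
        (suitable if laplacian_variance(gray) >= threshold
         else discarded).append(name)
    return suitable, discarded

# A high-contrast checkerboard (sharp) vs. a uniform gray image (no detail).
sharp = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]
flat = [[128] * 4 for _ in range(4)]
print(filter_unsuitable([("sharp", sharp), ("flat", flat)]))
```

Discarded items never reach an annotation task, which is exactly where the time savings below come from.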
After implementation, the results speak for themselves. Let’s look at the numbers:
- Before the implementation, approximately 7% of items in each dataset were manually marked as discarded (unsuitable for annotation) during the annotation process; afterward, less than 1% were marked this way manually.
- The remaining unsuitable items in this dataset (~8%) were automatically marked as discarded in advance.
This translates to an ~85% reduction in the number of items flagged as unsuitable for annotation after reaching human annotation tasks. The automated check has significantly optimized the workflow, saving hundreds of human hours that would otherwise have been spent on unsuitable data—a significant cost saving that frees up time and resources while enhancing the overall quality of the annotated data.
- Auto-annotation
By integrating auto-annotation into your data workflow, you can optimize human effort by allowing the model to handle simpler tasks while humans focus on more complex areas.
There are two main approaches:
Streamlining Human Effort with Model Assistance:
The model performs an initial analysis to point out where humans need to focus their effort. By reducing the workload this way, the model lets humans complete the task with complementary effort.
For example, let’s consider a real case I designed with a team of data scientists working in the pharmaceutical industry, where auto-annotation is utilized.
In this scenario, the model identifies key terms in the text, such as the drug name and symptoms described, and highlights them in advance for the domain experts. Then, the human reviewers use this information to determine whether the text describes an experience of side effects or not.
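A toy version of that pre-highlighting step, assuming the simplest possible "model": matching against a known vocabulary of terms. The drug name `Exampletol`, the term list, and the function are all invented for illustration; the actual project used a trained model rather than string matching, but the output shape—spans handed to human reviewers—is the same idea.

```python
import re

def prehighlight(text, terms):
    """Pre-annotation step: locate known terms of interest (e.g. drug
    names, symptom words) so domain experts can jump straight to the
    relevant spans instead of reading every report from scratch.
    Returns sorted (start, end, term) character spans."""
    spans = []
    for term in terms:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), term))
    return sorted(spans)

report = "Patient reported nausea and dizziness after taking Exampletol."
spans = prehighlight(report, ["Exampletol", "nausea", "dizziness"])
for start, end, term in spans:
    print(f"{term}: chars {start}-{end}")
```

The reviewer then sees the highlighted spans in context and makes the final call—here, whether the text describes a side-effect experience.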

Human Oversight for Model-Generated Results
In the first phase, the model handles the entire task. In the second phase, humans focus solely on reviewing and fixing any errors in the results.
By implementing auto-annotation in your workflow using any of the methods suitable for your use case, you can improve efficiency and accuracy, allowing your team to focus their efforts where they are most valuable.
- Automated validation
Integrating automation validation into your data workflows offers several benefits: it reduces manual effort, improves data quality, and facilitates continuous improvement in the annotation process.
One effective approach to leverage these advantages is real-time validation. By enforcing a set of predefined rules, real-time validation assists in early error detection, provides immediate feedback that improves annotators’ learning curve, and saves significant time for human reviewers.
Alternatively, post-validation may be more suitable in certain cases, ensuring consistency in data quality assessment across different annotators and projects, thus maintaining uniform standards throughout the workflow. Additionally, post-validation is beneficial for checking annotations across a whole batch of data.
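A minimal sketch of such rule-based validation, usable in either mode: run it as each annotation is saved (real-time feedback) or over a whole batch afterward (post-validation report). The rule names, the bounding-box schema, and the 1920×1080 image size are illustrative assumptions, not a fixed schema.

```python
def validate_annotation(ann, rules):
    """Run predefined validation rules against a single annotation and
    return the names of the violated rules. An empty result means the
    annotation passes; otherwise the annotator gets immediate feedback
    (real-time mode) or the item is flagged in a batch report."""
    return [name for name, check in rules.items() if not check(ann)]

# Hypothetical rules for a bounding-box annotation task.
rules = {
    "label_allowed": lambda a: a["label"] in {"car", "truck", "bus"},
    "box_positive":  lambda a: a["w"] > 0 and a["h"] > 0,
    "box_in_image":  lambda a: a["x"] + a["w"] <= 1920 and a["y"] + a["h"] <= 1080,
}

good = {"label": "car", "x": 10, "y": 20, "w": 100, "h": 50}
bad = {"label": "bicycle", "x": 1900, "y": 20, "w": 100, "h": 50}
print(validate_annotation(good, rules))  # []
print(validate_annotation(bad, rules))   # ['label_allowed', 'box_in_image']
```

Because the rules are plain data, the same set enforces consistent standards across annotators and projects—the uniformity benefit mentioned above.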
Section 3: End-to-end solution: Centralizing multiple aspects of your ML workflow under one roof
If you’re involved in ML model development, you’ve likely experienced the significant overhead of transferring data between different tools. As an ML engineer, you’ll probably spend effort handling format conversions, risking data corruption or loss, and investing in preprocessing.
Consolidating the entire ML workflow within a single platform is key to overcoming these obstacles, leading to a dramatic reduction in overhead, costs, and complexities associated with aligning data.
This approach enhances collaboration among team members, streamlining communication and workflow efficiency.
Now, let’s explore a real-world example of an active learning pipeline implemented with a customer, enabled by integrating all meaningful steps of the ML workflow under one roof:
In this case, I collaborated with a customer from the fuel retail and fleet management industries to automate the fueling process.
We worked together to train a model that detects vehicles, recognizes license plates, and identifies whether the fuel nozzle is at the vehicle’s inlet, enabling or blocking the fueling process.
One of the biggest challenges in building ML models is annotating large datasets, and active learning can help overcome it. By selectively labeling only the most informative examples, this method makes the learning process more efficient and achieves high accuracy with fewer labeled samples.
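The full loop can be summarized as a short driver function: train on what you have, score the remaining pool, send only the low-confidence items to annotators, repeat. The `train`, `predict`, and `annotate` callables below are toy stand-ins (confidence simply grows with the amount of labeled data), invented to show the control flow rather than the customer’s actual models.

```python
def active_learning_loop(pool, train, predict, annotate,
                         threshold=0.8, max_rounds=5):
    """Generic active-learning driver. Each round: train on the labeled
    set, score the remaining pool, and route only low-confidence items
    to human annotation. Stops early once the model is confident on
    the whole pool, so not every item needs a label."""
    labeled = {}  # item -> human-provided label
    for _ in range(max_rounds):
        model = train(labeled)
        remaining = [x for x in pool if x not in labeled]
        uncertain = [x for x in remaining if predict(model, x) < threshold]
        if not uncertain:
            break  # confident everywhere: stop labeling
        for x in uncertain:
            labeled[x] = annotate(x)  # human-in-the-loop step
    return labeled

# Toy stand-ins: "model quality" is just the labeled-set size, and
# per-item confidence rises with it.
train = lambda labeled: len(labeled)
predict = lambda model, x: min(1.0, 0.5 + 0.1 * model + 0.05 * x)
annotate = lambda x: f"label_for_{x}"

labeled = active_learning_loop(list(range(10)), train, predict, annotate)
print(len(labeled))  # fewer than the 10 items in the pool
```

In the toy run only part of the pool ever gets labeled, which is the whole point: annotation effort concentrates where the model is weakest.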

Using this method, we achieved an impressive ~96% rate of correct decisions by the model during fueling sessions.
But what does this number signify? Essentially, it indicates the availability of a software-based alternative to the current hardware solution. By substituting hardware with a software solution, we can eliminate the need for costly hardware devices, thereby reducing upfront expenses and ongoing maintenance costs. Moreover, the software solution offers scalability without requiring installation on each individual vehicle, making it a more versatile and cost-effective option.
By embracing a comprehensive solution that integrates every aspect of the ML workflow, you can streamline operations, enhance collaboration, and achieve significantly better outcomes. This holistic approach not only simplifies processes but also fosters an environment conducive to efficiency, cost-effectiveness, and resource optimization, ultimately driving success in ML model development and deployment.
In conclusion, improving data flow efficiency is crucial for navigating the complexities of the ML landscape, especially when dealing with large volumes of data while striving to maintain quality standards and cost-effectiveness.
By strategically selecting and preprocessing data, automating manual tasks, and consolidating workflow components, significant enhancements in productivity, accuracy, and cost-effectiveness can be achieved.
As we move forward on the journey into 2024, I encourage you to reassess your current workflows and consider implementing any improvement you find as relevant for your specific use case and the challenges you are facing.
I strongly believe in embracing a lifelong journey of growth and learning. If you found this article insightful, I invite you to explore more AI materials, such as my previous blog post or any other Dataloop AI articles on our blog.
ML engineers, we are here to assist you in boosting your ML operations ahead!