Snippets From ML DataOps Summit: How to Navigate Data Tooling for Better AI Training Data

If you missed the iMerit Technology ML DataOps Summit with TechCrunch, we’ve got you covered. Our CPO and Co-Founder, Avi Yashar, spoke at this virtual event, which brought together 2,000+ data scientists, engineers, and ML professionals to learn how to navigate data tooling and achieve high-quality AI training data. Here’s a recap of a few great questions and Avi’s responses.

Question #1: What do you think customers are looking for when it comes to enterprise-grade software vs. holistic software?

Over the past three to five years at Dataloop, we’ve learned that enterprises and larger companies are looking for a stable, well-polished, enterprise-grade platform. Until now they have made do with premature tools, because most of these companies are still in the research or development phase; only about 5% are actually in production. But we’re now seeing a transition that will span the next two to five years, as these companies shift into the production phase of their AI applications. When that happens, they will need tools that let them scale up without worrying about quality issues. They will also need to customize their workflows in order to connect humans and scale, but most important will be ensuring data quality. They will also be strict about SLAs: human-in-the-loop processes will be providing answers for end customers, and those answers will require SLAs. I think this will happen in the next two to three years.

Question #2: As a Customer Service Engineer, how do you respond to big customer needs?

The first thing is to get a better understanding of your market and your customers. It’s critical to have a long runway, and you have to be able to justify every dollar you put into the roadmap and the product. From a tooling perspective, you need to understand your customers inside and out, and you also need to understand your users. You need to be able to identify the value you can provide with the minimal amount of development effort. Finally, you need to look at your roadmap and really understand the next phase.

Question #3: Do you find your customers are only using a small piece of the functionality on the platform you built? How do you nudge them in the right direction?

In MLOps it’s all about communication and collaboration. The data scientist can be in Tel Aviv and the data annotator in India. How do you get them to communicate with each other? By discovering more and more effective tools to work with.

Question #4: Are you pursuing non-traditional boundaries of deep learning?

I think there will be a transition in the role of data annotators. They’ll evolve from data annotators into data developers, just like software developers. They will work on the data as data developers, and then move on to developing data pipelines. They probably have the best understanding of the data: they see a lot of examples, they know what’s working and what’s not, they see what’s easy to annotate, and they can easily identify edge cases and anomalies. Eventually, we can give those data developers the best tools to break the barrier of working on an entire dataset: efficient tools that let them work on micro-tasks and break those micro-tasks down, preventing quality issues. I think this will be the next phase of deep learning. There will definitely be self-supervised learning, with new generations of synthetic data filling the gaps, but eventually the people working on the data will be part of the deep learning space in the next two to five years.

Question #5: How do you test the stability and scalability of your platform as it pertains to data annotation?

We have different environments where we load a lot of events for different users, from semantic segmentation to video processing. We have another simulation of data ingestion where we upload millions of images at once and observe how the system handles the burst. We test each environment in a specific scenario, simulating how the customer will use the platform under demanding conditions and also in what we call the “silence wave,” when there is not much traffic running into our system. We refer to this as customer-oriented scenario testing. It involves a lot of stress testing, both at the user level and across the system.
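As a rough illustration of the burst-ingestion idea described above, here is a minimal sketch of a load test in Python. The `ingest` function and its latency are hypothetical stand-ins (the real platform would receive image payloads over the network); the point is simply to fire many uploads concurrently and measure throughput and failures.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ingest(item_id: int) -> bool:
    """Hypothetical stand-in for the platform's ingestion endpoint."""
    time.sleep(0.001)  # simulate per-item processing latency
    return True

def burst_ingest(n_items: int, workers: int = 32) -> dict:
    """Fire n_items uploads concurrently and report a simple summary."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(ingest, range(n_items)))
    elapsed = time.perf_counter() - start
    accepted = sum(results)
    return {
        "accepted": accepted,
        "failed": n_items - accepted,
        "items_per_sec": n_items / elapsed,
    }

report = burst_ingest(1000)
print(report["accepted"], report["failed"])  # prints: 1000 0
```

A real test harness would of course run against staging infrastructure, vary the burst size and worker count, and alternate busy bursts with quiet “silence wave” periods to cover both extremes.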

Final Thoughts

This ML DataOps Summit discussion gave an interesting showcase of how enterprises can navigate the ecosystem of data tooling, prioritizing two key elements: data tooling and a skilled tooling workforce. Together, these show how you can ultimately achieve the goal of high-quality training data sets for AI. If you’d like to hear it for yourself, be sure to catch the entire discussion, led by Jai Natarajan of iMerit. Watch the full discussion.
