Intro


Over the past decade, I have spent a significant part of my time working on data and machine learning systems, always with a human angle attached. I have gained a great deal of knowledge and experience and have led the development of several human-machine expert systems. In my current role, I’m the CEO & CTO of Dataloop.ai, which develops a data lifecycle management platform.

In general, I find that the value of advice is usually strongly correlated with the domain expertise of the person offering it. So, before you take my advice, I’ll give you a bit more information about the path I’ve taken over the last few decades while working on these human/machine data-driven systems.

 

Let’s dive right in.  

 

At the age of 12, I was already fixing PCs, first just as a hobby and later as a gig. In 1995, I began to notice that some kids with internet-connected PCs had nude pictures in their browser history. I joined the Army in 1996 and returned to the computer scene in 2000. By that time, nude pictures had been replaced by porn videos. In 2001, in my first year at the Technion-Israel Institute of Technology, I came up with a startup idea for internet safety software.

 

The idea was as follows:  

 

  • Install a monitoring application on PCs

  • The application would sample screen activity

  • An algorithm would analyze the results and raise alerts for suspicious or unsafe content

  • Suspicious content would be sent to India for human moderation (annotation), and we would improve the algorithm over time as we collected more examples and automated most of the work

 

I started assembling a group of students to join me and went to a VC (venture capital investor). The very short conversation went as follows:

 

“The beauty of the internet is anonymity; do you want to destroy that? People will never install spy software on their PCs!”  

 

“But the kids … you don’t understand what’s going on inside …” I tried to save the doomed conversation.  

 

“Why don’t you go work for a big company, like Intel? Learn how the tech world works, and then try some things on your own.”

 

He was spot on. I had no idea at the time how the tech world worked, especially the fact that you should always heavily discount VC advice. Nevertheless, I went to work for Intel.

Intel was an amazing place to work. Suddenly, I was understanding processors and operating systems at the transistor level, and I was very successful at it. I intended to stay at Intel for only three years, just long enough to put their logo on my resume, but I ended up staying for 13.

 

But entrepreneurship is a disease with no known cure. During my time at Intel, I started my first startup, Commentino.com.  

 

The initial idea was a single place that aggregated all your posts from across the web into one blog, but that had no business model. Since we already had a crowd of content writers, the business path became having them write comments on your behalf, generating traffic to your website. Instead, we built a stock-exchange-style marketplace for ads. We had 10,000 writers and a working business model, with a return of 50x per ad compared to Google. But we had a major problem: it became a fake-news machine, no matter how hard we worked to prevent it. In 2007, the fake-news business was already alive and kicking, and the machine worked great; customers asked to buy lies or promotions, and many of our writers were happy to deliver if the price was right. Cleaning up that machine required a lot of machine learning (processing operations at that scale was knowledge I did not have at the time) and money we could not raise (the 2008 financial crisis). After failing with human-based cleaning (my first encounter with Amazon’s crowdsourced tasking platform, mTurk, which didn’t work well on complex tasks), we shut it down. Still, it was the first time I tried to develop a large-scale human-machine data expert system.

 

Shortly after closing Commentino, I joined Smartap in 2009 as the third co-founder and system architect. This time we were building the smartest shower in the world. The idea was simple: build a smart, connected shower that would save water (a 30% reduction) by learning usage patterns from data. This would be accomplished by collecting and analyzing usage data and controlling the water flow and heating timing. I brought my cloud/data experience (AWS was so tiny back in 2007) and was excited to work on a large-scale system of smart connected devices and extract meaning from the data. Without realizing it, I had become an IoT/big data expert while working on different types of data-driven systems.

 

Meanwhile, the data bug started leaking into my Intel work.

 

Around the same time, I was leading the CPU bug-hunting team at Intel for its leading products. As the Intel Israel site tends to do, we went for a very bold CPU design, Sandy Bridge. This CPU was the first to integrate graphics and memory controllers, along with many other features for data pipeline acceleration and processing. It was a major market leap that became an extreme bug-analysis challenge ahead of the processor launch, putting the entire Intel roadmap at major risk.

 

A single CPU bug takes several weeks of expert analysis (there are only dozens of such experts in the world at any given time) and then six months to fix. With an 18-month timeline, there was little room for mistakes. We had a million suspected issues globally every day, and the experts analyzed 0.1% of that volume. It was a pretty significant crisis moment.

 

With my expert team able to handle only dozens of issues per week, a million a day left me with no real choice: the usual playbook wouldn’t fly. By that time, I had already learned Python, some machine learning, and large-scale data processing, as well as the power of connecting human experts into that flow. I took my top three experts and told them they would stop working on CPU analysis. “I need your wisdom a million times a day, not once a week.” They played ball. The principle was simple:

 

We created something we like to call “signatures” (a kind of rule-based labeling script) in which the experts encode their knowledge and insights, trying to eliminate all known issues automatically and passing only genuinely suspicious cases to the rest of the team. Within three months, the system handled all issues with 99% automation and became the hero of the day, along with the team that created it. Three years later, the system was standard in every Intel lab. This was my first success in building and scaling a large-scale expert system.
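To make the idea concrete, here is a minimal sketch of what such a signature flow might look like. The issue fields, rules, and function names are illustrative assumptions, not the actual scripts we used at Intel.

```python
# Hypothetical "signature" flow: experts encode their knowledge as rule-based
# labeling scripts that dismiss known issue patterns automatically, so only
# unexplained cases reach the human team. All field names and rules here are
# illustrative assumptions.

def signature_known_false_alarm(issue):
    """One expert-written rule: label issues matching an already-explained pattern."""
    if issue.get("unit") == "memory_controller" and issue.get("error_code") == 0x1F:
        return "known_false_alarm"
    if issue.get("repro_count", 0) < 3:
        return "insufficient_reproduction"
    return None  # rule does not apply


SIGNATURES = [signature_known_false_alarm]  # in practice, dozens of such rules


def triage(issues):
    """Run every signature over every issue; keep only the unexplained ones."""
    auto_labeled, needs_expert = [], []
    for issue in issues:
        label = None
        for signature in SIGNATURES:
            label = signature(issue)
            if label:
                break
        if label:
            auto_labeled.append((issue, label))
        else:
            needs_expert.append(issue)  # the small fraction that still needs a human
    return auto_labeled, needs_expert
```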

 

By 2013, I had started spending my time on two main areas:

 

  • Founding the local Intel IoT accelerator (with the push of my manager, who embraced my crazy ideas, and in partnership with my Dataloop co-founder, Avi Yashar)

 

  • Building the first AI (deep learning) team at the Intel Development Center in Israel

 

During my work on Intel’s IoT accelerator, I was exposed to hundreds of startups and technologies every year. I helped dozens of them scale their data operations, management, and analytics using Intel processors. With every case, I learned a bit more about the different data types, their potential business value, and the challenges in extracting that value from data.

 

The AI team launch was a different story. I had watched AlexNet change everything in 2012 (the biggest deep learning breakthrough, the one that started the current AI hype), and together with Intel’s global strategy office, we understood something big was coming. No one was willing to fund it internally back then (walking into a room in 2013 and saying “AI” made people laugh rather than ask to join), so I funded the team myself out of my core processor team, with some help from the Intel strategy office, again backed by my manager, who had gotten used to giving me creative freedom.

 

I spent a few years on deep learning and IoT, learning that my passion for human-expert data-driven systems was on the path to becoming a major global technological need. Together with Avi, I left Intel and started Dataloop.ai, a company building a deep learning data platform. We founded Dataloop on three major pillars:

 

  1. Data development 

 

The world is moving towards a profession referred to as “data development”. This profession will have many sub-roles, methodologies, and processes, and we should expect every known software paradigm to have an equivalent data paradigm.

 

  2. Data labeling by humans is at the core of AI

 

The human experts doing the data labeling are a critical part of the system’s intelligence; they are the source of truth in the system. The need for human intelligence is going to be essential for decades to come, and successful learning systems will not try to replace humans but will work alongside them.

 

  3. Data loops are critical for intelligence and continuous learning.

 

Intelligence is the ability to adapt to new conditions. Machines are not expected to be able to learn by themselves in the coming decade(s), which means humans will have to help along the way, giving guidance as reality (the environment) changes. We can expect a future of machines processing data while continuously learning from human feedback throughout the entire life cycle of their applications. We believed this principle was so powerful that we named Dataloop after it.
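As a rough illustration of what we mean by a data loop, here is a minimal sketch, assuming a generic model interface, a human labeling function, and a confidence threshold; the names and structure are hypothetical and not Dataloop’s actual API.

```python
# Minimal sketch of a data loop: the model handles what it is confident about,
# humans label what it is not, and the model is periodically retrained on the
# growing set of human-verified examples. The model/labeler interfaces and the
# threshold are hypothetical.

def data_loop(model, incoming_batches, human_labeler, retrain, threshold=0.8):
    labeled_examples = []
    for batch in incoming_batches:
        for item in batch:
            prediction, confidence = model.predict(item)
            if confidence < threshold:
                # Low confidence: ignore the prediction and route the item to a
                # human expert, the source of truth in the system.
                labeled_examples.append((item, human_labeler(item)))
        if labeled_examples:
            # Fold the accumulated human feedback back into the model.
            model = retrain(model, labeled_examples)
    return model
```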

 

Over the past four years, Dataloop’s amazing team has been dreaming, working, defining, labeling, and managing hundreds of datasets across most of the verticals you can think of.

 

I will share our experience in this book and hope you will find it useful.

 

I am sharing these learnings as we learn them ourselves:  

 

  1. This is probably the most comprehensive methodology book on the matter 

  2. It will be significantly different if I write it again in three years; industries are being born as we move.

Next Chapters

Chapter 1

Whenever I wish to deeply understand a topic, I always start with the bigger picture, as I believe it is critical to cascade the understanding of “why” down the value chain. The “why” flows all the way from global trends, through our workplace values, and finally to our daily work and life.

Chapter 2

I define a cognitive application as an application that can completely replace a collection of human cognitive actions, or the “thinking” part of a given work task or skill. In many cases, these applications will start as human assistants and gradually replace humans completely as they become more reliable and broader in scope.

Chapter 3

So, data is critical for developing AI bots or cognitive applications, but that line of thinking can be misleading. The common phrase “data is the new oil” is often used to express the value of data while ignoring the more important aspects of information and knowledge.

Chapter 4

AI development is essentially the process of collecting and organizing information. Data is collected, its meaning is extracted as pieces of information, and then it is structured into a format that allows the knowledge these pieces represent to be learned later.

Chapter 5

Training is the process in which we take our training set, i.e., the collection of data examples we have gathered, and create a model that learns from these examples. We call this “training” rather than “coding” since the model is created automatically from our data, with no coding involved. The result of a training session is code we can then run to predict the learned properties on new data.

Chapter 6

While bias and variance are usually discussed by data scientists and ML experts, understanding them requires no technical skills and is critical for anyone working with data-driven products; after all, these are the data modeling bugs that will hurt our users’ experience and our product’s competitiveness. It is time to gain deeper intuition about these concepts. No worries, you will understand them without a single equation involved.

Chapter 8

It is very popular to talk about machine learning these days while ignoring the teachers in this learning process. It is time to discuss the machine learning process from its less common perspective: as a teaching process.

Chapter 9

We are preparing to launch our AI app. We have basic models that are functional, we have agreed with the pilot customer on a calibration period that allows our models to adjust to fresh data, and the data interfaces (APIs) with customers have been defined. In this chapter, we will dive into the preparation and planning needed for launching and scaling our app deployment.