In general, I find that the value of advice is usually strongly correlated with the domain expertise of the person offering it. So, before you take my advice, I’ll give you a bit more background on the path I’ve taken over the last few decades while working on these human/machine data-driven systems.
Let’s dive right in.
At the age of 12, I was already fixing PCs, first as a hobby and later as a gig. In 1995, I began to notice that some kids with internet-connected PCs had nude pictures in their browser history. I joined the Army in 1996 and returned to the computer scene in 2000; by then, the nude pictures had been replaced by porn videos. In 2001, during my first year at the Technion – Israel Institute of Technology, I came up with a startup idea for internet safety software.
The idea was as follows:
Install a monitor application on PCs
The application will sample screen activity
An algorithm will analyze the result and raise alerts for suspicious or unsafe content
The suspicious content will be sent to India for human moderation (annotation), and we will improve the algorithm over time as we collect more examples, automating most of the work (a hypothetical sketch of this flow follows the list)
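In hindsight, that plan already had the shape of a human-in-the-loop pipeline. Here is a minimal, purely hypothetical sketch of the flow; the product was never built, and every name below is invented for illustration:

```python
# Hypothetical sketch of the 2001 idea; none of this was ever built,
# and every name below is invented for illustration.

import time

def sample_screen():
    """Placeholder: capture a periodic sample of on-screen activity."""
    return b"...pixels..."

def analyze(sample):
    """Placeholder algorithm: return True if the content looks unsafe."""
    return False

def raise_alert(sample):
    """Placeholder: notify the parent of suspicious activity."""
    print("ALERT: suspicious content detected")

def queue_for_moderation(sample):
    """Placeholder: send the sample to human moderators; their labels
    become new training examples, improving the algorithm over time."""
    print("queued for human annotation")

def monitor(samples=10, interval_s=1):
    for _ in range(samples):
        sample = sample_screen()          # 1. sample screen activity
        if analyze(sample):               # 2. algorithm flags unsafe content
            raise_alert(sample)           # 3. raise an alert
            queue_for_moderation(sample)  # 4. humans annotate; model improves
        time.sleep(interval_s)

monitor()
```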
I assembled a group of students to join me and went to a VC (venture capital investor). The very short conversation went as follows:
“The beauty of the internet is anonymity; do you want to destroy that? People will never install spy software on their PCs!”
“But the kids … you don’t understand what’s going on inside …” I tried to save the doomed conversation.
“Why don’t you go work for a big company, like Intel? Learn how the tech world works, and then try some things on your own.”
He was spot on. I had no idea at the time how the tech world worked, especially the fact that you should always heavily discount VC advice, but nevertheless, I went to work for Intel.
Intel was an amazing place to work. Suddenly, I understood processors and operating systems down to the transistor level, and I was very good at it. I intended to stay at Intel for only 3 years, just long enough to put their logo on my resume, but ended up staying for 13.
But entrepreneurship is a disease with no known cure. During my time at Intel, I started my first startup, Commentino.com.
The initial idea was a single place that automatically collected all your posts across the web into one blog, but that had no business model. Since we already had a crowd of content writers, the business path was to have them write comments on your behalf, generating traffic to your website. Instead, we built an ad exchange. We had 10,000 writers and a working business model, returning 50x per ad compared to Google. But we had a major problem: the exchange became a fake-news machine, no matter how hard we worked to prevent it. In 2007, the fake-news business was already alive and kicking, and the machine worked great: customers asked to buy lies or promotions, and many of our writers were happy to deliver if the price was right. Cleaning that machine required a lot of machine learning (processing operations with that amount of data was knowledge I did not have at the time) and money we could not raise (the 2008 financial crisis). After failing with human-based cleaning (my first encounter with Amazon’s crowdsourcing platform, Mechanical Turk, which didn’t work well on complex tasks), we shut it down. Still, it was the first time I tried to build a large-scale human-machine data expert system.
Shortly after closing Commentino, I joined Smartap in 2009 as the third co-founder and system architect. This time we were building the smartest shower in the world. The idea was simple: build a smart, connected shower that would save water (a 30% reduction) by learning usage patterns from data, collecting and analyzing usage data and controlling the water flow and heating timing. I brought my cloud/data experience (AWS was so tiny back in 2007) and was excited to work on a large-scale system of smart connected devices and to extract meaning from the data. Without quite realizing it, I became an IoT/big-data expert while working on different types of data-driven systems.
Meanwhile, this new data bug started leaking into my Intel work.
Around the same time, I was leading the CPU bug-hunting team at Intel for its leading products. As the Intel Israeli site tends to do, we went for a very bold CPU design: Sandy Bridge. This CPU was the first to integrate graphics and memory controllers, along with many other features for data-pipeline acceleration and processing. It was a major market leap, but it became an extreme bug-analysis challenge as the launch approached, putting the entire Intel roadmap at risk.
A single CPU bug takes several weeks of expert analysis (there were only dozens of such experts in the world at any given time) and then six months to fix. With an 18-month timeline, there was little room for mistakes. We had a million suspected issues globally every day, and the experts could analyze only 0.1% of that volume. A pretty significant crisis moment.
With my expert team able to handle only dozens of issues per week, a million a day left me no real choice: the usual playbook wouldn’t fly. By that time, I had already learned Python, some machine learning, and large-scale data processing, and I had seen the power of connecting human experts into that flow. I took my top three experts and told them they would stop working on CPU analysis. “I need your wisdom a million times a day, not once a week.” They played ball. The principle was simple:
We created what we like to call “signatures”: rule-based labeling scripts in which the experts encode their knowledge and insights, automatically eliminating all known issues and passing only genuinely suspicious cases to the rest of the team. Within three months, the system handled the full volume with 99% automation and became the hero of the day, along with the team that created it; three years later, it was standard in every Intel lab. This was my first success in building and scaling a large-scale expert system.
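To make the idea concrete, here is a minimal sketch of what signature-based triage looks like. The issue fields, signature names, and rules below are invented for illustration; Intel’s actual tooling was, of course, far richer:

```python
# Minimal sketch of signature-based triage (illustrative; not Intel's tooling).
# Each "signature" encodes one expert rule that recognizes a known issue.

from dataclasses import dataclass

@dataclass
class Issue:
    test_name: str
    error_code: str
    log_snippet: str

def known_timeout(issue: Issue) -> bool:
    # Expert insight: this timeout only occurs on a known legacy configuration.
    return issue.error_code == "TIMEOUT" and "cfg_legacy" in issue.log_snippet

def known_harness_flake(issue: Issue) -> bool:
    # Expert insight: harness crashes with this code are infrastructure noise.
    return issue.error_code == "HARNESS_CRASH"

SIGNATURES = [
    ("known-timeout", known_timeout),
    ("harness-flake", known_harness_flake),
]

def triage(issues):
    """Auto-dismiss anything matching a signature; escalate the rest to experts."""
    escalated, dismissed = [], []
    for issue in issues:
        match = next((name for name, rule in SIGNATURES if rule(issue)), None)
        (dismissed if match else escalated).append((issue, match))
    return escalated, dismissed

# Example: only the unmatched issue reaches a human expert.
issues = [
    Issue("mem_stress", "TIMEOUT", "run on cfg_legacy board"),
    Issue("gfx_render", "HARNESS_CRASH", "harness segfault"),
    Issue("cache_coherency", "DATA_MISMATCH", "unexpected value at 0xFF10"),
]
escalated, dismissed = triage(issues)
print(len(dismissed), "auto-dismissed;", len(escalated), "escalated")
```

The experts’ knowledge lives in the rules, so the system scales their judgment to millions of cases a day while they keep refining the rules on whatever slips through.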
By 2013, I had started spending time on two main areas:
Founding a local Intel IoT accelerator (with the push of my manager, who embraced my crazy ideas, and in partnership with my Dataloop co-founder, Avi Yashar)
Building the first AI (deep learning) team at the Intel Development Center in Israel
Through the work on Intel’s IoT accelerator, I was exposed to hundreds of startups and technologies every year. I helped dozens of them scale their data operations, management, and analytics using Intel processors. With every case, I learned a bit more about the different data types, their potential business value, and the challenges of extracting that value from data.
The AI team launch was a different story. I had watched AlexNet change the field in 2012 (the deep-learning breakthrough that started the current AI hype), and together with Intel’s global strategy office, we understood something big was coming. No one was willing to fund it internally back then (walking into a room in 2013 and saying “AI” made people laugh rather than ask to join), so I funded the team myself out of my core processor team, with some help from the Intel strategy office, again backed by my manager, who had gotten used to giving me creative freedom.
I spent a few years on deep learning and IoT, learning that my passion, human-expert data-driven systems, was on the path to becoming a major global technological need. Together with Avi, I left Intel and started Dataloop.ai, a company building a deep-learning data platform. We founded Dataloop on three major pillars:
Data development
The world is moving towards a profession referred to as “data development”. This profession will have many sub-roles, methodologies, and processes, and we should expect every known software paradigm to have an equivalent data paradigm.
Data labeling by humans is at the core of AI
The human experts doing the data labeling are a critical part of the system’s intelligence; they are its source of truth. The need for human intelligence will remain essential for decades to come, and successful learning systems will not try to replace humans but will work alongside them.
Data loops are critical for intelligence and continuous learning
Intelligence is the ability to adapt to new conditions. Machines are not expected to learn entirely by themselves in the coming decade(s), which means humans will have to help along the way, giving guidance as reality (the environment) changes. We can expect a future of machines processing data while continuously learning from human feedback throughout the entire life cycle of an application. We believed this principle was so powerful that we named Dataloop after it.
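To make the loop concrete, here is a minimal, self-contained sketch of a data loop: the machine handles confident predictions, humans label the uncertain ones, and those labels feed the next training cycle. All names are illustrative; this is not the Dataloop platform’s API:

```python
# A minimal, self-contained sketch of a continuous data loop.
# All names are illustrative; this is not the Dataloop platform's API.

import random

class TinyModel:
    """Stand-in model that reports a label with a confidence score."""
    def __init__(self):
        self.examples = []                    # accumulated human-labeled data

    def predict(self, item):
        # Pretend prediction: confidence grows as human feedback accumulates.
        confidence = min(0.99, 0.5 + 0.05 * len(self.examples))
        return "safe", confidence

    def fit(self, batch):
        self.examples.extend(batch)           # "retrain" on human feedback

def human_label(item):
    """Placeholder for the human expert in the loop."""
    return random.choice(["safe", "unsafe"])

def run_data_loop(model, stream, threshold=0.8):
    """Machine handles confident cases; humans label uncertain ones,
    and their answers feed the next training cycle."""
    feedback = []
    for item in stream:
        label, confidence = model.predict(item)
        if confidence < threshold:            # model unsure: route to a human
            label = human_label(item)
            feedback.append((item, label))
    model.fit(feedback)                       # continuous learning from feedback

run_data_loop(TinyModel(), [f"frame_{i}" for i in range(10)])
```

As the model improves, fewer items fall below the threshold, so the humans’ effort naturally shifts to the hardest, newest cases, which is exactly where their intelligence is needed.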
For the past four years, Dataloop’s amazing team has been dreaming, building, defining, labeling, and managing hundreds of datasets, across most of the verticals you can think of.
I will share our experience in this book and hope you will find it useful.
I am sharing these learnings as we learn them ourselves:
This is probably the most comprehensive methodology book on the matter
It would be significantly different if I wrote it again in three years; industries are being born as we move.