Feb 28, 2020 · 5 min read
Software 2.0 and Data Programming: Lessons Learned, and What’s Next
Four years ago, the idea that you could use a statistical process to help people label training data from noisy sources sounded absolutely crazy. But since then, the ideas that we formalized into data programming and weak supervision in Snorkel have had surprisingly broad impact. In fact, with help from our awesome collaborators, there’s a good chance that in the last five minutes, you’ve used a product that uses weak supervision -- Gmail, AI products at Apple, and search products at Google all use it! We’re also happy to see that these ideas have helped democratize access to the modern machine learning stack -- their adoption not only by big technical players in industry but also by domain experts like radiologists and journalists has been beyond anything we expected. Those collaborations have been critical and fun, and we couldn’t have done this work without them!
But this is just the start of the journey. We’re taking some time to reflect back on the state of the world four years ago and the lessons we’ve learned since then, and we’re more excited than ever for the next evolution in how people build software.
The Snorkel Journey: It’s the Data, Not the Models
At the start of our journey with weak supervision, we thought that there was about to be a fundamental shift in how people build machine learning systems. Four years ago, everyone was focused on how to build better and better models, but we made a bet that much of the deep learning modeling stack was going to become a commodity. Engineers would focus their time on what they put into the model instead of what the model was made of.
We realized a major bottleneck in that process would be getting training data. So we built Snorkel to manage the training data acquisition process. In a world in which people would increasingly specify program behavior through training examples instead of code, we reasoned that building and managing training data would be critical. We started out by calling this paradigm “data programming” but eventually migrated to the (much better) name Software 2.0 after Andrej Karpathy wrote his blog post and visited the lab.
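To make the data-programming idea concrete, here is a minimal sketch in plain Python (illustrative only, not Snorkel's actual API): domain experts write noisy heuristic "labeling functions" over unlabeled examples, and their votes are combined into training labels -- here by simple majority vote, whereas Snorkel fits a statistical label model to estimate each function's accuracy.

```python
# Minimal sketch of data programming: noisy labeling functions vote on
# unlabeled text, and votes are combined by majority. (Illustrative only --
# Snorkel combines votes with a learned label model, not a majority vote.)

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    # Heuristic: messages with URLs are often spam.
    return SPAM if "http" in text else ABSTAIN

def lf_short_reply(text):
    # Heuristic: very short messages are usually legitimate replies.
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_money_words(text):
    # Heuristic: money-related keywords suggest spam.
    return SPAM if any(w in text.lower() for w in ("free", "winner", "$$$")) else ABSTAIN

def majority_label(text, lfs):
    # Collect non-abstaining votes and take the majority.
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_link, lf_short_reply, lf_money_words]
print(majority_label("You are a winner! Claim at http://example.com", lfs))
print(majority_label("ok sounds good", lfs))
```

Each heuristic is cheap to write and individually unreliable; the value comes from combining many of them over a large unlabeled corpus to produce training labels at scale.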
We’ve been really excited to see Snorkel get adopted, from multiple industrial deployments to uses in health care. We think this speaks to a need for this type of work, and we’ve been excited to play some role in it with our collaborators -- the first users were critical in helping us understand the problem space, and we wouldn’t be in the same place today without their patience and trust. Thank you! It’s also been really interesting to study from both the ML and systems perspectives!
The Undiscovered Country
However, we’re really still just at the beginning of these explorations. Labeling training data is only one part of the model creation process. A lot of work goes into getting machine learning models to work in the real world, and very little of it is modeled or understood in a formal way.
Here are a few new directions that we’re just starting to get into. As they grow and develop, we’ll be particularly excited to get feedback about these ideas from our amazing colleagues and collaborators:
- Data Augmentation: Machine learning engineers still use manually-tuned data augmentations to encode invariants into their data -- we’ve taken a few small steps in modeling and understanding this process, but we’re still in the very early stages (check out our series on this topic for a deep dive).
- Observational Supervision: We are only at the very beginning stages of thinking about how to use even weaker, completely passive forms of supervision like eye tracker data.
- Model Validation: Model validation and maintenance are critical parts of the model deployment pipeline but are still poorly understood, especially on critical slices of data.
- Hidden Stratification: As machine learning models get deployed in more real-world applications, we may need to be wary of problems like hidden stratification and develop methods that help us build confidence in our models.
- Understanding Embeddings: Pre-trained embeddings are ubiquitous in machine learning deployments, but questions like when embeddings should be used, how exactly they provide lift, and what geometries to use for different applications are still poorly understood.
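The data augmentation direction above can be illustrated with a tiny sketch (our own example, not from Snorkel): an augmentation encodes an invariance -- here, the assumption that synonym substitution should not change a sentence's label -- by generating extra training examples. The synonym table is hypothetical.

```python
import random

# Illustrative sketch of manually-tuned text augmentation: substitute
# synonyms to encode the invariance "meaning-preserving word swaps
# should not change the label". The synonym table is a toy example.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

def augment(sentence, rng):
    # Replace each word that has known synonyms with a random synonym.
    out = []
    for word in sentence.split():
        subs = SYNONYMS.get(word.lower())
        out.append(rng.choice(subs) if subs else word)
    return " ".join(out)

rng = random.Random(0)
print(augment("the movie was good", rng))
```

Hand-tuning which transformations to apply, and how aggressively, is exactly the process the bullet above argues should be modeled and understood rather than left to trial and error.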
As a research group, we couldn’t be more excited by all these research questions -- every time there’s something we don’t understand, we know that it’s an opportunity to explore better ways for people to build machine learning systems. We’ve been very lucky to work with great collaborators, and their feedback has been critical in helping us develop and validate our ideas. We are so excited to continue working with them to help figure out what the next generation of machine learning systems looks like.