September 27, 2021

What Data‑Centric AI is Not

In the recent buzz around data-centric AI, we see some confusion about what exactly data-centric AI is and, equally important, what it’s not. We work in this space and try to learn as much as we can about other work in the community (please engage and contribute to the Data-Centric AI resource¹), so we want to share our perspective on the topic. Of course, we don’t claim to have all the answers; please send notes so we can learn more!

First things first, the importance of data is not new: there are well-established mathematical, algorithmic, and systems techniques for working with data, which have been developed over decades. What is new is how to build on and re-examine these techniques in light of modern AI models and methods; just a few years ago, we did not have long-lived AI systems or the current breed of powerful deep models. The exciting opportunity is that, with this rich set of techniques, it may be easier to make theoretical progress in AI by reasoning about data instead of model architectures (e.g., number of layers or dimensions). But it's not “either-or”: data-centric AI does not mean we should stop paying attention to models; there's a balance. It's about bringing focus to how data can shape and systematically change what a model learns, sometimes with even more impact than architectural changes.

Focusing on the data is not new

People have noted the unreasonable effectiveness of data since machine learning prehistory. That data is an important record of human activities is not a novel observation. A key moment for me was reading about the world-wide telescope project from the Turing-Award Laureate Jim Gray, which became the Sloan Digital Sky Survey. Jim’s basic observation: data from all the telescopes hits hard disks. If we strap all those hard disks and data together, we could have the world’s biggest and best telescope. Data FTW… in 2002!

So data being important isn’t new, but even in the narrower sense, many of the techniques in data-centric AI are not new. Let’s look at a few of the most important techniques:

Data of questionable quality can still be quite functional!

Using lower quality, or “silver-standard”, training data goes back to at least the 90s. Personally, I was inspired by Hearst patterns, Jurafsky and Mintz, and many more (please NLP-people-who-love-to-source-things send more pointers).
I just learned that the ideas of data augmentation arguably go back to Rosenblatt. They feature prominently in work from the 90s as well.
Monitoring your data to discover errors has been standard practice for decades, and the practice builds on feature selection ideas.
Popular techniques in active learning again date back at least to the 90s.

Adjacent fields like data integration and data management have tons of expertise we can build from!

And fields adjacent to ML, such as data integration and data management, have long studied techniques and developed systems for data visualization, discovery, summarization, and more.

Even if techniques for working with data are not new, they can still have an impact. Deep learning wasn’t a new idea in 2014, but facts on the ground changed (like the hardware and scale of available data) and its revival had a massive effect. So why might now be the time for data-centric AI?

Why now? New facts on the ground.

The balance between different avenues of progress swings back and forth

The balance between model-centric and data-centric progress swings back and forth over time. It’s nuanced, and there isn’t a need to have artificial hot takes about which to focus on. In recent years, the development of powerful AI models was thrilling! I am a builder first, and thanks to the work of very smart people, I could help build awesome demos and services that billions of people use. I’m infinitely grateful for those amazing ideas embodied in models!

There isn't a need for artificial hot takes!

But the practical pendulum may be swinging back for a key reason: improving quality by inventing a new model is more difficult than improving by using techniques that manipulate the data. We have decades of techniques for reasoning about data; we’ve made much less progress in understanding neural network architectures.

So great, focusing on the data is a path forward. But if we have our same old techniques to work with data, why is there so much excitement around data-centric AI now? Why is it challenging and interesting? Well, the facts on the ground have changed in important ways.

Modern models can offer much better performance out-of-the-box, but as they are applied to a wider set of domains and more complex problems, we need to understand what they struggle to learn automatically. Often, with domain knowledge about our problem, we can discover the failure modes of our ML systems, but the question of how we fix the system most effectively remains. Should we label data? Write some code? You could try to constrain the model (and we tried), but it very rarely worked as well as encoding that domain knowledge in the data instead. In some sense, the premise of data-centric AI is that the most effective way to build and maintain domain knowledge is a process around the data, not the model code. It’s data management, not software development.

AI systems are not static (like paintings!), but are long-lived and require maintenance.

AI has moved from isolated applications to software artifacts, i.e., systems that we use and maintain over time. This is similar, for example, to how banking software moved from a complex combination of disjoint code to simpler solutions like relational databases, allowing us to focus on the data in the application, not the code around it. In this view, data-centric AI is just common-sense engineering, requiring us to set up processes for monitoring and improving quality. The challenge is that we don’t have any of the procedures, norms, or theory you’d expect of an engineering discipline. They didn’t exist because, up until recently, we didn’t need them: it wasn’t that long ago that even flagship applications like Google search were entirely manual. So the central challenge of data-centric AI is to develop the theory, algorithms, and norms around how to manage the data for AI systems.²

Then data-centric AI is about techniques to make the process of building and then maintaining an AI system over time a little less awful? As far as I can tell, all of systems research is just making things less awful. And on the systems note, again, data-centric AI is not about reinventing the wheel; there's lots of expertise in adjacent systems fields to tap into around working with data at scale and managing the data lifecycle.

Why might data-centric AI be an interesting intellectual bet?

Our bet is that many data-centric AI principles are going to be discovered in the next few years. Why might this be an interesting bet?

We've got a rich set of tools for data, decades worth!

Technical push. Perhaps too technical for this high level discussion, but statistics gives us decades of theory about how to handle the key objects in data using sophisticated mathematical tools like data distributions and their associated geometry. We lack those tools for reasoning about neural networks and so it might be easier to make theoretical progress by reasoning about the data.
- This is a major change --- when areas of math came up with problems so sophisticated they couldn’t describe the optimum, they changed radically (e.g., PDEs). We can lean on tools with a basis in data to make headway on complex problems in ML.
Application pull. As modern ML models deliver increasingly impressive performance out-of-the-box, they are being leveraged by non-ML experts in an ever-expanding set of domains. When these models fail in minor or major ways in these new settings, it’s important to have the domain expert in the loop, incorporating her knowledge to improve the system. Notably, domain experts are much more familiar with the data than the model details and can iterate on the data in a few ways.

Domain experts might find the data more familiar to change than the model code

A domain expert might have a bunch of rules or facts about her domain, which can be useful to leverage in the ML pipeline. For our own part, the initial Snorkel project tackles this problem by combining and building on previous technical ideas, and delivering a system that makes it easy to incorporate expert-specified rules in the dataset construction process.
Domain experts can also identify when models are failing. To make this process systematic and efficient, we require tools to help the experts design, analyze and monitor slices of data. We also need techniques for updating the ML pipeline based on expert feedback. Our own work tackles such problems in model evaluation from algorithmic and systems directions.
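To make the two items above concrete, here is a minimal sketch, not Snorkel's actual API. It assumes a made-up task (flagging urgent support tickets), and every rule, slice, and function name in it is hypothetical. Expert rules are written as small functions that vote or abstain, votes are combined by a plain majority (a stand-in for a learned model of rule quality), and accuracy is then reported per expert-defined slice rather than as one aggregate number.

```python
from collections import Counter, defaultdict

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical expert rules for flagging urgent support tickets.
def rule_mentions_outage(text):
    return POSITIVE if "outage" in text.lower() else ABSTAIN

def rule_says_thanks(text):
    return NEGATIVE if "thanks" in text.lower() else ABSTAIN

def rule_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

RULES = [rule_mentions_outage, rule_says_thanks, rule_many_exclamations]

def majority_label(text, rules=RULES):
    """Combine rule votes by simple majority; abstain on no votes or a tie."""
    votes = [v for v in (rule(text) for rule in rules) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def accuracy_by_slice(examples, predictions, slice_fns):
    """Return {slice_name: accuracy} over the examples each slice selects."""
    hits, totals = defaultdict(int), defaultdict(int)
    for (text, gold), pred in zip(examples, predictions):
        for name, in_slice in slice_fns.items():
            if in_slice(text):
                totals[name] += 1
                hits[name] += int(pred == gold)
    return {name: hits[name] / totals[name] for name in totals}

# A tiny hand-labeled evaluation set (illustrative data only).
examples = [
    ("Full outage in region A!!", POSITIVE),
    ("Thanks, all good now", NEGATIVE),
    ("Site is down", POSITIVE),
    ("thanks for fixing the outage", NEGATIVE),
]
predictions = [majority_label(text) for text, _ in examples]

# Slices an expert might care about (hypothetical definitions).
slices = {
    "short": lambda t: len(t.split()) <= 3,
    "mentions_outage": lambda t: "outage" in t.lower(),
}
report = accuracy_by_slice(examples, predictions, slices)
```

In this toy run, the rules abstain on "Site is down" and tie on the last ticket, so the per-slice report surfaces exactly where the expert might add or refine a rule. An overall accuracy figure would hide that.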

ML models are being leveraged in an ever expanding set of domains!

Consider one example from our work on weak supervision in Snorkel. Statistical inference is about combining different pieces of uncertain information to reach a conclusion; Snorkel’s first idea was to apply this idea to the training data collection process: to model the process and protect against certain errors. A major issue we saw was that people needed to combine different sources of less-than-perfect supervision that each had different levels of quality, and often the errors in these sources were correlated with each other in complex ways. So we developed a theory of when we could learn the quality and correlations of disparate data sources without requiring any labeled data. We could prevent overconfidence in individual bad sources and prevent several highly correlated low-quality sources from overwhelming good evidence. For ML nerds, the theory is based on latent variable graphical models and uses tools like effective rank from Vershynin to get the right scaling with data (check out these notes to go deeper). To me, these are foundational ideas: classes of mathematical models that capture the training process to protect against errors (this is what statistics is for!). Snorkel was much more than the theoretical nuggets: it allowed us to think about how to elicit domain knowledge for the problem of training set construction and maintenance.
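To give a feel for how source quality can be learned without labels, here is a deliberately simplified sketch. It is not the Snorkel algorithm described above: it assumes exactly three sources whose errors are conditionally independent given the true label (so it ignores the correlated case entirely), binary labels in {-1, +1}, and sources that are better than chance. Under those assumptions the expected agreement between sources i and j factorizes as a_i * a_j with a_i = 2 * p_i - 1, so each accuracy p_i can be recovered from pairwise agreement rates alone. All function names and the simulated data are hypothetical.

```python
import math
import random

def simulate_votes(n, accuracies, seed=0):
    """Draw hidden labels in {-1, +1} and independent noisy votes per source."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice((-1, 1))  # the true label is never returned
        rows.append([y if rng.random() < p else -y for p in accuracies])
    return rows

def mean_agreement(rows, i, j):
    """Empirical E[v_i * v_j]: +1 when two sources agree, -1 when they differ."""
    return sum(r[i] * r[j] for r in rows) / len(rows)

def estimate_accuracies(rows):
    """Estimate the accuracies of three sources from their agreements alone.

    Assumes conditionally independent, better-than-chance sources, so that
    E[v_i * v_j] = a_i * a_j, and hence a_0 = sqrt(m01 * m02 / m12), etc.
    """
    m01 = mean_agreement(rows, 0, 1)
    m02 = mean_agreement(rows, 0, 2)
    m12 = mean_agreement(rows, 1, 2)
    if min(m01, m02, m12) <= 0:
        raise ValueError("agreement rates violate the better-than-chance assumption")
    a0 = math.sqrt(m01 * m02 / m12)
    a1 = math.sqrt(m01 * m12 / m02)
    a2 = math.sqrt(m02 * m12 / m01)
    # Convert from the [-1, 1] agreement scale back to accuracies in [0, 1].
    return [(1 + a) / 2 for a in (a0, a1, a2)]

# Simulated sources with accuracies the estimator never sees directly.
true_accuracies = [0.9, 0.75, 0.65]
votes = simulate_votes(100_000, true_accuracies)
estimated = estimate_accuracies(votes)
```

With enough unlabeled examples, `estimated` should land close to `true_accuracies`, and those estimates can then be used to weight each source's vote. Handling correlated sources, abstentions, and class imbalance is where the latent variable models mentioned above come in.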

Data-Centric AI involves incorporating domain knowledge in a systematic and efficient way. (Image from Saab et al., 2020)

More broadly, our conception of data-centric AI revolves around these ideas: how should we best encode knowledge in data? How do we develop engineering norms around that data to prevent errors and improve quality? Are there entirely new techniques for how people can specify their knowledge more efficiently, and how do these techniques depend on the model being used? Data-centric AI is not just tweaking data and observing model improvements or a collection of ad hoc tricks; it’s understanding the new mathematical, algorithmic, and systems foundations of these methods. The last few years have seen deeper data-driven understanding of weak supervision, data augmentation, what representations learn, and more. These are foundational intellectual areas that use the idea that we analyze the data rather than code or the model process.

Data-centric AI is conceptually rich. At its best, it changes your point of view of where an engineer should spend her time. Our view will certainly be informed by work both in academia and industry. Fundamentally, data-centric AI is an approach to building the next generation of long-lived AI systems.

^{1. Here is an evolving collection of content from the data-centric AI community, which might be useful if you're looking to learn more about the space. One goal in this post is to think more about what these works build on.↩}
^{2. This has been part of our lab’s own quest to make AI into an engineering discipline. We’re certainly not alone; check out Mike Jordan’s great talk at the first MLSys.↩}