The Coming Wave of ML Systems

Chris Ré, Piero Molino, Dan Fu, Karan Goel, Fiodar Kazhamakia, and Matei Zaharia

AI and ML products now permeate every aspect of our digital lives--from recommendations of what to watch, to divining our search intent, to powering increasingly-present virtual assistants in consumer and enterprise settings. While quality improvements are the main focus of traditional ML and AI research, a second and arguably less well understood benefit of machine learning is that it can dramatically reshape the practice of building applications. With an eye toward generations of compiler, database, and operating systems work, they may inspire new foundational questions for how to build the next generation of AI-powered systems.

Tools are important. They are the scaffolding of the machine learning revolution: the widespread adoption of tools like PyTorch and TensorFlow (building on earlier academic prototypes like Theano and Torch) enabled users to more easily assemble models due to both well-suited domain-specific languages and a rich collection of building blocks. Supported by large companies, these tools have spawned a rich ecosystem to which new building blocks are contributed almost daily and which even contains tools for deployment (eg TFX and TorchScript). Moving from the era of bespoke AI tools to a shared communal foundation has seen stunning productivity gains--on a personal note, it was wild to live through and modestly contribute to.

With the stunning success of these platforms, these libraries have moved the pain point for engineers who build and maintain these products. To understand what might be next, perhaps we can take a page from computing history? One view is that the current generation of tools are akin to software libraries, but they lack some of the features that distinguish long-lived computing systems, such as:

  • monitoring and lifecycle management (most ML systems only deal with training monitoring),
  • support collaboration of all stakeholders around the life of the product (most ML systems lack a model management solution),
  • end-to-end data flow debugging and monitoring (most ML systems don't manage training data production pipelines)
  • ... and many more ... Understanding this thought has been a driving force behind our recent work. We presented some of our initial ideas in the MLSys keynote and described some of our thoughts for production and research systems.

While we contend entirely new ways of building these systems are possible, we are at the start of this journey. There is preliminary evidence that there’s something here: these new breed of systems have found their way into industry products used by billions of people every day like Google [data programming, information extraction], YouTube [multi-modal], multiple Apple products [Overton], Uber [customer support, food recommendation, Ludwig open-sourced], and many more.

The goal of this post is to introduce the Stanford MLSys Seminar Series to hopefully engage more of the community around ideas to build these systems. If you’re interested in this area or you have a topic you’d like to see, let us know! Please visit the webpage at mlsys.stanford.edu to see our preliminary thoughts and the schedule of our first speakers. We welcome your feedback!

One outcome of the course is to articulate the challenges that we’ve seen, solicit challenges from the community, and try to make the field more accessible for academic research. If we’re lucky, we may just help to spawn the next major subfield of computer science!