First-Mile vs. Last-Mile AI Systems in the Era of Foundation Models

Chris Ré.

We’re excited about foundation models (FMs) because they will change how we build AI systems. This document frames where we have found FMs to be useful in AI systems, where traditional machine learning (ML) fits in, and where exciting gaps still remain. One rough-and-ready framing is:

FMs are a first-mile technology, while traditional machine learning is a last-mile technology.

Of course, this is overly simplistic. This article will unpack whether this framing is useful and put some of our research directions into context, and we’d love to hear about yours!

FMs as a First-Mile Technology

FMs have an unbelievable out-of-the-box feel. Right now, FMs are a great interactive, human-in-the-loop technology (e.g., ChatGPT). Their interface dramatically improves a person’s ability to formulate and refine a question, brainstorm or rewrite copy, quickly answer factoid questions, and handle other search-like tasks. We evaluate these models' answers by how fluent, plausible, surprising, or just flat-out interesting they are. This goes way beyond search and exploration!

On the other hand, much of machine learning in production is evaluated on finely sliced quality metrics, the ability to maintain and predictably improve quality in the face of changing data, and the runtime performance and cost to achieve that quality. On these metrics, traditional machine learning is often still better than the current FMs by orders of magnitude [1].

As academics, we naturally ask: is the best of both worlds possible?

The State of Play

Fundamentally, the radical change in FMs was focusing on generative ability rather than homing in on a small number of tasks. This is a significant departure from conventional ML wisdom, in which you narrow a task until each step is so precisely specified that you can obtain incredibly high quality. This is kind of computer science 101: refine the specification and decompose the problem with modularity. FMs go in exactly the opposite direction. This leads to one of my favorite recent paper titles, lightly edited:

FMs are the jack of all trades, master of none.

Right now, FMs are generalists, but their quality tops out on specific tasks. As Yoav Goldberg succinctly put it: "why is 85% good?"

Traditional machine learning offers tools and techniques to obtain supervision and correct errors, but applying those techniques to FMs can be expensive or unwieldy. One optimistic take (and our academic one) is to look for a single approach that combines the best of both worlds. But it’s not at all obvious that this unification can happen.

Two Worlds Colliding

The current zeitgeist (and our research program!) is to try to improve FMs. From an academic perspective, FMs are the newer technique, and so a natural place for academia to solidify its understanding. But one of my favorite questions about a research program is: Why might we fail? Why might it not make sense to unify these different worldviews?

A few reasons come to mind:

  1. A technology split has endured in prior decades (namely, search and databases).

    Search (e.g., Google, Elastic) is magical in many applications, but it didn’t replace relational databases, and vice versa! For all its miraculous properties, you don’t use search for your bank account or to answer aggregate business questions (e.g., How much did we sell? Who is likely to churn?). Those questions require precisely aggregating and analyzing a large number of facts.

    It’s worth pointing out that their quality goals are different: traditional IR systems focus on subjective measures like relevance. This is fantastic for human-facing applications, but those aren’t all the applications on the planet. Classical databases are part of automated pipelines, so correctness is formally and precisely specified.

    These two technologies have coexisted for decades, in spite of lots of fascinating research to bridge the two. In some sense, search systems were for solving "first-mile problems" (finding out what information is available and getting a rough summary of it), while database systems were designed for resolving "last-mile problems".

  2. Closing error gaps is harder than it might first appear.

    To go from a first-mile technology to a last-mile technology, we’re asking for orders-of-magnitude reductions in error from FMs. Going from 85% to 99% accuracy means reducing error by a factor of 15! (A small worked example follows this list.)

    We’ve seen this cognitive bias play out in areas like self-driving cars, where human error rates are something like one accident per 500M miles, which corresponds to roughly a 1,500,000x reduction in error [2]. Most folks like me have difficulty putting these orders of magnitude into context, but it’s a big gap from working to working reliably.

  3. When are the costs of fluency and generality worth it?

    If you have a specific task, you can build a model that is 100x or 1,000x more efficient than our current FMs. If you are willing to reformulate questions a few times, you can get away with models that are 15x smaller or more [3].

    More glibly, when you’re classifying loan documents, do you care if your model can generate fun limericks? If you’re doing expense-report processing, do you care whether it can generate diverse answers? Personally, I think the world would be more fun if point-of-sale devices could summarize French literature, but I certainly won’t be the one paying for that capability. This suggests there are some fun tradeoffs between model size, evaluation, and cost.
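
To make the error-gap arithmetic from point 2 concrete, here is a tiny worked example. The 85% and 99% figures are the ones quoted above; the 99.9% line is added purely for illustration.

```python
def error_reduction_factor(base_accuracy: float, target_accuracy: float) -> float:
    """How many times must the error rate shrink to move between two accuracies?"""
    return (1.0 - base_accuracy) / (1.0 - target_accuracy)

# Going from 85% to 99% accuracy: 15% error -> 1% error, i.e., a ~15x reduction.
print(error_reduction_factor(0.85, 0.99))   # ~15.0

# For illustration: going from 85% to 99.9% is already a ~150x reduction.
print(error_reduction_factor(0.85, 0.999))  # ~150.0
```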

Now, all this said, FMs seem much more powerful than traditional search, so I think we’re right to be optimistic. FMs are improving, and our research goal is to understand and hopefully eliminate these gaps, but we also see that it’s possible these technologies will coexist in modern AI systems for some time.

It’s worth pointing out that FMs are still proofs of concept, and we’ve already seen much smaller models trained on much more data (LLaMA, Stanford Alpaca) reduce some of these costs, not yet by orders of magnitude, but hopefully soon!

History Rhymes and Data Still Matters

Purely self-supervised training was a massive leap forward. But the most recent leaps in FMs look downright classical: they come from things that look like (weakly) supervised data, whether asking humans for feedback (RLHF), massive multi-task instruction tuning (FLAN-T5), or honeypots of data collection (the OIG and SHP datasets). My prior was that FMs would drive us further into data-centric AI (blog), since currently the only real way to program an FM is via its data.

  • Data programming is still required. This is not surprising: the data for a domain encodes both the problem and the expertise about it. As a practical matter, it seems to be far easier to change data than code in predictable ways. Even the largest labs seem to rarely change their code, but they’re iterating rapidly on their datasets. As we said at the start of the data-centric AI era (informed by the Snorkel folks): data is the primary way to program AI systems (a minimal sketch of this idea follows this list). Although I’m biased, this is partly why I’m such a strong believer in the lasting power of systems like Snorkel as a foundation model operations (FMOps) platform to help manage that shift from “first-mile” to “last-mile”, and with it the 100x performance gains and quality management. Such an exciting time!

  • Generality and expertise mean your data is more valuable than ever. You wouldn’t want your doctor to be a smart undergraduate with no medical training, and you wouldn’t want a doctor who had read the medical-school textbooks but dropped out of high school. You need both!¹ FMs are in some sense the great liberal-arts education, but you still need to specialize them to your particular area of interest. If you’re a company, your unique way of doing things is contained in your data, and likely no one else’s!
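
As promised above, here is a minimal sketch of the data-programming idea: write a few noisy, task-specific labeling functions over unlabeled text and aggregate their votes into training labels. The labeling functions and the simple majority vote below are toy stand-ins of my own; real systems like Snorkel learn a label model over the votes rather than taking a raw majority.

```python
# Toy data-programming sketch: labeling functions vote on unlabeled examples,
# and their noisy votes are aggregated into training labels.
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_mentions_refund(text: str) -> int:
    # Heuristic: asking for a refund suggests a complaint.
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_mentions_thanks(text: str) -> int:
    # Heuristic: thanking support suggests a non-complaint.
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text: str) -> int:
    # Heuristic: lots of exclamation marks suggest a complaint.
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_thanks, lf_many_exclamations]

def weak_label(text: str) -> int:
    """Aggregate labeling-function votes by majority; abstain if nothing fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

documents = [
    "I want a refund!! This is unacceptable!!",
    "Thank you for the quick help.",
]
print([weak_label(doc) for doc in documents])  # [1, 0]
```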

What’s next? There is so much to do, and it’s a great time to be in academia. As an aside, my view is that academics should be optimistic about how this technology has changed and advanced our understanding of what’s possible. It’s a positive-sum game: companies have been financing amazing work in AI for us all to learn what’s possible, and academia has reciprocated with great students and ideas. Let’s keep it going! There are exciting lines of work to pursue and gaps to understand. Here are some of our recent ideas, and we’d love to hear yours!

  • Best of both worlds? We’ve done some work on how to combine FMs and weak supervision (Liger, AMA, and FM-in-the-loop WS by Snorkel). It has allowed us to reach new state-of-the-art results more quickly and with dramatically simpler models! (A toy sketch of the prompts-as-weak-voters idea follows this list.)

  • Making FMs smarter and easier to tune. Longer contexts make FMs smarter (blog, Hyena) and allow them to ingest many more examples. Maybe the future of weak supervision systems is to generate the k examples to feed into these prompts? Who knows!

  • Making FMs efficient in new ways. FMs are inefficient in high-throughput settings, and we’ve recently released tools (FlexGen) to help there, but there is a lot more to do. Maybe we can bring those costs down 1,000x with even more tricks?

  • FMs beyond human-in-the-loop systems? How do we use FM tools in traditional data aggregation? We’ve already seen they can help with many data preparation and related tasks, but those were already closer to the search/IR view of the world in some ways. What about answering traditional queries? Check out DSP and Symphony for some early work!

  • FM-first tools for data work? We’re exploring how FMs can be used in a traditional dataframe workflow, with a dataframe API that runs FMs to perform summarization and data analysis. What new visualizations will data workers need to audit and work with FMs? Check out our open-source Meerkat project! (A small pandas-based sketch of this kind of workflow also follows this list.)
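
As a companion to the first two bullets, here is a toy sketch of the prompts-as-weak-voters idea in the spirit of AMA: ask several reformulations of the same question and aggregate the answers. Everything here is illustrative; `call_fm` is a placeholder for whatever FM client you use, not a real API, and the prompt templates are made up.

```python
# Toy sketch: treat each prompt reformulation as a noisy voter and aggregate.
from collections import Counter
from typing import Callable, List

PROMPT_TEMPLATES = [
    "Is the following review positive or negative?\n{text}\nAnswer:",
    "Would the author recommend this product? Yes or no.\n{text}\nAnswer:",
    "Summarize the sentiment of this review in one word.\n{text}\nAnswer:",
]

def normalize(answer: str) -> str:
    """Map free-form FM output onto a small label space."""
    answer = answer.strip().lower()
    return "positive" if answer in {"positive", "yes", "good"} else "negative"

def vote(text: str, call_fm: Callable[[str], str]) -> str:
    """Ask several reformulations of the same question, then majority-vote."""
    answers: List[str] = [
        normalize(call_fm(template.format(text=text))) for template in PROMPT_TEMPLATES
    ]
    return Counter(answers).most_common(1)[0][0]

# A stub "model" so the sketch runs end to end without any FM service.
def fake_fm(prompt: str) -> str:
    return "Yes" if "loved" in prompt else "No"

print(vote("I loved this blender.", fake_fm))  # positive
```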
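
And for the last bullet, a tiny sketch of what running an FM inside a dataframe workflow could look like, using plain pandas and a stubbed-out FM call. This is not Meerkat's actual API; it just shows the shape of the workflow.

```python
# Toy sketch: an FM-backed transform as just another column operation.
import pandas as pd

def summarize_with_fm(text: str) -> str:
    """Placeholder for an FM call that returns a one-line summary of `text`."""
    # Stub so the sketch runs offline: pretend the first sentence is the summary.
    return text.split(".")[0].strip() + "."

df = pd.DataFrame({
    "review": [
        "The battery lasts all day. Setup was painless.",
        "Shipping took three weeks. The box arrived damaged.",
    ]
})

df["summary"] = df["review"].map(summarize_with_fm)
print(df[["review", "summary"]])
```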

We don’t yet know the costs and boundaries of these technologies, but it’s exciting and worth trying to put them into context. If you have better context or analogies, please post them! I think it’s worth trying to frame these things in a way that may be helpful for what comes next!

Acknowledgements

Thank you to Alex Ratner for his insightful comments and feedback. Also thanks to Karan Goel, Avanika Narayan, Dan Fu, Sabri Eyuboglu and Eric Nguyen for their comments and contributions to this post.


  1. h/t PD Singh of SambaNova for this framing!