Ask Me Anything: Leveraging Foundation Models for Private & Personalized Systems

Simran Arora, Chris Ré.

A key promise of machine learning is the ability to assist users with personalized tasks, which range from common tasks like email classification to niche tasks like “write a song based on my text messages”. Because our personal data, for instance our emails and texts, is often private, a personal system needs to provide strong privacy guarantees, deliver high quality, and remain easy to use and low-cost. Over the past couple of years, we have been excited about how foundation models are changing the classical dividing line between privacy and personalization.

Privacy and quality in tension.

Low training data regime. Because most users lack sufficient data to train robust personal models from scratch locally, most of the effort in building private, personalized systems has focused on:

  • Training over the private data of multiple parties that are interested in the same task, referred to as federated learning (FL). FL is often motivated by common tasks like text-message autocomplete, rather than niche tasks for which it can be difficult to assemble cohorts of users. Intuitively, because information is shared across multiple parties to train the model, users must sacrifice some privacy under FL.
  • Leveraging off-the-shelf public resources (i.e. models and data) and locally modifying them for the personal setting. Intuitively these solutions are fully private: the user takes public resources but gives away no private information. However, the classical mechanisms for updating the public resources, i.e. training on the small amount of available private data, are quite brittle. 1

Retrieving relevant knowledge at inference time. Beyond training on local or pooled data, many ML systems are designed to incorporate massive amounts of dynamically changing information at inference time. For instance, question-answering systems, personal assistants, and retrieval-augmented language models must handle a broad set of user inputs, so approaches that explicitly retrieve relevant information from a data store at inference time tend to drastically outperform approaches that do not use retrieval. Unfortunately, existing work on open-ended retrieval assumes information is retrieved from a single data store, rather than a realistic mix of private (e.g. company code repositories, emails, medical records) and public (e.g. Stack Overflow, Wikipedia, and PubMed) resources. We lack effective ways of handling privacy-quality tradeoffs when operating over multiple privacy scopes at inference time. For instance, systems such as GitHub Copilot are seeing widespread adoption, and to use them, users sacrifice privacy: as users write private code in their private repositories, the data is shipped off-device to a third party that provides code-completion assistance.
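
To make the multi-privacy-scope setting concrete, here is a minimal sketch (not from any of the systems above) of retrieving separately over an on-device private store and a public store and merging the results locally. The `embed` function is a stand-in for a locally hosted encoder, and the documents are made up.

```python
import numpy as np

# Stand-in for a locally hosted text encoder, so private text never leaves the device.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def top_k(query_vec: np.ndarray, docs: list[str], k: int = 2):
    """Score documents by dot-product similarity against the query and keep the top k."""
    scored = [(float(embed(d) @ query_vec), d) for d in docs]
    return sorted(scored, reverse=True)[:k]

# Two privacy scopes: a private store that must stay on-device, and a public store.
private_docs = ["email: the team offsite moved to Friday", "repo note: rotate the auth token weekly"]
public_docs = ["Wikipedia: retrieval-augmented generation conditions a model on retrieved text"]

query = "when is the team offsite?"
q = embed(query)

# Retrieve from each scope separately, then merge on-device, so private documents
# (and ideally the query itself) are never shipped to a third party.
results = top_k(q, private_docs, k=2) + top_k(q, public_docs, k=1)
context = "\n".join(doc for _, doc in sorted(results, reverse=True))
print(context)  # context is assembled locally before being passed to a local model
```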

Users have heterogeneous resources & many users are resource-limited. Some users are large, resource-rich organizations, while others are individuals with laptops and phones. The more we reduce model size, training and inference costs, and communication requirements, the more accessible the system becomes. In parallel, the hardware capabilities of user devices are increasing at a rapid rate!

Foundation models for private and personal systems!

We’ve been excited about the potential of foundation models (FMs) to help develop the next generation of personalized ML systems. Foundation models are massive ML models trained on broad web data in a self-supervised manner. Recent FMs demonstrate an emergent ability, called in-context learning, wherein they can adapt to new tasks at inference time simply by conditioning on a natural language description, or prompt, of the task of interest.
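
For intuition, here is what an in-context learning prompt for a personal task might look like; the emails, labels, and the placeholder model call are made up for illustration.

```python
# In-context learning: the task is specified entirely in the prompt, with no
# gradient updates to the model. The examples and labels below are fabricated.
prompt = """Classify the email as 'work' or 'personal'.

Email: Reminder: the quarterly planning doc is due Thursday.
Label: work

Email: Want to grab dinner on Saturday?
Label: personal

Email: Can you review my pull request before standup?
Label:"""

# completion = some_language_model.generate(prompt)  # expected completion: " work"
```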

Personal tasks in a low training data regime. In Can Foundation Models Help Us Achieve Perfect Secrecy?, presented at AAAI Privacy Preserving AI '23, we were the first to benchmark in-context learning against popular legacy privacy approaches, like federated learning, on the canonical personalized ML benchmarks from the privacy community. Excitingly, we found in-context learning at inference time to be competitive with trained baselines across varied language and vision tasks. Further, a system architecture that performs in-context learning locally to complete personal tasks reveals no personal information whatsoever, providing a stronger privacy guarantee than federated learning with differential privacy.
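
As a rough sketch of that architecture (the model choice and task below are illustrative, not the paper's exact setup): a small open-source model is loaded on the user's device and completes the personal task via a prompt, so no private data is ever transmitted.

```python
from transformers import pipeline

# A small open-source model runs entirely on the user's device (model choice is illustrative).
classify = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")

# The prompt (like the one sketched above) contains the private email; because
# inference happens locally, no personal information is revealed to any other party.
prompt = "Classify the email as 'work' or 'personal'.\nEmail: Can you review my PR?\nLabel:"
output = classify(prompt, max_new_tokens=2, do_sample=False)[0]["generated_text"]
print(output[len(prompt):].strip())
```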

But when we proposed this vision, it was quite a controversial view and is still underexplored in the privacy community compared to approaches like federated learning and differential privacy. For instance, Considerations for Differentially Private Learning with Large-Scale Public Pretraining states: "Arora and Ré [AR22] further argue that LLMs can be personalized to each individual’s personal and sensitive data while incurring no privacy cost (i.e., ϵ = 0) by leveraging the pretrained model’s “zero-shot” abilities. This line of work suggests we are getting close to “solving” private learning. Indeed, as Web-scraped datasets grow larger and larger, the ability of pretrained models to privately adapt (“for free”) to new tasks will only get better. We challenge this view, and critique the public-pretraining and private-finetuning paradigm."

What are the main challenges and critiques for our vision?

  1. Large-scale benchmarking efforts like HELM from Stanford's Center for Research on Foundation Models show that the best foundation models under in-context learning are massive and closed-source. How many organizations and users can locally and privately train and use foundation models?
  2. Do foundation models only help on personal tasks that are similar to the foundation model pretraining distribution?
  3. Is the data used to train foundation models itself collected in a privacy-respecting way?

We’ve been energized by how quickly these challenges are being addressed by the community! In our research:

Addressing the resource bottlenecks to improve accessibility. First, we studied how to improve the effectiveness of small, open-source LLMs under in-context learning. In our ICLR 2023 spotlight work, Ask Me Anything: A simple strategy for prompting language models, we developed a method called AMA, which enables an average performance lift of 10.2% over the prior in-context learning baseline across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M-175B parameters). This simple strategy (detailed further in this post) enables the open-source GPT-J-6B model to match or exceed the performance of few-shot GPT3-175B, OpenAI’s popular proprietary model, on 15 of the 20 popular benchmarks used for evaluation in the original GPT3 paper. Averaged across these tasks, GPT-J-6B outperforms few-shot GPT3-175B.
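
For intuition, here is a toy sketch of the AMA recipe on a sentiment task: the input is reformatted into multiple question-style prompts and the per-prompt predictions are aggregated. The paper aggregates with weak supervision; the plain majority vote and the placeholder `lm` function below are simplifications.

```python
from collections import Counter

def ama_predict(lm, review: str) -> str:
    """Reformat the input into several question-style prompts and aggregate the answers."""
    prompt_formats = [
        "Is the following review positive? Answer Yes or No.\nReview: {x}\nAnswer:",
        "Did the reviewer like the product? Answer Yes or No.\nReview: {x}\nAnswer:",
        "Question: Is the sentiment of this review positive?\nReview: {x}\nAnswer (Yes or No):",
    ]
    # One prediction per prompt format; `lm` is a placeholder for any open-source LM's generate call.
    votes = [lm(p.format(x=review)).strip().split()[0] for p in prompt_formats]
    # Simplified aggregation by majority vote (AMA itself learns to weight the prompts).
    return Counter(votes).most_common(1)[0][0]

# Example usage with a fake model that always answers "Yes":
print(ama_predict(lambda prompt: " Yes", "Great battery life and a sharp screen."))
```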

Meanwhile, in Evaporate, we showed how to asymptotically reduce the cost of processing data with foundation models for the task of generating structured views of heterogeneous, semi-structured data stores (e.g. webpages, emails, and FDA and medical reports). Averaged across 16 real-world data stores of 10k documents each, our method gives a 110x reduction in the number of tokens the foundation model must process to complete the task.
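
The intuition behind the savings, sketched loosely below: rather than having the foundation model read every document, prompt it on a small sample to synthesize cheap extraction functions, then run those functions over the rest of the data store. The hard-coded extractor and toy documents are illustrative stand-ins for model-synthesized code.

```python
import re

def synthesize_extractor(sample_docs):
    # In practice the foundation model would be prompted on sample_docs to write this
    # function, e.g. fm(f"Write a Python function that extracts 'Manufacturer' from: {sample_docs}"),
    # and the candidate functions would be validated on held-out samples.
    def extract_manufacturer(doc: str):
        match = re.search(r"Manufacturer:\s*(.+)", doc)
        return match.group(1).strip() if match else None
    return extract_manufacturer

docs = [
    "Device: X100\nManufacturer: Acme Corp\nClass: II",
    "Device: Y200\nManufacturer: Globex Inc\nClass: I",
]
extractor = synthesize_extractor(docs[:1])              # the FM sees only a handful of documents
table = [{"Manufacturer": extractor(d)} for d in docs]  # the cheap function processes the rest
print(table)
```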

Narrowing in on the 5 of 20 tasks on which the small open-source models did not outperform GPT3-175B in our AMA study, we found that these were the tasks requiring the model to memorize a massive amount of factual (and temporally changing) knowledge about a broad range of topics. We focused on addressing this bottleneck in our concurrent work:

Retrieving relevant knowledge at inference time and supporting tasks that need information beyond the pretraining distribution. In our TACL 2023 paper, Reasoning over Public and Private Data in Retrieval-Based Systems, written in collaboration with Meta AI, we were the first to study the problems of (1) retrieving over multiple privacy scopes and (2) retrieving over multiple data distributions (intuitively, public and private data come from different distributions) for open-domain applications. We identified and defined the privacy concerns, constructed the first open-domain textual QA benchmark that requires retrieval over multiple data distributions, and evaluated state-of-the-art retrievers in this setting.
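
As a hypothetical illustration of that evaluation setup (not the paper's actual benchmark or code), one can report retrieval quality separately per privacy scope to expose gaps between the public and private data distributions; `retrieve` and the example records below are placeholders.

```python
def recall_at_k(examples, retrieve, k=5):
    """Fraction of questions whose gold document appears in the retriever's top k."""
    hits = sum(1 for ex in examples if ex["gold_doc_id"] in retrieve(ex["question"], k=k))
    return hits / len(examples)

def evaluate_by_scope(examples, retrieve, k=5):
    """Break out retrieval recall by the privacy scope (public vs. private) of the gold evidence."""
    by_scope = {}
    for scope in ("public", "private"):
        subset = [ex for ex in examples if ex["scope"] == scope]
        if subset:
            by_scope[scope] = recall_at_k(subset, retrieve, k=k)
    return by_scope  # a large public/private gap would indicate a distribution-shift problem

# Example with a trivial retriever that always returns the same public documents:
examples = [
    {"question": "when is the offsite?", "gold_doc_id": "email-17", "scope": "private"},
    {"question": "who founded Wikipedia?", "gold_doc_id": "wiki-3", "scope": "public"},
]
print(evaluate_by_scope(examples, lambda q, k: ["wiki-3", "wiki-8"], k=2))
```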

Finally, an important concern is whether foundation model pretraining data is collected through ethical and legal processes. This is top of mind for the research community, which is assembling data resources to improve future model versions --- check out Together's RedPajama and the Stanford Human Preferences datasets. These efforts are complementary to lines of research on differentially private pretraining.

We are just beginning to see broader excitement about, and evidence for the feasibility of, foundation models for privacy and personalization, with some lag after our initial proposal, and there’s lots more to be done. We’d love to hear your thoughts and feedback on these research directions. Feel free to reach out at simran@cs.stanford.edu!


  1. See recent work by Virginia Smith’s group at CMU showing that state-of-the-art personalized federated learning algorithms can significantly worsen performance compared to ignoring the private data altogether and just using the public resource (model) off the shelf!