In The ChatGPT Era, Your Data is More Valuable Than Ever

Chris Ré.

Understanding where ChatGPT, Large Language Models and Foundation Models might go and why they matter is the question on everyone’s mind in industry and academia. The next generation of these models will know more about your business and you. As Alex Ratner puts it, we’re moving from GPT-X to GPT-You. But how might we get there? Why do people think this is possible now? If data is so important, why do I only hear about models? What are the main trends to keep an eye on? Why are we so confident that the value is in your data?

From Models to Data.

One way to understand what happens next is to look at where the valuable raw materials are today and, once those materials become widely available, where the next scarcity develops. For many years, AI was model-centric. This changed in the GPT era: while there are many differently branded models, they are all the same class of model underneath (or architecture, in nerd parlance). These models are standard, the platforms to train them are standard, and while there will certainly be modeling advances, their quality depends on only one asset: data. This was not the case even a few years ago: you needed specialized teams with deep knowledge of AI folklore and the right tattoos to successfully build these models. This commoditization has been happening even more rapidly over the last few months, driven by open-source models.

So the scarce quantity is data: insight about and knowledge of your business. But it’s not the expressly created manual survey data of the past; it’s data that is effectively the exhaust of the enterprise. In classical machine learning, simply massaging this data into the right form was a herculean task. As a result, only the cleanest data could be used for machine learning, and usually at eye-watering expense. Now we can learn from the huge array of digital exhaust that is created (and cheaply captured) with every interaction between you and your customer, every invoice, chat, email, and more, without much manual intervention. This shifts the challenge to figuring out which of this ocean of data you should use, for what purpose, and how to measure the success of your task, all in the most cost-effective way possible.

More succinctly, data in artificial intelligence has moved from being important to being central. Two drivers stand out:

  • It’s Easier Than Ever to Build Models. The ingredients for building a foundation model are well established, and the recipes are easy to follow. As I noted before, while most science struggles with a reproducibility crisis, we have an anti-reproducibility crisis in foundation models: it might be too easy to build these models. We think this trend will only accelerate thanks to widely available, high-quality open-source models and recipes.
  • Many AIs, Not One Monolithic AI. Here is a data-based reason why we think there won’t be one GPT to rule them all, and the world will instead have many AIs. Data is just a recording of your business. So,
    • If your data has no value, your business doesn’t either.
    • If your business has value, you probably won’t give your data to a single monolithic model that is available to your competitors.

As a result, organizations are likely to build their own AI to protect and enhance their businesses. Protecting your IP means protecting your data, so it seems likely that you’ll protect your digital exhaust.1

The Rise of Open Source Data and Models. This week, with friends at Together, we helped build RedPajama: high-quality models and datasets inspired by Meta’s LLaMA (and by story-reading parents everywhere!). Once high-quality data was available, we saw a proliferation of new models, several per week this past quarter, with lovely “it’s so over” Twitter taglines abounding (I liked “huge, if true” better). More importantly, these datasets and models are going to be free for commercial use, and we expect they will unlock a huge number of research and enterprise use cases.
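As a rough illustration of how accessible this data now is, here is a minimal sketch of streaming a slice of the RedPajama sample data with the Hugging Face datasets library. The hub identifier and field name are assumptions about how the release is hosted, so check the dataset card before relying on them.

```python
# A minimal sketch, not an official recipe: stream a few RedPajama sample
# documents with the Hugging Face datasets library. The hub id
# "togethercomputer/RedPajama-Data-1T-Sample" and the "text" field are
# assumptions -- verify them on the dataset card.
from datasets import load_dataset

sample = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",
    split="train",
    streaming=True,  # avoid downloading the full sample to disk
)

for i, doc in enumerate(sample):
    print(doc["text"][:200])  # peek at the raw "digital exhaust"
    if i >= 2:
        break
```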

We’re relatively early in this wave, and we’ve coined the phrase “the Linux moment for AI” to capture our view of what’s going on and what might happen. We will soon have fully open-source AI models of state-of-the-art quality, as powerful as the most powerful AI systems today. Even OpenAI is starting to say that the race for ever-larger models is coming to a close. As a result, training models at existing sizes will rapidly commoditize. So the next questions are: where does the race move next, and what are the consequences of open, more cost-effective AI models?

Our view is that once freely available general-purpose models are no longer the limiting reagent, the era of GPT-X gives way to GPT-You. Technically, this means the world will be more data-centric than ever before. The barriers to refining and using your data have never been lower; using it well and cost-effectively will remain a major challenge, but one tied more directly to value creation than in the previous generation.

If we’re right, a few key trends will become major.

  • ChatGPT to EnterpriseGPTs and GPT-You. We’ve been writing about how we think we’ll move from chat and human-in-the-loop systems to these models powering the digital economy. Chat is only a fraction of the enterprise workflow; much more will be disrupted, including basics like form processing, data extraction, accounting, and more. These workloads require entirely new paradigms of batch processing with these models, and the cost structure of using them is a major barrier (see the sketch after this list). We’ll see these costs come down and AI become widespread.
  • A Copilot for Many Functions. Copilots will move from coding aids to ubiquitous assistants. Interactions will often be mediated by copilots: copilots for email, for design, for feedback, for data pipelines, and more. These copilots will become ubiquitous because they use your digital exhaust to supercharge your existing processes and become a more efficient version of you. From a technology standpoint, there will be a race to the bottom here: general-purpose open copilots will exist for popular tasks, and they will then be refined to better align with your preferences. To build a better you, you need your data.
  • The Realignment of Teams Due to AI. The next phase after supercharging existing pipelines is rethinking them entirely. This is most obvious in software: so many jobs in software were hard because small details required brainpower to get right, yet we really only needed them 80% right. I saw this in our work on the predecessor to Bootleg at Apple: a team of 20 FTE moved on to more exciting tasks, and only 0.5 FTE was needed to ship 10s of items, because we managed to package the workflow using the first wave of GPT-like systems. This is more true than ever in organizations where a large number of different people need to contribute small amounts of expertise to a valuable business pipeline; e.g., mortgages require 10s of people to contribute their particular expertise, but a lot of the work is routine. Anecdotally, large enterprises are awash in such pipelines in accounting, marketing, finance, and more. Many of those decisions are routine, especially once they are recorded in data, and the copilot or foundation model of the future can greatly speed these workflows along.
  • How You Deploy Software Changes: The Move to a Validate-then-Build World. As Chris Aberger (CEO of Numbers Station) says, we’ve moved from a build-and-validate world to a validate-then-build world. This is a huge change in how your (internal or external) customers are going to buy software and interact with you: they’ll want to see and touch the service, and you’ll be able to provide it easily and cheaply as a prototype. So much of AI and software deployment is held back because the specification of what someone is building doesn’t match the intended use case, and the project fails to deliver value. This offers the ability to align goals and outcomes far more cheaply than we could before. The ability to rapidly prototype with this new generation of software enables entirely new delivery models that should be dramatically more efficient. Customers can use demonstrations during brief scoping conversations to confirm their suspicions about what might be in the data or how people will use the software. This rapid turnaround will enable people to deploy software in places where it is currently not possible.
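To make the batch-processing point in the first trend above a bit more concrete, here is a minimal sketch of extracting structured fields from invoice text in batch with an open model via the Hugging Face transformers pipeline. The checkpoint, prompt, and invoice snippets are placeholders chosen for illustration, not a recommendation.

```python
# A minimal sketch of batch-style data extraction with an open model.
# The checkpoint below is a placeholder -- swap in whatever open,
# instruction-tuned model you actually use.
from transformers import pipeline

generator = pipeline("text-generation", model="databricks/dolly-v2-3b")

invoices = [
    "Invoice #1042 from Acme Corp, due 2023-06-01, total $1,250.00.",
    "Invoice #1043 from Globex, due 2023-06-15, total $980.50.",
]

prompts = [
    f"Extract the vendor, due date, and total from this invoice as JSON:\n"
    f"{text}\nJSON:"
    for text in invoices
]

# The pipeline accepts a list of prompts, so the whole batch runs in one call.
results = generator(prompts, max_new_tokens=64, do_sample=False, return_full_text=False)
for result in results:
    print(result[0]["generated_text"])
```

The point is less the specific model than the shape of the workload: many documents, one repeated instruction, and costs that scale with tokens rather than with human hours.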

This is not an exhaustive list, and it’s certainly speculative, but all of these trends have one thing in common that seems too big to miss: they are centered on the data. We’ve had decades of treating your data as oil that required expensive refining capability; in the next world, refineries are dramatically cheaper. Excellence and talent still matter in this world, to be cost-effective, to obtain the highest quality, and more, but the balance in these projects shifts to those who can most cleverly use their understanding of their own business, which now, more than ever, means cleverly using all of their data. That’s why we think the biggest platform trend in the era of ChatGPT is data.

Closing Note: A Flight to Quality Data

To capture that value, you’ll still need tools, supercharged by these new AI models, to understand and refine your data and to validate that your insights are correct. These models do have issues with hallucinations, missing data, corrupted data, and more. It’s still challenging to learn from structured data or to obtain high accuracy in a cost-effective way, and these models are still much less efficient than they could be. As a result, tools like Meerkat for validation and Snorkel for building the most cost-effective data refineries will be increasingly valuable. These tools have pioneered techniques to build and validate these refineries as quickly as they come up.
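As a rough sketch of what one step of such a data refinery can look like, here is a small example in the spirit of Snorkel’s labeling-function API, combining a couple of noisy heuristics into training labels. The toy messages and heuristics are invented for illustration.

```python
# A minimal sketch in the spirit of Snorkel's labeling-function API; the toy
# messages and heuristics below are invented for illustration.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_offer(x):
    # Weak heuristic: promotional language suggests spam.
    return SPAM if "free offer" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_invoice(x):
    # Weak heuristic: invoice references look like legitimate business mail.
    return NOT_SPAM if "invoice #" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": [
    "Claim your free offer today!",
    "Attached is invoice #1042 for last month.",
    "Free offer inside: click now!",
    "Please pay invoice #1043 by Friday.",
]})

# Apply the labeling functions, then combine their noisy votes into labels.
L_train = PandasLFApplier([lf_contains_offer, lf_mentions_invoice]).apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100, seed=123)
print(label_model.predict(L_train))
```

Heuristics like these encode exactly the kind of business knowledge, your digital exhaust, that makes the refinery cheap to build and quick to validate.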

Footnotes


  1. There are also a few other important factors here, like the cost of running and managing these models, which we may focus on in a later post. As we’ve mentioned before, it’s not clear you want your point-of-sale device to know French literature, if you’re paying for the summer in France to learn it! In addition, the cost to maintain these models and keep them up to date can be staggering; if your business is dynamic, you probably need control of this data.