An Unserious Person’s Take on Axiomatic Knowledge in the Era of Foundation Models

Chris Ré

TLDR; This post tries to explain why we started the work that led to Evo, which recently was on the cover of Science–thanks to a host of wonderful collaborators at Arc [Brian Hie, Patrick Hsu]. These are some rantings about things that interest me about axiomatic knowledge and foundation models, and why I think foundation models for science are philosophically interesting. Do not take it seriously, I am not a serious person.

The beauty of axioms and the messiness of AI

I’ve exalted the axiomatic and classical mathematical ways of describing knowledge; they felt true, deep, and interesting. I grew up as a math undergrad, and I remain astonished at how so much math follows from so few axioms. It’s stunning, beautiful, and interesting. But in AI, I’ve observed that more closely adhering to our intellectual siblings wasn’t always a recipe for more success. Foundation models are tools that provide a new way to store and access knowledge and data: messy, imperfect, but also pretty fun. A common refrain is that these new systems must look like the old generation of knowledge systems (clean, predictable, formally reliable) to be taken seriously… but I’m not so sure… that didn’t happen with web search: we didn’t make it return exact results like a database system, but it’s still incredibly useful. One path is to make foundation models reliable, and there is a lot of interest in that–awesome! This is a path we’re excited about too. However, another path that we’re pursuing is to embrace the messy unknown of the pattern matching tool we call foundation models. Despite our less-than-axiomatic understanding of these models, could we still use them as tools to benefit different areas? I’ll explain why Eric and Michael looked at DNA foundation models, and then try to articulate what I’m excited about next–and dress it up as something of philosophical import:

"What is the value of axiomatic knowledge?"

I hope that thinking about this question leads us to areas where foundation models break, where we learn something new, and where, hopefully, we question some of our own dogma…

Another way to look at it is that this article is me wrestling with professional demons, trying to understand my continued failures and to be a better advisor for our next project. As I write this, I realize it’s likely doomed from the start. Here we go!

My love of Logic and Math didn’t obviously lead to systems that worked.

When I got into AI, one of the things that was most interesting to me was fusing knowledge and statistical reasoning. Back then, this meant graphical models defined by logical rules in frameworks like Markov Logic. The idea was appealing because it felt like there were two modes of thinking, logical and statistical, and that a framework that fused them could let us build a system combining the best of both worlds. Awesome! It was super interesting research, but it didn’t build the alien artifacts that I was dreaming of, so we kept searching…

When we were building our first AI systems, Tuffy and DeepDive, we started seeing that stacking embeddings and scaling up seemed to beat our clever techniques. My first PhD student, Feng Niu, was way ahead here; I learned a ton from our discussions (even more now!). This was ca. 2010; at this point in CS, it felt like logic was a core element of computing: we taught things like automata. It was exalted as a pure–even holy–form of reasoning that could frame computing questions even in the absence of filthy machines.1 Why did we feel that way in this little area? Well, I think because most of the people in our field had a mathematical mindset, and so our best and favorite example of knowledge was mathematical. And most of it, which I loved so much, had been written down in beautiful axiomatic form. We were taught to revel in the fact that all mathematics flowed from such simple axioms. And as an aside, it continues to blow my mind. It’s just so damn cute. We felt that our machine learning systems needed to take advantage of our hard-won axiomatic knowledge. But what if this bias towards cute was misleading…2

Now, this bias toward mathematical envy was also true in machine learning at the time. Due to hanging out with the wrong people, I became convinced I was actually doing machine learning research…3 In 2010, my first real machine learning paper was Hogwild!, under the tutelage of the aforementioned bad influences. At this time, the Bayesian high priesthood of statistics and convex analysis ruled the land. They intimidated us using fancy integrals, diagrams–and pure oozing confidence. In contrast, neural nets were used by monsters, maniacs, and miscreants… jeers such as, “don’t cast your eyes on the deep learning people hiding in Neurips workshops with working demos lest you fall into sin!” were forthcoming… Well, maybe not, but I used to drink then, and my point is that we gave the neural nets folks less love than we give them now… and mostly, I thought it was a funny way to write it.

What's the mean and variance of the data behind this figure? Should we even care? The result gives us joy; maybe we should embrace that.

My point is that this state of affairs caused a lot of reflection for me: how could the orthodoxy have been so misleading? Part of the reason is that we inherited the central dogma of statistics. Stats people want to do inverse probability; they want to recover a true model of the universe from observational data–and they are good at it! However, they were not directly concerned with the predictions of the model per se… only in that it seemed obvious that a right-thinking person should want the predictions of the right model. Predictions from a not-right model (like a neural net)? You must be courting a fallen angel. As a result, we obsessed over what our smarter siblings in stats told us: that we needed to find the right model before we could predict anything. We worried a lot about this one true model we could poke and say was correct.4

The point I’m poorly making is that, again, we had this really compelling aesthetic central dogma that appealed to me–maybe not as much as, say, logical axioms, but nonetheless it did appeal to me. It felt certain, warm, and comfortable in a land of chaotic and terrible data. And I don’t think I was alone in this feeling… but like the litany of NP relaxation algorithms in 2000s papers, or those Bayesian tales, we didn’t quite make any alien technology appear in spite of lots of trying. It was all kind of janky and broken. And worse, the nasty things, the truly immoral deep learning things we were doing… well, they seemed to work a bit better. For shame! Recall that the devil makes initial sin pleasurable for a reason! Or in this case, maybe not… we knew the assumptions for the models were wrong, but we assumed, as the saying goes, that they were useful… maybe not always useful…

The third field I encountered was natural language processing (NLP). I showed up at a few events thanks to kind people like Chris Manning and Andrew McCallum… they all studied language, but in so many interesting ways. The conferences had individual rooms devoted to things like extraction, role labeling, named entity recognition, and deduplication… When foundation models came on the scene, many of these areas were advanced by the same artifact: a foundation model. It was pretty shocking to me! These models are broad and interesting, but far from perfect.

Now, you could take from the above that I think academic knowledge isn’t worth that much, or that I’m somehow looking down on the past the way we so often do, like this smug space dolphin:


"oh, those primitive people, how could they believe the earth was flat and didn’t know about scaling laws!"

I think the opposite. I continue to read these areas because they are intrinsically valuable; they are a record of brilliant people’s ideas that make our lives more interesting, better understood, and better able to grapple with the challenges of the world. In fact, I think they are so beautiful, so valuable, and so obviously written by geniuses, that we are often tempted to believe that their truths carry over outside their operating region. What I’m saying is that for some predictive problems, maybe being too smart gets in the way. That is, our hubris is really about the operating regimes of our knowledge, and in those cases our massive reverence seems to prevent us from making progress.5 These areas should inspire us, but we also shouldn’t be quite so certain that deeper understanding or application of older ideas is the key to forward progress on new questions.6

Closer to home, I spent a bunch of my career on techniques that were not super fruitful (no problem–I learned a ton and got to work with smart people). This is good in science–and our product is the people, not the research. But also, it’s kind of unsatisfying. What I’m observing is that in these cases I made a mistake: I didn’t question basic assumptions or the authority of the experts–even when those experts were operating outside their field of expertise. I was so blinded by brilliance that I thought brilliance transferred instead of inspired. Foundation models didn’t obviate linguistics; they gave us a new tool to ask and answer new types of questions and to provoke some new ideas. If anything, that made computational linguistics more interesting for some researchers. So I wondered where else this might be true, more broadly than my land of classifiers, NSFW chat bots, and GPUs going brrr… Maybe these new techniques give us a new kind of instrument or foundation to build on, and they actually deserve to be used. A new telescope, as the typical grant trope goes, to see farther than ever before! Or whatever, you get my point.

So where are we going to point our new telescope (foundation model)?

Well, I’ve always been really into medicine, health, and biology because, you know, it makes people’s lives better. That seems worth doing. It’s also true that I have said many times that I think the human body is disgusting, and I am uninterested in studying it. Like everyone else, I’m full of contradictions. When my colleagues in the department would nudge me to study comp bio, I shuddered. There were none of the axioms that I associated with depth of understanding–at least by my own weird aesthetic. What this article is really about is exploring the myriad ways I have been, and continue to be, wrong.

However, one of my many talented colleagues, Gill Bejerano, is brilliant and wise. He said something that always stuck with me:

"Biological knowledge is as vast as the pacific ocean…. but only three meters deep."

I think he meant to convey that there is so much to know in biology, but that in any given area we are just learning the basics. That really stuck with me. Also, these folks are inspiring. It pays to have interesting colleagues. What does this have to do with me or this blog? Well, this quote was why I nudged our folks toward HyenaDNA. Our caricature of foundation models is that they are broad, general reasoners that can do “obvious but not deep” stuff in a wide variety of areas… maybe, in light of Gill’s quote, biology is a place where breadth is the name of the game, and where foundation models could eventually be useful?

I don’t know anything about biology, so we needed to figure out how to spark people’s interest–our moves are shitpost and get people excited/angry/memed up. From a technical perspective, we had something that could finally do long context, and it made sense to look at an important big sequence that most people have heard of… DNA. It was also pretty cool that we didn’t need tokenizers, but that’s super in the nerd weeds. So could the story in linguistics play out in biology, using DNA as the substrate? There is information in there that’s not obvious to us, maybe partly because of the sheer vastness. Could we build a new foundation that was useful for basic questions in biology? I was clearly the wrong person to do any of this seriously, but I’m a certified shitposter. Eric Nguyen was super into the idea, and he, together with Michael Poli, trained HyenaDNA. They did all the work of making it work and had the vision of how and where it could work–students are great at turning shitposts into actual research, or at the very least not letting themselves get too distracted. They started this fire, and I’m excited to watch it burn? That sounds grimmer than I mean. Check out our blog for more details.7

It was a lot of fun. Over the last year, we were approached by a fair number of teams from fancy places with smart people who had very different ideas and incentives. It was very exciting, but the Arc people – and Brian Hie in particular – were clearly much better equipped than us to push this forward. We threw our lot in with them, and together hopefully did something fun [Arc blog, Hazy blog, biorxiv, youtube talk]. Brian is amazing and fun to work with, and he is again a pretty good reason to hang out around a university. He knows and cares about science. Helping to enable Brian is a professional treat, and we’re delighted that the work was featured on the cover of Science.

Back from DNA to Equations…


Should we expect dolphins (or shoggoths) to think the same as we do?

So where does this leave us? Well, I’m happy to help push Brian and Eric’s agenda – I’m a professor to be an enabler, not a leader – but I’m still worried about something else. Maybe axioms as knowledge are just too clever by half? They’re cute and elegant and really interesting, but maybe they’re just too cute for their own good… and why should science be cute? If this AI thing can do “obvious stuff”, why should it be our previously discovered, axiomatic obvious stuff? If it does, that’s wild… like mind-blowingly wild… it’s like finding a race of dolphins on a far-flung moon with libertarian views about taxes.8

So back to this… I’m really interested in when equational knowledge matters and when it doesn’t. Differential equations seem like a natural starting point. Ever since Kepler deduced laws of planetary motion from Brahe’s painstakingly accurate data – another case where new instruments yielded new paradigms of knowledge – differential equations have been the go-to tool for modeling everything from climate to financial data. Are there other interesting ways of mapping and understanding differential equations, inspired by foundation models' view of the world? PDEs and ODEs are representations of really complex systems with all kinds of constraints and information that are tough to write down axiomatically – and they are objects I love… so who knows.

Why might foundation models be interesting for the kind of modeling typically done with PDEs and ODEs? Does a PDE capture all the information we know is important about a system? For example, when looking at a problem, it’s not always clear which equation governs its behavior. PDEs have strict operating regimes, so they can require a lot of tuning and careful deployment to capture the important dynamics. The operating regime question might point to a deeper issue. Very roughly, Nancy Cartwright argues that physics is more about pattern matching than our mathematical upbringing would have us believe.9 Here is (loosely) one of her examples:

"You go to a window. You drop a bowling ball out of it. Since you know about gravity and Newton's laws, you feel confident that you know where that bowling ball will land. Awesome! … now you drop a feather…"

Those same forces are still there, but where does that feather go? You’re in an entirely new operating regime. In this case, the change in operating regime is obvious to us, but are the operating regimes obvious to us in a turbulence simulation? Are there forms of this issue in which there are not big operating regimes but an array of a few major ones? Are there more patterns to match? It might be tough for us to recognize or describe these vague patterns, but maybe newer foundation models can pull out and match those patterns with their high-dimensional goo? Or understand the regimes in a different, maybe more holistic way? That’s pretty optimistic, to be sure! In any event, Cartwright convinced this dum dum (me) that we should be open to the possibility that pattern matching engines, maybe based on future foundation models, may eventually have some unfair advantages over standard differential equations for modeling physical systems, so why not… This approach may seem ugly from an axiomatic viewpoint… but is it? I recall Yann LeCun saying in a long-ago workshop that backprop and SGD were beautiful and elegant, and he has a point… so which of these have lasting beauty? Why not both? … And it’s entirely possible this paradigm fails completely; how interesting would that be? No matter what, I’m interested to find out.

Imprecision about precision… or is there a problem with precision?

There is a ton of interesting work applying neural nets in this space (solving PDEs, learning solution operators, to name just a few directions), but industry has yet to really adopt them. Maybe it’s just a matter of time, but maybe there is also a gap. I didn’t know, so I poked around… one thing I hear in my conversations with folks who do this modeling for real is that the precision of neural net methods is a major obstacle to their wider adoption. This led us to examine something simpler: least squares. This is interesting for two reasons:

  • If you can solve least squares you can solve lots of ODEs (pseudospectral methods; a toy sketch follows this list), and it’s what Brahe’s numbers were pumped into… pretty basic stuff in this space!
  • Plenty of papers say foundation models can solve least squares problems in interesting ways (1, 2, these two great works look into how, among many others).
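To make the first bullet a bit more concrete, here is a toy sketch of the reduction (my own illustration, not anything from our papers): represent the solution of u'(x) = -u(x), u(0) = 1 as a polynomial, enforce the ODE at a grid of collocation points plus the initial condition, and pick the coefficients with one least-squares solve. Real pseudospectral solvers use better bases and collocation points (e.g. Chebyshev), but the basic move is the same.

    # Toy collocation sketch: solve u'(x) = -u(x), u(0) = 1 on [0, 1] via least squares.
    import numpy as np

    n_basis, n_pts = 10, 50
    x = np.linspace(0.0, 1.0, n_pts)

    # Monomial basis and its derivative evaluated at the collocation points.
    V = np.vander(x, n_basis, increasing=True)              # u(x)  ~ V @ c
    dV = np.hstack([np.zeros((n_pts, 1)),
                    V[:, :-1] * np.arange(1, n_basis)])     # u'(x) ~ dV @ c

    # Rows enforce the ODE residual u' + u = 0 at each point, plus u(0) = 1.
    A = np.vstack([dV + V, V[:1]])
    b = np.concatenate([np.zeros(n_pts), [1.0]])

    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(np.max(np.abs(V @ c - np.exp(-x))))               # error vs. the exact solution e^(-x)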

So we investigated it, but something basic seems to go a bit awry in our intended setting. Here is an illustrative plot from Jerry Liu…

(Left) At first, OLS training looks reasonable. (Right) At higher precision, the gap is significant.

On the left is the error from a dear friend’s paper solving OLS with a foundation model: the x-axis is the number of examples, and the y-axis is the error. Pretty amazing; it looks like these foundation models can learn least squares perfectly… they quickly get down to 0 error! But wait a second… on the right, we look at the error on the scale that optimization and scientific computing people use to measure error. In this light, the foundation models don’t look like they are solving it in a way those optimization folks would care about.
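For calibration, here is a small hedged sketch (mine, with synthetic data, not the actual experiment behind the plot) of what the classical baseline looks like on that scale: one call to np.linalg.lstsq on a hundred float32 samples lands near float32 machine precision, which reads as “zero” on a linear axis but is a concrete, nonzero number once the error axis is logarithmic.

    # Hedged sketch (not the experiment in the figure): baseline OLS error, measured properly.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 5
    X = rng.standard_normal((n, d)).astype(np.float32)      # synthetic Gaussian design
    w_true = rng.standard_normal(d).astype(np.float32)
    y = X @ w_true                                           # noiseless targets, for clarity

    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)            # "a line of numpy"
    rel_err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
    print(f"relative error: {rel_err:.2e}")                  # roughly 1e-7 to 1e-6 in float32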

What does this mean? These models don’t seem to solve equations in the way that we have come to expect from scientific computing. They clearly learn something like least squares… but maybe not exactly the same thing… and why shouldn’t they learn it to even higher precision? These models have billions of parameters, and they still get crushed, at even moderate precision, by a line of numpy with a hundred float32s. It feels kind of off… So great, there is some research to do! But what? Improving the precision of current foundation models’ reasoning is an obvious vector for research, and we’re trying it. We want to preserve foundation models’ generalization ability and also do it precisely… Jerry started thinking about what generalization in this context might mean. It’s still super early and raw, but we thought we’d share it.

Thank goodness it’s over… for you and me both!

So if you made it here, I am a tiny bit sorry–but really which one of the two of us is to blame?

  • Are we so sure cute math is a good bias for understanding? If there were some ugly idea that let us travel to other worlds, I’d be into it… even if that ugly idea just let us file our expense reports faster, that’d be a win too.
  • How important is axiomatic knowledge? I made this sound idiotic because that’s how I think, but I was really inspired by some amazing mathematical philosophers: Nancy Cartwright for one. She’s smarter and more careful than me; you should read her to gain perspective and clarity. In general, philosophers are pretty smart folks. If you’re interested in collaborating on training something bizarre (say, physics foundation models), let me know.
  • I vowed to write the weirder stuff that I actually think. It’s not for clicks and likes, but maybe someone much smarter does something useful with it–hopefully my students, but they are usually too smart to listen to me or take me seriously.

Let’s make those GPUs go brr and take a nihilist streak to all of knowledge and science. Who do we think we are? Honestly!

Acknowledgements

Several lab members gave helpful feedback in writing this post: Michael Zhang (especially for curating the beloved images), Eric Nguyen, Jerry Liu, Simran Arora, and Ben Viggiano. Ben Recht is one of the most interesting people I know, and many of the ideas here are due to our conversations.

Footnotes


  1. This was in my view driven by the desire of the theoretical community to be taken seriously by mathematicians. It struck me as odd that many pure mathematicians seem to use computing power more than theoretical computer scientists. Maybe computers are base, but I’m a trash person… I love them.
  2. As an aside, this always worries me about theoretical research–which I love: theory is necessarily driven much more by aesthetics, and much less by a pure comparison of rigor, than you might think… Beauty is in the eye of the beholder, which means it is harder for nature to slap us upside the head and say we’re being weird or wrong… or losing the plot. I’ve gone down too many blind alleys to blindly trust myself…
  3. Ben Recht and Steve Wright set me on this dastardly path. They are too awesome to stay mad at.
  4. Never mind that we often picked such a model out of a hat, or with analogies, or with poorly understood properties like surrogates for sparsity, i.e. it was cute to write down… it was still blessed, or, better yet, used something from their smarter siblings in math…
  5. In the same way, it’s not clear to me that someone who knows how to write backprop or tweak data mixtures for transformer stacks is obviously the best person to write policy… but what do I know… “never trust academics” is a good starting point.
  6. My point, which is poorly made here, is that there are regimes in which classical stats and Bayesian methods work and are correct… but those regimes weren’t all of AI’s problems–and they missed the operating regime of the applications that I was trying to build.
  7. There are also some news articles for fun that Eric likes [Forbes, Time, Marktechpost].
  8. The red-pilled dolphin is a long-running and not very funny meme in the lab. Don’t take it seriously. It’s not worth explaining, but I will: I just like the idea that a lefty graduate student is freaked out by their beloved dolphin’s political views… They’ve worked so hard to talk to that dolphin, and now it’s saying all kinds of things that they don’t like. It’s a weak metaphor for this article: we spent so much time with math, stats, and axioms, but our foundation models might be telling us something we don’t like… dolphins, in spite of being cute, are not always friendly creatures…
  9. h/t to Ben Recht for pointing me to her work. She’s great!