Many areas of machine learning require processing sequential data in an online fashion. For example, a time series may be observed in real time where the future needs to be continuously predicted, or an agent in a partiallyobserved environment must learn how to encode its cumulative experience into a state in order to navigate and make decisions. The fundamental problem in modeling longterm and complex temporal dependencies is memory: storing and incorporating information from previous time steps.
However, popular machine learning models suffer from forgetting: they are either built with fixedsize context windows (e.g. attention), or heuristic mechanisms that empirically suffer from from a limited memory horizon (e.g., because of “vanishing gradients”).
This post describes our method for addressing the fundamental problem of incrementally maintaining a memory representation of sequences from first principles. We will:

find a technical formulation of this problem that we can analyze mathematically, and derive a closedform solution with the HiPPO framework

see that our method can be easily integrated into endtoend models such as RNNs, where our framework both generalizes previous models, including the popular LSTM and GRU, and improves on them, achieving state of the art on permuted MNIST, a popular benchmark for longrange memory.

show how insights from the framework reveal methods with distinct theoretical properties – we highlight a particular model HiPPOLegS, which is computationally efficient, provably alleviates vanishing gradients, and is the first known method to display “timescale robustness”!
Our paper was accepted to Neurips 2020 as a Spotlight, and our code is publically available with PyTorch and Tensorflow implementations.
Online Function Approximation: A Formalism for Incremental Memory Representations
Our first insight is to move from discretetime to the continuoustime setting, which is often easier to analyze theoretically. We ask the following very natural question: given a continuous function (in one dimension) $f(t)$, can we maintain a fixedsize representation $c(t) \in \mathbb{R}^N$ at all times $t$ such that $c(t)$ optimally captures the history of $f$ from times $0$ to $t$?
However, this problem is not fully welldefined yet – we need to specify:

Quality of approximation: What is the “optimal approximation” of the function’s history? We need to specify a measure (or weight function) that tells us how much we care about every time in the past.

Basis: How can we compress a continuous function into a fixedlength vector? We can project the function onto a subspace of dimension $N$ and store the $N$ coefficients of its expansion in any basis. For simplicity, we will assume that we are working with the polynomial basis throughout this post.
Intuitively, we can think of the memory representation $c(t) \in \mathbb{R}^N$ as being the coefficient vector of the optimal polynomial approximation to the history of $f(t)$.
The HiPPO Framework (HighOrder Polynomial Projection Operator)
Notice that given a measure (and assuming the polynomial basis), the online function approximation problem is now fully specified!^{1} That is, given any input function $f(t)$, the desired coefficient vectors $c(t)$, which are our desired memory representation, are completely defined. The question remains – how do we calculate them?
The HiPPO framework formalizes this problem and provides machinery to compute the solution. Although the desired coefficients $c(t)$ are rather abstractly defined as the implicit solution to an approximation problem, there is amazingly a closedform solution that’s easy to compute. We’ll leave the technical details to the full paper, but we’ll note that they leverage classic tools from approximation theory such as orthogonal polynomials^{2}. In the end, the solution takes on the form of a simple linear differential equation, which is called the HiPPO operator:
\[\dot{c}(t) = A(t) c(t) + B(t) f(t)\]In short, the HiPPO framework takes a family of measures, and gives an ODE with closedform transition matrices $A(t), B(t)$. These matrices depend on the measure, and following these dynamics finds the coefficients $c(t)$ that optimally approximate the history of $f(t)$ according to the measure.
Instantiations of HiPPO
Figure 2 shows some concrete examples of HiPPO. We show two of the simplest family of measures, based off uniform measures. The translated Legendre measure on the left uses a fixedlength sliding window; in other words, it cares about recent history. On the other hand, the scaled Legendre measure uniformly weights the entire history up to the current time. In both cases, the HiPPO framework produces closedform formulas for the corresponding ODEs which are shown for completeness (the transition matrices are actually quite simple!).
From continuoustime to discretetime
There is one more detail called discretization. By using standard techniques for approximating the evolution of dynamical systems, the continuoustime HiPPO ODE can be converted to a discretetime linear recurrence. Additionally, this step allows extensions of HiPPO to flexibly handle irregularlysampled or missing data: simply evolve the system according to the given timestamps.
For the practitioner: To construct a memory representation $c_t$ of an input sequence $f_t$, HiPPO is implemented as the simple linear recurrence $c_{t+1} = A_t c_t + B_t f_t$ where the transition matrices $A_t, B_t$ have closedform formulas. That’s it!
Hippos in the wild: integration into ML models
At its core, HiPPO is a simple linear recurrence that can be integrated into endtoend models in many ways. We focus on a recurrent neural network (RNN) due to their connection to dynamics systems involving a state evolving over time, just as in HiPPO. The HiPPORNN is the simplest way to perform this integration:

Start with a standard RNN recurrence $h_{t} = \tau(h_{t1}, x_t)$ that evolves a hidden state $h_t$ by any nonlinear function $\tau$ given the input $x_t$

Project the state down to a lower dimension feature $f_t$

Use the HiPPO recurrence to create a representation $c_t$ of the history of $f_t$, which is also fed back into $\tau$
Special cases of the HiPPORNN
Those familiar with RNNs may notice this looks very similar to cell diagrams for other models such as LSTMs. In fact, several common models are closely related:

The most popular RNN models are the LSTM and GRU, which rely on a gating mechanism. In particular, the cell state of an LSTM performs the recurrence $c_{t+1} = \alpha_t c_t + \beta_t f_t$, where $\alpha_t, \beta$ are known as the “forget” and “input” gates. Notice the similarity to the HiPPO recurrence $c_{t+1} = A_t c_t + B_t f_t$. In fact, these gated RNNs can be viewed as as a special case of HiPPO with loworder (N=1) approximations and inputdependent discretization! So HiPPO sheds light on these popular models and shows how the gating mechanism, which was originally introduced as a heuristic, could have been derived.

The HiPPOLegT model, which is the instantiation of HiPPO for the translated Legendre measure, is exactly equivalent to a recent model called the Legendre Memory Unit^{3}. Our proof is also much shorter, and just involves following the steps of the HiPPO framework!
Elephants Hippos never forget
Let’s take a look at how these models perform on benchmarks. First, we test if HiPPO solves the problem it was designed to – online function approximation. Figure 4 shows that it can approximate a sequence of a million time steps with good fidelity. Keep in mind that this works while processing the function online with a limited budget of hidden units; it could have reconstructed the partial function at any point in time.
Second, we test on the standard Permuted MNIST benchmark, where models must process the input image one pixel at a time and output a classification after consuming the entire sequence. This is a classic benchmark for testing longterm dependencies in sequence models, since they must remember inputs from almost 1000 time steps ago.
Multiple instantiations of our HiPPO framework, including the HiPPOLegS and HiPPOLegT models described above, set stateoftheart over other recurrent models by a significant margin, achieving 98.3% test accuracy compared to the previous best of 97.15%. In fact, they even outperform nonrecurrent sequence models that use global context such as dilated convolutions and transformers. Full results can be found in Tables 4+5.
Timescale Robustness of HiPPOLegS
Lastly, we’ll explore some theoretical properties of our most interesting model, corresponding to the Scaled Legendre (LegS) measure. As motivation, the discerning reader may be wondering by this point: What’s the difference between different instantiations of HiPPO? How does the measure influence the model? Here are some examples of how intuitive interpretation of the measure translates into theoretical properties of the downstream HiPPO model:

Gradient bounds: Since this measure says that we care about the entire past, information should propagate well through time. Indeed, we show that gradient norms of the model decay polynomially in time, instead of exponentially (i.e. the vanishing gradient problem for vanilla RNNs).

Computational efficiency: The transition matrices $A_t$ actually have special structure, and the recurrence can be computed in linear instead of quadratic time. We hypothesize that these efficiency properties are true in general (i.e., for all measures), and are related more broadly to efficiency of orthogonal polynomials and their associated computations (e.g. [link to SODA paper])

Timescale robustness: Most interestingly, the scaled measure is agnostic to how fast the input function evolves; Figure 5 illustrates how HiPPOLegS is intuitively dilation equivariant.
The table shows results on a trajectory classification dataset with distribution shift between the training and test sequences (i.e., arising from time series being sampled at different rates at deployment); HiPPO is the only method that can generalize to new timescales!
Generalization  LSTM  GRUD  ODERNN  NCDE  LMU  HiPPOLegS 

100Hz > 200Hz  25.4  23.1  41.8  44.7  6.0  88.8 
200Hz > 100Hz  64.6  25.5  31.5  11.3  13.1  90.1 
Conclusion

The problem of maintaining memory representations of sequential data can be tackled by posing and solving continuoustime formalisms.

The HiPPO framework explains several previous sequence models as well as produces new models with cool properties.

This is just the tip of the iceberg  there are many technical extensions of HiPPO, rich connections to other sequence models, and potential applications waiting to be explored!
Try it out
PyTorch and Tensorflow code for HiPPO are available on GitHub, where the HiPPORNNs can be used as a dropin replacement for most RNNbased models. Closedform formulas and implementations are given for the HiPPO instantiations mentioned here, and several more. For more details, see the full paper.
Footnotes

A measure induces a Hilbert space structure on the space of functions, so that there is a unique optimal approximation  the projection onto the desired subspace. ↩

Examples of famous orthogonal polynomials include the Chebyshev polynomials and Legendre polynomials. The names of our methods, such as LegS (scaled Legendre), are based off the orthogonal polynomial family corresponding to its measure. ↩

The way we integrate the HiPPO recurrence into an RNN is slightly different, so the full RNN versions of the HiPPOLegT and Legendre Memory Unit (LMU) are slightly different, but the core linear recurrence is the same. ↩