Highlights

  • Mamba, however, is one of an alternative class of models called State Space Models (SSMs). Importantly, for the first time, Mamba promises similar performance (and crucially similar scaling laws) to the Transformer whilst being feasible at long sequence lengths (say, 1 million tokens). To achieve this long context, the Mamba authors remove the “quadratic bottleneck” in the Attention Mechanism. Mamba also runs fast - like “up to 5x faster than Transformer fast”. [Figure: Scaling Laws for Mamba vs other Language Models]
  • Mamba enjoys fast inference and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modelling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
  • We’re very much in the Transformer-era of history. ML used to be about detecting cats and dogs. Now, with Transformers, we’re generating human-like poetry, coding better than the median competitive programmer, and solving the protein folding problem. But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions. For this lookback, we cache detailed information about each token in the so-called KV cache.
  • This pairwise communication means a forward pass has O(n²) time complexity in training (the dreaded quadratic bottleneck), and each new token generated autoregressively takes O(n) time. In other words, as the context size increases, the model gets slower. To add insult to injury, storing this key-value (KV) cache requires O(n) space. Consequently, the dreaded CUDA out-of-memory (OOM) error becomes a significant threat as the memory footprint expands. If space were the only concern, we might consider adding more GPUs; however, with latency increasing quadratically, simply adding more compute might not be a viable solution. On the margin, we can mitigate the quadratic bottleneck with techniques like Sliding Window Attention or clever CUDA optimisations like FlashAttention. But ultimately, for super long context windows (like a chatbot which remembers every conversation you’ve shared), we need a different approach.
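A minimal sketch (toy dimensions, random vectors standing in for learned projections) of why autoregressive decoding with a KV cache scales the way it does: each new token attends over every cached key/value pair, so memory grows as O(n) and total work as O(n²).

```python
import numpy as np

d = 64  # head dimension (illustrative)

def attend(q, K, V):
    """Single-query attention over a cache of n past tokens: O(n) work per new token."""
    scores = K @ q / np.sqrt(d)               # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over all cached positions
    return weights @ V                         # (d,)

# Autoregressive decoding: the KV cache grows by one entry per token -> O(n) memory,
# and the total work over n tokens is 1 + 2 + ... + n -> O(n^2).
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(8):
    x_t = np.random.randn(d)                   # stand-in for the current token's hidden state
    k_t, v_t, q_t = x_t, x_t, x_t              # real models use learned W_k, W_v, W_q projections
    K_cache = np.vstack([K_cache, k_t])
    V_cache = np.vstack([V_cache, v_t])
    y_t = attend(q_t, K_cache, V_cache)        # cost of this call grows linearly with t
```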
  • Fundamentally, all good ML architecture backbones have components for two important operations:
    1. Communication between tokens
    2. Computation within a token [Figure: The Transformer Block]
  • In transformers, this is Attention (communication) and MLPs (computation). We improve transformers by optimising these two operations. We would like to substitute the Attention component with an alternative mechanism for facilitating inter-token communication. Specifically, Mamba employs a Control Theory-inspired State Space Model, or SSM, for Communication purposes while retaining Multilayer Perceptron (MLP)-style projections for Computation. [Figure: The Mamba Block] Just as a Transformer is made up of stacked Transformer blocks, Mamba is made up of stacked Mamba blocks, as above.
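To make the division of labour concrete, here is a rough, hypothetical sketch of a Mamba-style block: an SSM handles communication along the sequence while gated projections handle per-token computation. The names (`mamba_block`, `W_in`, etc.), the placeholder SSM, and the omission of the causal convolution and normalisation are all simplifications, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(x, W_in, W_gate, W_out, ssm):
    """Sketch of one block: the SSM provides inter-token communication,
    the projections and gating provide MLP-style per-token computation."""
    u = silu(x @ W_in)           # per-token computation (project up)
    g = silu(x @ W_gate)         # gating branch
    y = ssm(u)                   # communication along the sequence dimension
    return (y * g) @ W_out + x   # gate, project back down, residual connection

# Toy usage with a placeholder "SSM" (a causal running mean) just to check shapes.
d_model, d_inner, seq_len = 16, 32, 10
W_in, W_gate = rng.normal(size=(d_model, d_inner)), rng.normal(size=(d_model, d_inner))
W_out = rng.normal(size=(d_inner, d_model))
placeholder_ssm = lambda u: np.cumsum(u, axis=0) / np.arange(1, len(u) + 1)[:, None]
x = rng.normal(size=(seq_len, d_model))
out = mamba_block(x, W_in, W_gate, W_out, placeholder_ssm)   # (seq_len, d_model)
```

Stacking such blocks (with the placeholder replaced by a real selective SSM) gives the overall architecture, just as stacking Transformer blocks gives a Transformer.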
  • Motivating Mamba - A Throwback to Temple Run Imagine we’re building a Temple Run agent. It chooses if the runner should move left or right at any time. [Figure: Temple Run] To successfully pick the correct direction, we need information about our surroundings. Let’s call the collection of relevant information the state. Here the state likely includes your current position and velocity, the position of the nearest obstacle, weather conditions, etc.

    Claim 1: if you know the current state of the world and how the world is evolving, then you can use this to determine the direction to move. Note that you don’t need to look at the whole screen all the time. You can figure out what will happen to most of the screen by noting that as you run, the obstacles move down the screen. You only need to look at the top of the screen to understand the new information and then simulate the rest. [Figure: Temple Run] This lends itself to a natural formulation. Let h be the hidden state, relevant knowledge about the world. Also let x be the input, the observation that you get each time. h’ then represents the derivative of the hidden state, i.e. how the state is evolving. We’re trying to predict y, the optimal next move (right or left).

  • Now, Claim 1 states that from the hidden state h, h’, and the new observation x, you can figure out y. More concretely, h, the state, can be represented as a differential equation (Eq 1a): h’(t) = \mathbf{A}h(t) + \mathbf{B}x(t). Knowing h allows you to determine your next move y (Eq 1b): y(t) = \mathbf{C}h(t) + \mathbf{D}x(t). The system’s evolution is determined by its current state and newly acquired observations. A small new observation is enough, as the majority of the state can be inferred by applying known state dynamics to its previous state. That is, most of the screen isn’t new, it’s just a continuation of the previous state’s natural downward trajectory. A full understanding of the state would enable optimal selection of the subsequent action, denoted as y. You can learn a lot about the system dynamics by observing the top of the screen. For instance, increased velocity of this upper section suggests an acceleration of the rest of the screen as well, so we can infer that the game is speeding up. In this way, even if we start off knowing nothing about the game and only have limited observations, it becomes possible to gain a holistic understanding of the screen dynamics fairly rapidly.
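Here is a small numerical sketch of Eq 1a/1b with toy, illustrative matrices and a crude Euler integrator (nothing here is from the paper; it just makes the roles of A, B, C, D concrete):

```python
import numpy as np

# Toy matrices for a 2-dimensional hidden state with scalar input and output.
A = np.array([[0.0, 1.0], [-1.0, -0.1]])   # state dynamics: how the world evolves on its own
B = np.array([[0.0], [1.0]])               # how the new observation enters the state
C = np.array([[1.0, 0.0]])                 # how the state is read out as a prediction
D = np.array([[0.0]])                      # direct input-to-output path (skip connection)

def x(t):
    return np.array([[np.sin(t)]])          # a toy continuous input signal

h = np.zeros((2, 1))
dt = 0.01
for step in range(1000):
    t = step * dt
    h_dot = A @ h + B @ x(t)                # Eq 1a: h'(t) = A h(t) + B x(t)
    h = h + dt * h_dot                      # crude Euler integration of the ODE
    y = C @ h + D @ x(t)                    # Eq 1b: y(t) = C h(t) + D x(t)
```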
  • Here, state refers to the variables that, when combined with the input variables, fully determine the future system behaviour. In theory, once we have the state, there’s nothing else we need to know about the past to predict the future. With this choice of state, the system is converted to a Markov Decision Process. Ideally, the state is a fairly small amount of information which captures the essential properties of the system. That is, the state is a compression of the past.
  • Discretisation - How To Deal With Living in a Quantised World Okay, great! So, given some state and input observation, we have an autoregressive-style system to determine the next action. Amazing! In practice though, there’s a little snag here. We’re modelling time as continuous. But in real life, we get new inputs and take new actions at discrete time steps.
  • We would like to convert this continuous-time differential equation into a discrete-time difference equation. This conversion process is known as discretisation. Discretisation is a well-studied problem in the literature. Mamba uses the Zero-Order Hold (ZOH) discretisation. To give an idea of what’s happening, morally speaking, consider a naive first-order approximation.
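As a sketch of the difference, assuming a step size Δ and using the textbook formulas (the paper’s exact parameterisation differs in details): the naive first-order (Euler) approximation versus the Zero-Order Hold.

```python
import numpy as np
from scipy.linalg import expm

def discretise_euler(A, B, delta):
    """Naive first-order approximation: h_t = (I + delta*A) h_{t-1} + delta*B x_t."""
    return np.eye(A.shape[0]) + delta * A, delta * B

def discretise_zoh(A, B, delta):
    """Zero-Order Hold: hold x constant over each step of length delta and integrate exactly.
    A_bar = exp(delta*A);  B_bar = (delta*A)^{-1} (exp(delta*A) - I) delta*B  (assuming A invertible)."""
    A_bar = expm(delta * A)
    B_bar = np.linalg.solve(delta * A, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar
```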
  • Now, we can interpret the A, B, C, D matrices more intuitively:
    • A is the transition state matrix. It shows how you transition the current state into the next state. It asks, “How should I forget the less relevant parts of the state over time?”
    • B maps the new input into the state, asking, “What part of my new input should I remember?”
    • C maps the state to the output of the SSM. It asks, “How can I use the state to make a good next prediction?”
    • D is how the new input passes through to the output. It’s a kind of modified skip connection that asks, “How can I use the new input in my prediction?”
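Putting the discretised matrices together, the recurrent view of the SSM is just a loop over the sequence (a toy sketch, using Ā, B̄ from something like `discretise_zoh` above):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, D, xs):
    """Discrete-time SSM recurrence:
        h_t = A_bar h_{t-1} + B_bar x_t   (forget some old state, remember some new input)
        y_t = C h_t + D x_t               (read out a prediction, plus the skip connection)"""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_t in xs:                          # xs: sequence of (input_dim, 1) column vectors
        h = A_bar @ h + B_bar @ x_t
        ys.append(C @ h + D @ x_t)
    return ys
```

Note that the state h has a fixed size however long the sequence gets; this is where the efficiency comes from.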
  • At WWDC ‘97, Steve Jobs famously noted that “focusing is about saying no”. Focus is ruthless prioritisation. It’s common to think about Attention positively as choosing what to notice. In the Steve Jobs sense, we might instead frame Attention negatively as choosing what to discard. There’s a classic intuition pump in Machine Learning known as the Cocktail Party Problem. Imagine a party with dozens of simultaneous loud conversations: Question: How do we recognise what one person is saying when others are talking at the same time? Answer: The brain solves this problem by focusing your “attention” on a particular stimulus and hence drowning out all other sounds as much as possible. [Figure: The Cocktail Party Problem]
  • Transformers use Dot-Product Attention to focus on the most relevant tokens. A big reason Attention is so great is that you have the potential to look back at everything that ever happened in its context. This is like photographic memory when done right. Transformers (🤖) are extremely effective. But they aren’t very efficient. They store everything from the past so that they can look back at tokens with theoretically perfect recall. Traditional RNNs (🔁) are the opposite - they forget a lot, only recalling a small amount in their hidden state and discarding the rest. They are very efficient - their state is small. Yet they are less effective as discarded information cannot be recovered. We’d like something closer to the Pareto frontier of the effectiveness/efficiency tradeoff. Something that’s more effective than traditional RNNs and more efficient than transformers. [Figure: Pareto Frontier] The Mamba Architecture seems to offer a solution which pushes out the Pareto frontier of effectiveness/efficiency.
  • The Selection Mechanism Selectivity allows each token to be transformed into the state in a way that is unique to its own needs. Selectivity is what takes us from vanilla SSM models (applying the same A (forgetting) and B (remembering) matrices to every input) to Mamba, the Selective State Space Model. In regular SSMs, A, B, C and D are learned matrices - that is, \mathbf{A} = \mathbf{A}_{\theta} etc. (where θ represents the learned parameters). With the Selection Mechanism in Mamba, A, B, C and D are also functions of x. That is, \mathbf{A} = \mathbf{A}_{\theta(x)} etc.; the matrices are context dependent rather than static. [Figure: Mamba (right) differs from traditional SSMs by allowing the A, B, C matrices to be selective, i.e. context dependent (source)]
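Here is a rough sketch of what selectivity means in code, reduced to a single scalar channel with hypothetical weights (the actual model derives Δ, B and C via learned linear projections of the input, and the parameterisation details differ): the matrices applied at each step are now functions of the current input.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(xs, A, w_delta, w_B, w_C):
    """Toy selective SSM over one channel.
    xs: (seq_len,) scalar inputs; A: (N,) diagonal state dynamics (negative => decay).
    delta, B and C depend on each input x_t, so the recurrence is context dependent."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in xs:
        delta = softplus(w_delta * x_t + 1.0)   # input-dependent step size (> 0)
        B_t = w_B * x_t                         # input-dependent "what to remember"
        C_t = w_C * x_t                         # input-dependent "how to read out"
        A_bar = np.exp(delta * A)               # discretisation for a diagonal A
        h = A_bar * h + (delta * B_t) * x_t     # selective forgetting and remembering
        ys.append(float(C_t @ h))
    return np.array(ys)

# Toy usage (all values illustrative)
rng = np.random.default_rng(0)
N = 4
ys = selective_ssm(rng.normal(size=16), A=-np.abs(rng.normal(size=N)),
                   w_delta=0.5, w_B=rng.normal(size=N), w_C=rng.normal(size=N))
```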
  • Making A and B functions of x allows us to get the best of both worlds:
    • We’re selective about what we include in the state, which improves effectiveness vs traditional SSMs.
    • Yet, since the state size is bounded, we improve on efficiency relative to the Transformer. We have O(1), not O(n), space requirements and O(n), not O(n²), time requirements.
    The Mamba paper authors write: The efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity: or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension.
  • Humans (mostly) don’t have photographic memory for everything they experience within a lifetime - or even within a day! There’s just way too much information to retain it all. Subconsciously, we select what to remember by choosing to forget, throwing away most information as we encounter it. Transformers (🤖) decide what to focus on at recall time. Humans (🧑) also decide what to throw away at memory-making time. Humans filter out information early and often. If we had infinite capacity for memorisation, it’s clear the transformer approach is better than the human approach - it truly is more effective. But it’s less efficient - transformers have to store so much information about the past that might not be relevant. Transformers (🤖) only decide what’s relevant at recall time. The innovation of Mamba (🐍) is allowing the model better ways of forgetting earlier - it’s focusing by choosing what to discard using Selectivity, throwing away less relevant information at memory-making time.
  • Applying the Selection Mechanism does have its gotchas though. Non-selective SSMs (i.e. A, B not dependent on x) are fast to compute in training. This is because the component of y_t which depends on x_i can be expressed as a linear map, i.e. a single matrix that can be precomputed! For example (ignoring the D component, the skip connection): y_2 = \mathbf{C}\mathbf{B}x_2 + \mathbf{C}\mathbf{A}\mathbf{B}x_1 + \mathbf{C}\mathbf{A}\mathbf{A}\mathbf{B}x_0. If we’re paying attention, we might spot something even better here - this expression can be written as a convolution. Hence we can apply the Fast Fourier Transform and the Convolution Theorem to compute this very efficiently on hardware as in Equation 3 below.
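For the non-selective case this can be sketched directly (toy code: precompute the kernel K = (CB, CAB, CAAB, …) once, then the whole output sequence is one causal convolution, which scipy computes via the FFT):

```python
import numpy as np
from scipy.signal import fftconvolve

def ssm_kernel(A_bar, B_bar, C, length):
    """Precompute K = (C B, C A B, C A^2 B, ...) for a non-selective (LTI) SSM."""
    K, A_power_B = [], B_bar
    for _ in range(length):
        K.append(float(C @ A_power_B))
        A_power_B = A_bar @ A_power_B
    return np.array(K)

def ssm_conv(xs, A_bar, B_bar, C):
    """y_t = sum_i K_i * x_{t-i}: the whole sequence in one FFT-based convolution."""
    K = ssm_kernel(A_bar, B_bar, C, len(xs))
    return fftconvolve(xs, K)[: len(xs)]     # truncate to keep the convolution causal
```

With selectivity, the kernel is no longer fixed across positions, so this convolutional shortcut is lost; Mamba instead recovers training speed with a hardware-aware parallel scan.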
  • Machine Learning for Political Economists - How Large Should The State Be? The Mamba authors write, “the efficiency vs. effectiveness tradeoff of sequence models is characterised by how well they compress their state”. In other words, like in political economy, the fundamental problem is how to manage the state.
    🔁 Traditional RNNs are anarchic: they have a small, minimal state. The size of the state is bounded. The compression of state is poor.
    🤖 Transformers are communist: they have a maximally large state. The “state” is just a cache of the entire history with no compression. Every context token is treated equally until recall time.
    🐍 Mamba has a compressed state… but it’s selective about what goes in. Mamba says we can get away with a small state if the state is well focused and effective. [Figure: Language Models and State Size]
  • Information Flow in Transformer vs Mamba How do Transformers know anything? At initialisation, a transformer isn’t very smart. It learns in two ways:
    1. Training data (Pretraining, SFT, RLHF etc)
    2. In-context data
    Training Data: Models learn from their training data. This is a kind of lossy compression of input data into the weights. We can think of the effect of pretraining data on the transformer kinda like the effect of your ancestors’ experiences on your genetics - you can’t recall their experiences, you just have vague instincts about them.
    In-Context Data: Transformers use their context as short-term memory, which they can recall with ~perfect fidelity. So we get In-Context Learning, e.g. using induction heads to solve the Indirect Object Identification task, or computing Linear Regression.
    Retrieval: Note that Transformers don’t filter their context at all until recall time. So if we have a bunch of information we think might be useful to the Transformer, we filter it outside the Transformer (using Information Retrieval strategies) and then stuff the results into the prompt. This process is known as Retrieval Augmented Generation (RAG). RAG determines relevant information for the context window of a transformer. A human with the internet is kinda like a RAG system - you still have to know what to search but whatever you retrieve is as salient as short-term memory to you.
  • Information Flow for Mamba Training Data acts similarly for Mamba. However, the lines are slightly blurred for in-context data and retrieval. In-context data for Mamba is compressed/filtered similarly to retrieval data for transformers. This in-context data is also accessible for look-up like for transformers (although with somewhat lower fidelity). [Figure: The Information Flow in Mamba] Transformer context is to Mamba states what short-term memory is to long-term memory. Mamba doesn’t just have “RAM”, it has a hard drive.
  • Swapping States as a New Prompting Paradigm Currently, we often use RAG to give a transformer contextual information. With Mamba-like models, you could instead imagine having a library of states created by running the model over specialised data. States could be shared kinda like LoRAs for image models. For example, I could do inference on 20 physics textbooks and, say, 100 physics questions and answers. Then I have a state which I can give to you. Now you don’t need to add any few-shot examples; you simply ask your question. The in-context learning is in the state. In other words, you can drag and drop downloaded states into your model, like literal plug-in cartridges. And note that “training” a state doesn’t require any backprop. It’s more like a highly specialised one-pass fixed-size compression algorithm. This is unlimited in-context learning applied at inference time for zero additional compute or latency.
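Purely as an illustration of the workflow (everything here is hypothetical: `RecurrentLM`, `step`, and the file format are made-up names, not a real Mamba API), “training” a state is just one forward pass, and the result is a fixed-size array you could share:

```python
import numpy as np

class RecurrentLM:
    """Stand-in for a recurrent/SSM language model with a fixed-size hidden state."""
    def __init__(self, state_size):
        self.state_size = state_size

    def initial_state(self):
        return np.zeros(self.state_size)

    def step(self, state, token):
        # Placeholder update; a real model would apply its selective SSM layers here.
        return np.tanh(state + token)

def build_state(model, corpus_tokens):
    """One forward pass over e.g. textbooks + Q&A examples; no backprop required."""
    state = model.initial_state()
    for tok in corpus_tokens:
        state = model.step(state, tok)
    return state                                   # fixed-size, shareable "cartridge"

model = RecurrentLM(state_size=8)
physics_state = build_state(model, corpus_tokens=np.random.randn(100))
np.save("physics_state.npy", physics_state)        # share the state
resumed = np.load("physics_state.npy")             # someone else resumes from it and just asks their question
```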
  • The structure of an effective LLM call goes from…
    1. System Prompt
    2. Preamble
    3. Few-shot examples
    4. Question
    …for Transformers, to simply…
    1. Inputted state (with problem context, initial instructions, textbooks, and few-shot examples)
    2. Short question
    …for Mamba. This is cheaper and faster than few-shot prompting (as the state is infinitely reusable without inference cost). It’s also MUCH cheaper than finetuning and doesn’t require any gradient updates. We could imagine retrieving states in addition to context.
  • Mamba & Mechanistic Interpretability Transformer interpretability typically involves:
    1. understanding token relationships via attention,
    2. understanding circuits, and
    3. using Dictionary Learning for unfolding MLPs. Most of the ablations that we would like to do for Mamba are still valid, but understanding token communication (1) is now more nuanced. All information moves between tokens via hidden states instead of the Attention Mechanism, which can “teleport” information from one sequence position to another. For understanding in-context learning (ICL) tasks with Mamba, we will look to intervene on the SSM state. A classic in-context learning task is Indirect Object Identification, in which a model has to finish a paragraph like:
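As a very rough illustration of what “intervening on the SSM state” could look like (hypothetical model interface; none of these names come from an actual Mamba codebase), one could cache states from a clean run and patch them into a corrupted run, in the spirit of activation patching:

```python
import numpy as np

def run_with_state_patch(model, tokens, patch_state=None, patch_position=None):
    """Run a recurrent model, optionally overwriting its hidden state at one position.
    If patching in the clean-run state restores the correct IOI answer, that state
    evidently carries the relevant "who is the indirect object" information."""
    state = model.initial_state()
    logits = None
    for pos, tok in enumerate(tokens):
        state, logits = model.step(state, tok)     # hypothetical step(state, token) -> (state, logits)
        if patch_state is not None and pos == patch_position:
            state = patch_state                    # overwrite the compressed history here
    return logits

# Hypothetical usage:
# clean_states = record_states(model, clean_tokens)              # cache per-position states
# patched = run_with_state_patch(model, corrupted_tokens,
#                                patch_state=clean_states[10], patch_position=10)
```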
  • Mamba-like models are likely to excel in scenarios requiring extremely long context and long-term memory. Examples include:
    • Processing DNA
    • Generating (or reasoning over) video
    • Writing novels
    An illustrative example is agents with long-term goals. Suppose you have an agent interacting with the world. Eventually, its experiences become too much for the context window of a transformer. The agent then has to compress or summarise its experiences into some more compact representation. But how do you decide what information is the most useful as a summary? If the task is language, LLMs are actually fairly good at summaries - okay, yeah, you’ll lose some information, but the most important stuff can be retained. However, for other disciplines, it might not be clear how to summarise. For example, what’s the best way to summarise a 2-hour movie? Could the model itself learn to do this naturally, rather than relying on a hacky workaround like trying to describe the aesthetics of the movie in text? This is what Mamba allows. Actual long-term memory. A real state where the model learns to keep what’s important. Prediction is compression - learning what’s useful to predict what’s coming next inevitably leads to building a useful compression of the previous tokens.
  • The Best Collab Since Taco Bell & KFC: 🤖 x 🐍 The Mamba authors show that there’s value in combining Mamba’s long context with the Transformer’s high fidelity over short sequences. For example, if you’re making long videos, you likely can’t fit a whole movie into a Transformer’s context for attention. You could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency.
    This isn’t the end for Transformers. Their high effectiveness is exactly what’s needed for many tasks. But now Transformers aren’t the only option. Other architectures are genuinely feasible. So we’re not in the post-Transformer era. But for the first time, we’re living in the post-only-Transformers era. And this blows the possibilities wide open for sequence modelling with extreme context lengths and native long-term memory.