LLM Inference Optimizations
Feb 27
KV Cache, Prefill and Decode
LLM inference is the procedure of generating new tokens autoregressively from a transformer model. For this blog we ignore hybrid architectures, and although we refer to LLM inference, our discussion applies to transformer architectures in general. The transformer is notable for its attention operator, a mechanism that mediates information exchange between tokens. The KV cache is the most fundamental object of interest in inference systems. Let us briefly recall why a KV cache exists. Consider a single attention head within a single transformer block. LLM token generation is autoregressive: starting from a list of prompt tokens, we generate tokens serially, each new token depending on all previous tokens. Experts might immediately point out that not all previous tokens are required, but let us ignore this clever idea (and many others: GQA, MLA, DSA) and focus on the simplest setting of full attention, in which case each new token indeed attends to all previous token positions.
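The autoregressive loop can be sketched as follows; here `model` is a stand-in (an assumption of this sketch, not a real API) for a full transformer forward pass that returns next-token logits over the vocabulary.

```python
import numpy as np

def generate(model, prompt, num_new_tokens, rng):
    """Autoregressive decoding in its simplest form: each new token is
    sampled conditioned on all tokens before it."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        logits = model(tokens)                 # attends to every prior position
        probs = np.exp(logits - logits.max())  # softmax over the vocabulary
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens
```

Note that `model` is called once per generated token; everything below is about making each of these calls cheap.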
Let us describe the decode phase of inference, namely how to go from token (position) $t$ to token $t+1$. When token $x_t$ is appended to the sequence and passed through the LLM to generate the next token, what happens at each attention head of each transformer block? The newly appended $t$-th token leads to a query vector $q_t = W_Q x_t$, which is the last column of the query matrix $Q$. The new thing that happens is that we need to take the dot products $q_t \cdot k_j$ of this vector with the key vectors $k_j = W_K x_j$ for all positions $j \le t$. Since the keys $k_1, \dots, k_{t-1}$ were computed when generating previous tokens, we save them in a key cache instead of recomputing them.
Next the dot products are normalized via the softmax function, $\alpha_j = \mathrm{softmax}_j(q_t \cdot k_1, \dots, q_t \cdot k_t)$, so that $\alpha_j \ge 0$ and $\sum_{j \le t} \alpha_j = 1$. Now these coefficients are used to take a linear combination of the value vectors, $o_t = \sum_{j \le t} \alpha_j v_j$, where $v_j = W_V x_j$. Since $v_1, \dots, v_{t-1}$ are computed when generating previous tokens, we save them in a value cache instead of recomputing them. Together, the key cache and the value cache make up the KV cache for LLM inference.
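The decode step just described can be written down in a few lines of NumPy. This is a minimal single-head sketch (no multi-head reshaping, no scaling subtleties beyond $1/\sqrt{d}$); the caches are plain Python lists for clarity.

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One decode step of a single attention head.

    x_t: (d,) embedding of the newly appended token.
    k_cache, v_cache: lists of (d,) key/value vectors for positions < t.
    Returns the attention output and the updated caches.
    """
    q_t = W_q @ x_t                 # query for the new position only
    k_t = W_k @ x_t                 # key for the new position
    v_t = W_v @ x_t                 # value for the new position
    k_cache.append(k_t)             # cached keys are never recomputed
    v_cache.append(v_t)
    K = np.stack(k_cache)           # (t, d)
    V = np.stack(v_cache)           # (t, d)
    scores = K @ q_t / np.sqrt(len(q_t))
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()            # softmax: alpha_j >= 0, sums to 1
    o_t = alpha @ V                 # linear combination of cached values
    return o_t, k_cache, v_cache
```

Without the cache, each step would recompute all $t$ keys and values; with it, only the new position's projections are computed.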
We now know how to go from token (position) $t$ to token $t+1$, but what about initialization, or in inference lingo, the prefill phase of inference: given a prompt of length $n$, how do we take the first autoregression step to generate the first new token at position $n+1$? For each attention head of each transformer block we need the keys and values of all tokens with positions $j \le n$. To produce the keys and values associated with the prompt we need to use the projection matrices $W_K$ and $W_V$ and multiply them with the input $X \in \mathbb{R}^{d \times n}$ (one column per prompt token). Without loss of generality let us focus on computing the key matrix $K = W_K X$ for the prompt, which has $d \times n$ entries, each computed as a dot product with $d$ products and $d$ sum accumulations. So the total number of operations is $\approx 2nd^2$. Loading and storing the matrices $X$, $K$ and $W_K$ requires $\propto (2nd + d^2)$ bytes of communication, where the proportionality constant is the number of bytes per parameter. So the computation-to-communication ratio of prefill (aka arithmetic intensity) scales in proportion to $\frac{2nd^2}{2nd + d^2} = \frac{2nd}{2n + d}$, which grows with the prompt length $n$ and saturates around $d$. Thus prefill of lengthy prompts has the characteristic of being compute bound.
Analogously for decode, the computation of a single query (and likewise key) vector is a matrix-vector product of shapes $d \times d$ and $d \times 1$, involving $\approx 2d^2$ operations and $\propto d^2$ bytes of IO, the arithmetic intensity being a constant. The QK product (and likewise PV, ignoring softmax) involves a matrix and a vector of shapes $t \times d$ and $d \times 1$, for $\approx 2td$ compute operations and $\propto td$ bytes of IO communication; this arithmetic intensity is likewise a constant. Both of these arithmetic intensities tend to a constant as the context position $t \to \infty$, and decode is thus communication bound. This is a rule-of-thumb analysis, where we have ignored the inverse proportional dependence of the ratio on precision (bytes per parameter), as well as the inference batch size, to which the decode intensity is proportional. It can thus be seen that low-precision, high-batch inference pushes decode into the compute-bound regime.
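A back-of-envelope calculation makes the contrast concrete. The sizes below are illustrative assumptions (fp16 precision, a width of 4096, a context of 8192 tokens), not tied to any particular model.

```python
# Rule-of-thumb arithmetic intensity (FLOPs per byte of IO),
# assuming fp16 (2 bytes/param) and illustrative sizes.
bytes_per_param = 2
d = 4096          # model width
n = 8192          # prompt length (prefill)

# Prefill projection K = W_k X: ~2*n*d^2 FLOPs.
# IO: load X and store K (n*d entries each) plus load W_k (d^2 entries).
prefill_flops = 2 * n * d * d
prefill_bytes = bytes_per_param * (2 * n * d + d * d)
prefill_intensity = prefill_flops / prefill_bytes

# Decode projection k_t = W_k x_t: ~2*d^2 FLOPs,
# IO dominated by loading W_k (d^2 entries) for a single token.
decode_flops = 2 * d * d
decode_bytes = bytes_per_param * (d * d + 2 * d)
decode_intensity = decode_flops / decode_bytes

print(f"prefill intensity ~ {prefill_intensity:.0f} FLOPs/byte")
print(f"decode  intensity ~ {decode_intensity:.2f} FLOPs/byte")
```

With these numbers prefill lands in the hundreds-to-thousands of FLOPs per byte (compute bound on modern accelerators), while decode sits near 1 FLOP per byte (communication bound), matching the scaling argument above.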
Speculative Decoding
Let $p$, $q$ be two probability distributions over a finite set $\Omega$. If $x \in \Omega$ satisfies $q(x) > 0$ then define $B_x$ to be the Bernoulli random variable with success probability $\min\left(1, \frac{p(x)}{q(x)}\right)$. Let $r$ be the real-valued function $r(x) = \max(0, p(x) - q(x))$. Let $p'$ be the distribution over $\Omega$ defined by $p'(x) = \frac{r(x)}{\sum_{y \in \Omega} r(y)}$. We can sample from $p$ using the distribution $q$ as follows.
1. Sample $x \sim q$.
2. If $p(x) \ge q(x)$ then accept $x$ as the sample from $p$.
3. If $p(x) < q(x)$, sample $b \sim B_x$.
4. If $b = 1$ then we accept $x$ as the sample from $p$.
5. If $b = 0$ then we sample $x' \sim p'$ from the adjusted distribution and use $x'$ as the sample from $p$.
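The steps above can be implemented and checked empirically in a few lines; the concrete distributions below are arbitrary test values, and the empirical frequencies should match the target $p$, not the draft $q$.

```python
import numpy as np

def speculative_sample(p, q, rng):
    """Sample from target p via draft q, following steps 1-5 above."""
    x = rng.choice(len(q), p=q)              # step 1: draw from the draft
    if p[x] >= q[x]:
        return int(x)                        # step 2: accept outright
    if rng.random() < p[x] / q[x]:           # steps 3-4: Bernoulli accept
        return int(x)
    residual = np.maximum(p - q, 0.0)        # step 5: adjusted distribution p'
    return int(rng.choice(len(p), p=residual / residual.sum()))

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target distribution
q = np.array([0.2, 0.5, 0.3])   # draft distribution
samples = [speculative_sample(p, q, rng) for _ in range(100_000)]
freq = np.bincount(samples, minlength=3) / len(samples)
```

The key property is that the output is distributed exactly according to $p$ regardless of how poor the draft $q$ is; a bad draft only lowers the acceptance rate.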
This sampling scheme applies to LLM inference: consider a sequence of tokens $x_1, \dots, x_t$. We wish to generate the next token from a large (computationally expensive) model $M_p$, which amounts to sampling $x_{t+1} \sim p(\cdot \mid x_{\le t})$; instead we sample from a small (less compute-intensive) model $M_q$: $\tilde{x}_{t+1} \sim q(\cdot \mid x_{\le t})$. Now we run $M_p$ once on the extended sequence, which yields both $p(\cdot \mid x_{\le t})$ and $p(\cdot \mid x_{\le t}, \tilde{x}_{t+1})$, and sample $x_{t+2} \sim p(\cdot \mid x_{\le t}, \tilde{x}_{t+1})$. At this point we proceed to step 1 above to verify $\tilde{x}_{t+1}$, and keep the next two tokens only if $\tilde{x}_{t+1}$ is accepted.
We can generalize this to speculating on the future $\gamma$ tokens given the prompt $x_{\le t}$. Let $M_p$ be more computationally expensive than $M_q$.
1. Generate draft tokens $\tilde{x}_{t+1}, \dots, \tilde{x}_{t+\gamma}$ with $M_q$ by sequentially decoding one token after another, starting from the prompt $x_{\le t}$.
2. Evaluate $p(\cdot \mid x_{\le t}, \tilde{x}_{t+1}, \dots, \tilde{x}_{t+j-1})$ in parallel for all $1 \le j \le \gamma$, then compare with $q(\cdot \mid x_{\le t}, \tilde{x}_{t+1}, \dots, \tilde{x}_{t+j-1})$ and the relevant Bernoulli variables to determine the first rejection index $j^\ast$, and sample $x_{t+j^\ast}$ from the adjusted distribution.
3. Take $x_{t+j} = \tilde{x}_{t+j}$ for all $j < j^\ast$.
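One round of this draft-then-verify loop can be sketched as below. Here `q_model` and `p_model` are stand-ins (assumptions of this sketch) for the small and large model forward passes, each returning a next-token distribution given a token sequence; in a real system the $\gamma$ target-model evaluations happen in one batched forward pass rather than in a Python loop.

```python
import numpy as np

def speculative_decode_step(prompt, gamma, q_model, p_model, rng):
    """One speculation round: draft gamma tokens with q, verify with p."""
    drafts = []
    seq = list(prompt)
    for _ in range(gamma):                  # step 1: sequential draft decode
        q = q_model(seq)
        x = int(rng.choice(len(q), p=q))
        drafts.append((x, q))
        seq.append(x)
    accepted = list(prompt)
    for x, q in drafts:                     # step 2: verify draft positions
        p = p_model(accepted)
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)              # step 3: keep tokens before rejection
        else:                               # first rejection: resample from p'
            residual = np.maximum(p - q, 0.0)
            accepted.append(int(rng.choice(len(p), p=residual / residual.sum())))
            break
    return accepted
```

Even on rejection at index $j^\ast$, the round emits $j^\ast$ tokens for a single (parallel) pass of the expensive model; production implementations additionally sample a free bonus token from $p$ when all $\gamma$ drafts are accepted.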
EAGLE 1-3
TBD.
Multitoken Prediction
TBD.
Paged Attention
Inference requires efficient KV cache memory management.
Ring Attention
Long context.
Xue J. Zhao © 2026