LLM Backward Pass
Feb 1
Modern LLM architectures have two core computations: the attention operator, which exchanges information between tokens, and the multilayer perceptron (a.k.a. the dense or MoE feedforward layer), which is a position-wise operation. Both operators have three phases: the forward pass, the backward pass (used interchangeably with backpropagation), and optimization (the weight update using gradients).
In the MLP forward pass, the fundamental math operation is projection through a layer with weight matrix $W$, defined by $Y = XW$. In the attention forward pass, there are several matrix multiplications, such as $S = QK^T$. So the key question for the backward pass is: how do we backpropagate through a matrix product?
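As a concrete illustration, here is a minimal NumPy sketch of the two forward-pass products. The shapes ($t$ tokens, model dimension $d$, hidden dimension $h$) are toy values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes (assumed for illustration): t tokens, model dim d, hidden dim h.
t, d, h = 4, 8, 16

# MLP projection: each token (row of X) is mapped independently through W.
X = rng.standard_normal((t, d))
W = rng.standard_normal((d, h))
Y = X @ W                      # shape (t, h)

# Attention scores: every pair of tokens interacts via a dot product.
Q = rng.standard_normal((t, d))
K = rng.standard_normal((t, d))
S = Q @ K.T / np.sqrt(d)       # shape (t, t)
```

Note how the MLP product mixes feature dimensions within each token, while the score matrix $S$ is token-by-token: this is the "position-wise vs. token-mixing" split described above.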
Activation Gradient and Weight Gradient
In the LLM forward pass, the activation of a layer is defined as the output of that layer, and it becomes the input of the next layer. If a layer $f$ computes $Y = XW$, then $Y$ is the activation of $f$. During the backward pass through $f$, the gradient of the loss $L$ with respect to the weights $W$ needs to be computed; this gradient, $\nabla_W L$, is called the weight gradient. During backpropagation through $f$ we additionally need to compute the gradient of the loss with respect to the input $X$, in order to later compute the weight gradient associated with the layer that produced $X$; this gradient, $\nabla_X L$, is called the activation gradient. By contrast, when backpropagating through an operation such as $S = QK^T$ in attention, the gradients with respect to $Q$ and $K$ are both activation gradients, since they are not gradients of any weights in the network, but rather gradients used to compute weight gradients further along the backward pass.
Interlude
In the LLM literature, the derivative of a real-valued function $L$ defined on a set of matrices $W \in \mathbb{R}^{m \times n}$ is defined as the matrix $\nabla_W L \in \mathbb{R}^{m \times n}$ whose entries are the partial derivatives $(\nabla_W L)_{ij} = \partial L / \partial W_{ij}$. While the shape of this matrix (matching $W$) is convenient for implementing neural network optimization in a framework such as PyTorch, $\nabla_W L$ is not the Jacobian matrix (i.e., the mathematical definition of the derivative of a function), and the chain rule is not conveniently expressed in terms of $\nabla_W L$. To ensure correctness in our derivations, let us give a few definitions to make this distinction clear.
Let $U$ be an open subset of $\mathbb{R}^n$, and consider a mapping $f: U \to \mathbb{R}^m$. We say that $f$ is differentiable at a point $x \in U$ if there is a linear mapping $\mathbb{R}^n \to \mathbb{R}^m$, described by a matrix $A \in \mathbb{R}^{m \times n}$, such that for all $h$ in an open neighbourhood of the origin in $\mathbb{R}^n$ we have
$$f(x + h) = f(x) + A h + o(\|h\|),$$
where $\|\cdot\|$ is the Euclidean norm or any of its equivalents on $\mathbb{R}^n$ or $\mathbb{R}^m$. If this condition is not satisfied then the function is not differentiable at that point, and a sufficient condition for differentiability of $f$ at $x$ is the existence and continuity of all first-order partial derivatives of $f$ at $x$. Write $f = (f_1, \dots, f_m)$, where $f_i$ is the $i$-th component function of $f$. If the derivative exists then $A$ is the matrix of partial derivatives $A_{ij} = \partial f_i / \partial x_j$, is called the derivative (or Jacobian matrix) of $f$ at $x$, and is denoted $Df(x)$. We write $f \in C^1(U)$ if $f$ has continuous first-order partial derivatives everywhere on $U$, in which case we also say $f$ is smooth when the domain is understood tacitly from the context.
The chain rule says that if $U \subseteq \mathbb{R}^n$ and $V \subseteq \mathbb{R}^m$ are open subsets of Euclidean spaces (possibly of different dimensions), and $f: U \to V$ and $g: V \to \mathbb{R}^p$ satisfy the smoothness conditions $f \in C^1(U)$ and $g \in C^1(V)$, then $g \circ f$ is differentiable at all points $x \in U$ and the derivative (i.e., the Jacobian, math version) is given by the matrix product
$$D(g \circ f)(x) = Dg(f(x))\, Df(x).$$
Let us compare the ML version and the math version of the derivative of a matrix-input, real-valued function $L: \mathbb{R}^{m \times n} \to \mathbb{R}$. The ML version $\nabla_W L$ is a matrix of partial derivatives with size matching the input, i.e. $m \times n$. The Jacobian $DL(W)$ has shape $1 \times mn$. In order to apply the chain rule to the ML version, it is convenient to use the special case of the chain rule where $p = 1$, i.e. $f: U \to V$ with $V \subseteq \mathbb{R}^m$ and $g: V \to \mathbb{R}$. This is called the component-wise chain rule, and it says
$$D(g \circ f)(x) = Dg(f(x))\, Df(x) \in \mathbb{R}^{1 \times n}.$$
In particular, for all $1 \le j \le n$ we have
$$\frac{\partial (g \circ f)}{\partial x_j}(x) = \sum_{i=1}^{m} \frac{\partial g}{\partial y_i}(f(x))\, \frac{\partial f_i}{\partial x_j}(x).$$
Sometimes we will use Einstein summation notation to express the above as
$$\partial_{x_j}(g \circ f) = \partial_{y_i} g\; \partial_{x_j} f_i,$$
where the sum symbol is omitted in the expression on the right for clarity and it is understood that summation occurs over all repeated indices within the same expression ($i$ in this case).
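To make the shape distinction concrete, here is a small NumPy sketch. The test function $L(W) = \sum_{ij} W_{ij}^2$ is an assumption chosen for illustration only; its ML-version gradient is $2W$, and the math-version Jacobian is the same numbers flattened into a single row:

```python
import numpy as np

# ML-version gradient of L(W) = sum(W**2): same shape as W.
# The math-version Jacobian, viewed on the flattened input, is 1 x (m*n).
m, n = 2, 3
W = np.arange(m * n, dtype=float).reshape(m, n)

grad_ml = 2 * W                       # ML version: shape (m, n)
jacobian = grad_ml.reshape(1, m * n)  # math version: shape (1, m*n)

# Spot-check one entry against a central finite difference.
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
fd = (np.sum((W + E) ** 2) - np.sum((W - E) ** 2)) / (2 * eps)
assert abs(fd - grad_ml[1, 2]) < 1e-4
```

Both objects hold the same $mn$ partial derivatives; only the arrangement differs, and that arrangement is what makes the chain rule clean in one convention and the weight update clean in the other.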
Problem Formulation
The question of how to backpropagate through a matrix product is made more precise as follows. Let $L: \mathbb{R}^{m \times n} \to \mathbb{R}$ be smooth, and let $Y \in \mathbb{R}^{m \times n}$ be defined as the matrix product $Y = XW$ of $X \in \mathbb{R}^{m \times k}$ and $W \in \mathbb{R}^{k \times n}$, with product elements $Y_{ij} = \sum_{p=1}^{k} X_{ip} W_{pj}$ (a vector is regarded as a one-row or one-column matrix when needed). The map $(X, W) \mapsto XW$ is smooth because each component $Y_{ij}$ is a polynomial in its inputs. Define $g(X, W) = L(XW)$. We can regard $g$ as a function $\mathbb{R}^{mk + kn} \to \mathbb{R}$ by flattening $X$ and $W$ into vectors. Then we know that $Dg$ (the math version derivative) exists and has shape $1 \times (mk + kn)$. For the ML version of the derivative we want to reshape $Dg$ as a block pair $(\nabla_X g, \nabla_W g)$, where $\nabla_X g \in \mathbb{R}^{m \times k}$ is the matrix of partial derivatives of $g$ with respect to $X$ when $W$ is fixed, and $\nabla_W g \in \mathbb{R}^{k \times n}$ is the matrix of partial derivatives of $g$ with respect to $W$ when $X$ is fixed. We thus introduce the matrix derivative notation (i.e. the ML version of the derivative)
$$(\nabla_X L)_{ip} = \frac{\partial L}{\partial X_{ip}} \qquad \text{and} \qquad (\nabla_W L)_{pj} = \frac{\partial L}{\partial W_{pj}}.$$
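A small numerical sketch can make this reshape concrete. Assuming the simple test loss $L(Y) = \sum_{ij} Y_{ij}$ (chosen only so the finite differences are easy to check), we can build the single-row $Dg$ column by column and then cut it into the two ML-version blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 2, 3, 4
X = rng.standard_normal((m, k))
W = rng.standard_normal((k, n))

def unpack(theta):
    # Split the flat parameter vector back into the matrices X and W.
    return theta[:m * k].reshape(m, k), theta[m * k:].reshape(k, n)

def g(theta):
    # g = L(XW) with L = sum of entries, a simple smooth test loss.
    A, B = unpack(theta)
    return (A @ B).sum()

# Build the math-version derivative Dg numerically:
# a single row with mk + kn columns, indexed by the flattened (X, W).
theta = np.concatenate([X.ravel(), W.ravel()])
eps = 1e-6
Dg = np.zeros((1, m * k + k * n))
for i in range(theta.size):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    Dg[0, i] = (g(tp) - g(tm)) / (2 * eps)

# Reshape the single row into the two ML-version gradient blocks.
grad_X = Dg[0, :m * k].reshape(m, k)
grad_W = Dg[0, m * k:].reshape(k, n)
```

The point of the sketch is that `grad_X` and `grad_W` contain exactly the entries of `Dg`, only rearranged to match the shapes of `X` and `W`.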
The logic behind shaping the gradients this way is as follows. We enumerate the layers of a neural network in the direction of the forward pass, $f_1, f_2, \dots, f_N$, where the first layer receives the inputs (tokens, say) and the final layer has the loss $L$ as output. An intermediate layer $f_\ell$ takes as input the activation $A_{\ell-1}$ of the previous layer and a weight matrix $W_\ell$, and computes $A_\ell = A_{\ell-1} W_\ell$. This is the activation of $f_\ell$, which becomes the input of the subsequent layers until the loss is computed as $L = (f_N \circ \cdots \circ f_{\ell+1})(A_\ell)$. Therefore $\nabla_{W_\ell} L$ is the weight gradient and $\nabla_{A_{\ell-1}} L$ is the activation gradient.
Derivation
The hardest part was formulating the problem precisely as above; the solution is rather straightforward. Using our notation, backward gradient computation happens from $f_N$ down to $f_1$. Focus on an intermediate layer $f_\ell$ that computes $A_\ell = A_{\ell-1} W_\ell$ in the forward pass; we have at hand $\nabla_{A_\ell} L$, which is the activation gradient computed during the backward pass through $f_{\ell+1}$. So we first need to compute $\nabla_{A_{\ell-1}} L$. Writing $X = A_{\ell-1}$, $W = W_\ell$, and $Y = A_\ell$, this is exactly the problem formulated above.
Recall that $X$ is an $m \times k$ matrix and $W$ is a $k \times n$ matrix. For all $1 \le i \le m$ and $1 \le j \le n$ we have (using einsum)
$$Y_{ij} = X_{ip} W_{pj}.$$
For all $1 \le a \le m$ and $1 \le b \le k$, and for all $1 \le a' \le k$ and $1 \le b' \le n$,
$$\frac{\partial Y_{ij}}{\partial X_{ab}} = \delta_{ia} W_{bj} \qquad \text{and} \qquad \frac{\partial Y_{ij}}{\partial W_{a'b'}} = X_{ia'} \delta_{jb'},$$
where $\delta_{ia} = 1$ when $i = a$ and zero otherwise. Using the componentwise chain rule defined above, the components of $\nabla_X L$ and $\nabla_W L$ are
$$(\nabla_X L)_{ab} = \frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{ab}} = (\nabla_Y L)_{ij}\, \delta_{ia} W_{bj} = (\nabla_Y L)_{aj}\, W_{bj}$$
and
$$(\nabla_W L)_{a'b'} = \frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial W_{a'b'}} = (\nabla_Y L)_{ij}\, X_{ia'} \delta_{jb'} = X_{ia'}\, (\nabla_Y L)_{ib'}.$$
The above two lines specify how components are computed; in matrix notation,
$$\nabla_X L = (\nabla_Y L)\, W^T \qquad \text{and} \qquad \nabla_W L = X^T\, (\nabla_Y L).$$
We can even sanity-check that the matrix shapes agree: $(m \times n)(n \times k) = m \times k$ and $(k \times m)(m \times n) = k \times n$.
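The two formulas are easy to verify numerically. The sketch below assumes the test loss $L(Y) = \sum_{ij} Y_{ij}^2$ (an illustration-only choice, so that the upstream gradient is $\nabla_Y L = 2Y$) and spot-checks both gradients against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, n = 3, 4, 5
X = rng.standard_normal((m, k))
W = rng.standard_normal((k, n))

# Test loss L(Y) = sum(Y**2), so the upstream gradient is grad_Y = 2Y.
Y = X @ W
grad_Y = 2 * Y

# The two derived formulas.
grad_X = grad_Y @ W.T   # activation gradient, shape (m, k)
grad_W = X.T @ grad_Y   # weight gradient, shape (k, n)

# Finite-difference check of one entry of each.
eps = 1e-6

def loss(X, W):
    return ((X @ W) ** 2).sum()

EX = np.zeros_like(X)
EX[0, 1] = eps
fd_X = (loss(X + EX, W) - loss(X - EX, W)) / (2 * eps)

EW = np.zeros_like(W)
EW[2, 3] = eps
fd_W = (loss(X, W + EW) - loss(X, W - EW)) / (2 * eps)

assert abs(fd_X - grad_X[0, 1]) < 1e-4
assert abs(fd_W - grad_W[2, 3]) < 1e-4
```

Note that the backward pass reuses the forward-pass inputs `X` and `W` only through their transposes, which is exactly why frameworks keep (or recompute) them for the backward pass.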
That's it! On reflection, the backward pass derivation takes just two lines: one for the activation gradient and one for the weight gradient. These gradients at layer $f_\ell$ are computed using the transposes of the input matrices $X$ and $W$, which need to be kept in memory or recomputed during the backward pass, as well as the activation gradient from the backward pass through $f_{\ell+1}$.
Backward Pass Through Attention
Let us look at how to backpropagate through scaled dot-product attention
$$O = PV, \qquad P = \mathrm{softmax}(S), \qquad S = \frac{QK^T}{\sqrt{d}},$$
where $Q, K, V \in \mathbb{R}^{t \times d}$. Softmax is applied over the rows (the last dim) of $S$. Thus let $S_i$ be row $i$ in the matrix $S$; with $P$ the matrix of the same shape as $S$, then
$$P_{ij} = \frac{e^{S_{ij}}}{\sum_{l} e^{S_{il}}}.$$
Each row of $P$ becomes a probability distribution which is used as coefficients in a linear combination of the corresponding rows of the value matrix,
$$O_i = \sum_{j} P_{ij}\, V_j.$$
There are no weights in this operation, so we only compute activation gradients, which we can compute using the formulas for $\nabla_X L$ and $\nabla_W L$ we derived in the previous section:
$$\nabla_P L = (\nabla_O L)\, V^T \qquad \text{and} \qquad \nabla_V L = P^T\, (\nabla_O L).$$
To see how to differentiate through softmax, let $\sigma: \mathbb{R}^{t \times t} \to \mathbb{R}^{t \times t}$ be the row-wise softmax function on matrices, so $P = \sigma(S)$. Let $\Phi$ be the rest of the forward pass after $P$ that results in the loss, $L = \Phi(P)$. Observe that the partial derivatives $\partial \Phi / \partial P_{ij}$ are precisely the components of the activation gradient matrix we derived above,
$$\frac{\partial \Phi}{\partial P_{ij}} = \left((\nabla_O L)\, V^T\right)_{ij},$$
where the right side denotes the matrix entry in row $i$ and column $j$. The partial derivatives of the row-wise softmax are
$$\frac{\partial P_{ij}}{\partial S_{ab}} = \delta_{ia}\, P_{ij}\left(\delta_{jb} - P_{ib}\right).$$
In particular this says
$$\frac{\partial P_{ij}}{\partial S_{ib}} = P_{ij}\left(\delta_{jb} - P_{ib}\right),$$
and $\partial P_{ij} / \partial S_{ab} = 0$ otherwise (i.e. when $a \neq i$). This can be restated as follows: if $s \in \mathbb{R}^t$ is a vector and $p = \mathrm{softmax}(s)$, then its derivative (Jacobian) is
$$D\,\mathrm{softmax}(s) = \mathrm{diag}(p) - p\, p^T.$$
Returning to calculating $\nabla_S L$ using the componentwise chain rule we defined above,
$$(\nabla_S L)_{ab} = \frac{\partial L}{\partial P_{ij}} \frac{\partial P_{ij}}{\partial S_{ab}} = (\nabla_P L)_{aj}\, P_{aj}\left(\delta_{jb} - P_{ab}\right).$$
Thus
$$(\nabla_S L)_a = (\nabla_P L)_a \left(\mathrm{diag}(P_a) - P_a^T P_a\right),$$
where $M_a$ denotes the $a$-th row of the matrix $M$ (as a $1 \times t$ row vector). As a sanity check let us inspect the dimensions: $(1 \times t)(t \times t) = 1 \times t$, one row of $\nabla_S L$ as expected. Finally, the remaining backward steps are differentiation through the matrix product $S = QK^T / \sqrt{d}$, and using the formulas we derived earlier, we have
$$\nabla_Q L = \frac{1}{\sqrt{d}}\, (\nabla_S L)\, K \qquad \text{and} \qquad \nabla_K L = \frac{1}{\sqrt{d}}\, (\nabla_S L)^T\, Q.$$
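Putting the whole section together, here is a NumPy sketch of the attention backward pass. It assumes the test loss $L(O) = \sum_{ij} O_{ij}$ (so $\nabla_O L$ is a matrix of ones) and toy shapes, and the final line spot-checks $\nabla_Q L$ against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(3)
t, d = 4, 6
Q = rng.standard_normal((t, d))
K = rng.standard_normal((t, d))
V = rng.standard_normal((t, d))

def softmax_rows(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))  # stable row-wise softmax
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    S = Q @ K.T / np.sqrt(d)
    P = softmax_rows(S)
    return S, P, P @ V

# Forward, with L(O) = sum(O) as a simple test loss, so grad_O = ones.
S, P, O = attention(Q, K, V)
grad_O = np.ones_like(O)

# Backward through O = PV (matrix-product rule).
grad_P = grad_O @ V.T
grad_V = P.T @ grad_O

# Backward through the row-wise softmax:
# row a: grad_S[a] = P[a] * (grad_P[a] - <grad_P[a], P[a]>).
grad_S = P * (grad_P - (grad_P * P).sum(axis=-1, keepdims=True))

# Backward through S = Q K^T / sqrt(d) (matrix-product rule again).
grad_Q = grad_S @ K / np.sqrt(d)
grad_K = grad_S.T @ Q / np.sqrt(d)

# Finite-difference spot check on one entry of Q.
eps = 1e-6
E = np.zeros_like(Q)
E[1, 2] = eps
f = lambda Q_: attention(Q_, K, V)[2].sum()
fd = (f(Q + E) - f(Q - E)) / (2 * eps)
assert abs(fd - grad_Q[1, 2]) < 1e-4
```

The `grad_S` line is the row formula $(\nabla_S L)_a = (\nabla_P L)_a (\mathrm{diag}(P_a) - P_a^T P_a)$ written without materializing the $t \times t$ Jacobian per row, which is how softmax backward is typically implemented.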
Xue J. Zhao © 2026