In the LLM literature, the derivative of a real-valued function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ defined on a set of matrices is defined as the matrix $\frac{\partial f}{\partial W}$ whose entries are the partial derivatives $\left(\frac{\partial f}{\partial W}\right)_{ij} = \frac{\partial f}{\partial W_{ij}}$. While the $m \times n$ shape of this matrix is useful for implementing neural network optimization in a framework such as PyTorch, $\frac{\partial f}{\partial W}$ is not the Jacobian matrix (i.e. the mathematical definition of the derivative of a function), and the chain rule is not conveniently expressed in terms of $\frac{\partial f}{\partial W}$. To ensure correctness in our derivations, let us fix a few definitions to make this distinction clear.
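The shape difference can be seen numerically. Below is a minimal sketch (assuming NumPy; the function `f` and the helper `ml_gradient` are illustrative, not from the text) that computes the "ML version" gradient of a simple matrix-input function by central differences, arranged with the same shape as the input, and then flattens it into the $1 \times mn$ Jacobian row vector:

```python
import numpy as np

def f(W):
    # Example real-valued function of a matrix: f(W) = sum of squared entries.
    return float((W ** 2).sum())

def ml_gradient(W, eps=1e-6):
    # "ML version" derivative: matrix of partials df/dW_ij, same shape as W.
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
G = ml_gradient(W)       # shape (2, 3): matches the input, like PyTorch's W.grad
J = G.reshape(1, -1)     # the Jacobian of f is the 1 x (mn) row vector
print(G.shape, J.shape)  # (2, 3) (1, 6)
```

For this example $\frac{\partial f}{\partial W} = 2W$, so the numerical gradient should agree with $2W$ up to finite-difference error.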
Let $U$ be an open subset of $\mathbb{R}^n$, and consider a mapping $f : U \to \mathbb{R}^m$. We say that $f$ is differentiable at a point $x \in U$ if there is a linear mapping $\mathbb{R}^n \to \mathbb{R}^m$, described by an $m \times n$ matrix $A$, such that for all $h$ in an open neighbourhood of the origin in $\mathbb{R}^n$ we have $$f(x + h) = f(x) + Ah + o(\|h\|),$$ where $\|\cdot\|$ is the Euclidean norm or any of its equivalents on $\mathbb{R}^n$ or $\mathbb{R}^m$. If this condition is not satisfied then the function is not differentiable at that point, and a sufficient condition for differentiability of $f$ at $x$ is the existence and continuity of all first order partial derivatives of $f$ at $x$. Recall that $f_i$ is the $i$-th component function of $f = (f_1, \ldots, f_m)$. If the derivative exists then $A$ is the matrix of partial derivatives $A_{ij} = \frac{\partial f_i}{\partial x_j}(x)$ and is called the derivative (or Jacobian matrix) of $f$ at $x$, denoted $Df(x)$. We write $f \in C^1(U)$ if $f$ has continuous first order partial derivatives everywhere on $U$, in which case we also say $f$ is smooth when the domain $U$ is understood tacitly from the context.
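The entries $A_{ij} = \frac{\partial f_i}{\partial x_j}$ can be checked numerically. The following sketch (assuming NumPy; `jacobian` and the example map are illustrative helpers) approximates the $m \times n$ Jacobian of a map $\mathbb{R}^2 \to \mathbb{R}^2$ by central differences:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    # Approximate the m x n Jacobian Df(x): column j holds the partials
    # of all component functions f_i with respect to x_j.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# f(x, y) = (x*y, x + y); its Jacobian at (x, y) is [[y, x], [1, 1]].
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
x = np.array([2.0, 3.0])
print(jacobian(f, x))  # approximately [[3, 2], [1, 1]]
```

Note the row index runs over component functions and the column index over input coordinates, matching the $m \times n$ convention above.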
The chain rule says that if $U \subseteq \mathbb{R}^n$ and $V \subseteq \mathbb{R}^m$ are open subsets of Euclidean spaces (possibly of different dimensions), and $f : U \to V$ and $g : V \to \mathbb{R}^k$ satisfy the smoothness conditions $f \in C^1(U)$ and $g \in C^1(V)$, then $g \circ f$ is differentiable at all points $x \in U$ and the derivative (i.e. Jacobian, math version) is given by the matrix product $$D(g \circ f)(x) = Dg(f(x)) \, Df(x).$$
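As a sanity check, the matrix-product form of the chain rule can be verified numerically. A minimal sketch (assuming NumPy; the maps `f`, `g` and the finite-difference `jacobian` helper are illustrative) for $f : \mathbb{R}^2 \to \mathbb{R}^3$ and $g : \mathbb{R}^3 \to \mathbb{R}^2$:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    # Central-difference approximation of the Jacobian matrix Df(x).
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * eps)
    return J

f = lambda v: np.array([v[0] ** 2, v[0] * v[1], np.sin(v[1])])  # R^2 -> R^3
g = lambda w: np.array([w[0] + w[1], w[1] * w[2]])              # R^3 -> R^2
x = np.array([1.0, 2.0])

lhs = jacobian(lambda v: g(f(v)), x)      # Jacobian of the composite, 2 x 2
rhs = jacobian(g, f(x)) @ jacobian(f, x)  # (2 x 3) times (3 x 2)
print(np.allclose(lhs, rhs, atol=1e-4))   # True
```

The shapes compose exactly because the Jacobian convention puts component functions on rows and input coordinates on columns; the ML-style gradient matrix would not multiply this way.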
Let us compare the ML version and the math version for the derivative of a matrix-input, real-valued function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$. The ML version $\frac{\partial f}{\partial W}$ is a matrix of partial derivatives with size $m \times n$ matching the input. The Jacobian has shape $1 \times mn$. In order to apply the chain rule to the ML version, it is convenient to use the special case of the chain rule where $k = 1$, i.e. $u : U \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}$. This is called the component-wise chain rule and it says $$D(g \circ u)(x) = Dg(u(x)) \, Du(x).$$ In particular for all $1 \le j \le n$ we have $$\frac{\partial (g \circ u)}{\partial x_j} = \sum_{k=1}^{m} \frac{\partial g}{\partial u_k} \frac{\partial u_k}{\partial x_j}.$$ Sometimes we will use the Einstein summation notation to express the above as $$\frac{\partial (g \circ u)}{\partial x_j} = \frac{\partial g}{\partial u_k} \frac{\partial u_k}{\partial x_j},$$ where the sum symbol is omitted in the expression on the right for clarity and it is understood that summation occurs over all repeated indices within the same expression ($k$ in this case).
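The Einstein-summation form maps directly onto `np.einsum`, where a repeated index letter is summed automatically. A short sketch (assuming NumPy; the particular values of `grad_g` and `Ju` are made up for illustration):

```python
import numpy as np

# Component-wise chain rule for g(u(x)) with g : R^m -> R and u : R^n -> R^m:
#   d(g o u)/dx_j = sum_k dg/du_k * du_k/dx_j.
# The repeated index k is exactly what np.einsum sums over.
grad_g = np.array([1.0, 2.0, 3.0])  # dg/du_k, shape (m,) with m = 3
Ju = np.array([[1.0, 0.0],          # du_k/dx_j, shape (m, n) with n = 2
               [0.5, 2.0],
               [0.0, 1.0]])

grad_f = np.einsum('k,kj->j', grad_g, Ju)  # sum over the repeated index k
print(grad_f)                              # [2. 7.], same as grad_g @ Ju
```

Here `'k,kj->j'` states the rule verbatim: `k` appears in both operands and not in the output, so it is summed, leaving a vector indexed by `j`.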