Optimizers
Mar 3
Let $u, v \in \mathbb{R}^n$. Define componentwise multiplication $u \odot v$ by $(u \odot v)_i = u_i v_i$, and componentwise division $u / v$ by $(u / v)_i = u_i / v_i$, defined if $v_i \neq 0$ for all $i$. The notation $u^2$ means $u \odot u$, and $\sqrt{u}$ is taken componentwise.
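These componentwise conventions coincide with numpy's elementwise array arithmetic; a minimal illustration (the vectors here are arbitrary examples):

```python
import numpy as np

# Componentwise operations matching the definitions above.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

prod = u * v        # componentwise multiplication: (u ⊙ v)_i = u_i v_i
quot = u / v        # componentwise division, defined since v_i != 0 for all i
sq = u ** 2         # u^2 means u ⊙ u
root = np.sqrt(v)   # componentwise square root
```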
AdamW
This is a trajectory dependent optimizer. Let $g_t$ denote the model's gradient vector at step $t$. We keep two sequences $m_t$ and $v_t$, with $m_0 = v_0 = 0$, defined by
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2.$$
Compute the bias-corrected estimates
$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}.$$
The new weight is
$$\theta_t = \theta_{t-1} - \eta \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$
Sometimes a weight decay regularization term $\lambda \theta_{t-1}$ is present (decoupled from the gradient in AdamW):
$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_{t-1} \right).$$
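The update above can be sketched in a few lines of numpy; this is a minimal single-step sketch, not a production implementation, and the default hyperparameter values are common choices rather than prescribed by the text:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update. theta: weights, g: gradient at step t,
    (m, v): optimizer states, t: 1-indexed step count.
    Returns the updated (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2   # second-moment EMA (componentwise square)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Descent step plus decoupled weight decay.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

Note that the caller, not the function, carries the states $m$ and $v$ between steps, which is exactly the persistent optimizer memory discussed below.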
Memory Usage and Lifetime
The memory consumption of AdamW is evident from the weight update equations: to compute $\theta_t$ the trainer needs to hold the following vectors in memory: $m_t$ and $v_t$, which are known as optimizer states, as well as the gradient $g_t$ and the weight $\theta_{t-1}$.
While the weights and optimizer states are persistent throughout training, the gradients are temporary: each gradient is discarded after its weight update during the backward pass. Another temporary memory usage is the activation memory of each layer containing weights, which is allocated during the forward pass and deallocated once that layer's gradients have been computed. The activation memory is batch size dependent: consider a linear layer that implements $y = Wx$, and let $L$ be the loss function. For a batch of inputs $X \in \mathbb{R}^{B \times d_{\text{in}}}$ (batch size $B$, one input per row) the layer computes $Y = X W^\top$, and the activation saved is $X$ (batch size dependent), which is used to compute
$$\frac{\partial L}{\partial W} = \left( \frac{\partial L}{\partial Y} \right)^{\!\top} X.$$
Recall the derivative of a real valued function with respect to a matrix is a matrix of the same shape whose components are the derivatives of the function with respect to the corresponding matrix components, i.e. $\bigl( \partial L / \partial W \bigr)_{ij} = \partial L / \partial W_{ij}$.
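A minimal numpy sketch of this forward/backward pair, making the saved, batch-size-dependent activation explicit (all shapes and the upstream gradient here are illustrative assumptions):

```python
import numpy as np

# Assumed example shapes: batch B, input dim d_in, output dim d_out.
B, d_in, d_out = 32, 64, 16
rng = np.random.default_rng(0)
X = rng.standard_normal((B, d_in))       # batch input
W = rng.standard_normal((d_out, d_in))   # layer weights

# Forward: compute Y = X W^T and save the activation X for backward.
Y = X @ W.T                              # shape (B, d_out)
saved_activation = X                     # grows linearly with batch size B

# Backward: given the upstream gradient dL/dY, the weight gradient
# dL/dW = (dL/dY)^T X uses the saved activation and has the shape of W.
dY = rng.standard_normal((B, d_out))     # stand-in for dL/dY
dW = dY.T @ saved_activation             # shape (d_out, d_in), same as W
```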
To manage GPU memory, the activation memory of a layer is sometimes CPU-offloaded, or discarded after the forward pass through that layer and then recomputed on the fly during the backward pass through it. Temporary memory causes GPU memory to fluctuate within each forward and backward pass, and to cycle from one training loop iteration to the next.
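The discard-and-recompute strategy (activation checkpointing) can be sketched as follows; this is a hand-rolled illustration on a tiny two-layer network, not a real framework API, and all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((2, 8))
x = rng.standard_normal((4,))

def layer1(x):
    # ReLU hidden layer whose activation we choose NOT to keep.
    return np.maximum(W1 @ x, 0.0)

# Forward: do not save h = layer1(x); keep only the (cheaper) input x.
y = W2 @ layer1(x)

# Backward through layer 2: recompute h on the fly instead of loading
# a stored copy, trading extra FLOPs for lower peak memory.
dy = np.ones_like(y)                 # stand-in upstream gradient dL/dy
h = layer1(x)                        # recomputation happens here
dW2 = np.outer(dy, h)                # dL/dW2 uses the recomputed activation
```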
Note that we specified memory usage in terms of the dimension of the vectors being stored. In practice the memory in bytes is determined by parameter count and precision (bytes per parameter). Let us ignore activation memory and compute some bounds on optimizer memory usage for an LLM with $N$ parameters. Suppose the lowest and highest precisions we use throughout the training pipeline are FP8 (1 byte) and FP32 (4 bytes). To update a single parameter, the optimizer stores 4 values (that parameter, its gradient, and the corresponding 2 optimizer state values), so the memory in bytes per parameter satisfies
$$4 \cdot 1 \le m \le 4 \cdot 4.$$
Thus the total memory allocated for the optimizer satisfies
$$4N \le M \le 16N \ \text{bytes}.$$
Converting with $1\,\text{GB} = 10^9$ bytes, optimizer memory for training a billion parameter LLM is around 4–16 GB.
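The back-of-envelope arithmetic above, written out:

```python
# Per-parameter AdamW footprint: 4 values (weight, gradient, m, v),
# at precisions ranging from FP8 (1 byte) to FP32 (4 bytes).
n_params = 1_000_000_000           # a 1B-parameter LLM
values_per_param = 4
low = values_per_param * 1 * n_params    # everything stored in FP8
high = values_per_param * 4 * n_params   # everything stored in FP32
GB = 10 ** 9
print(low / GB, high / GB)               # prints the bounds in GB: 4.0 16.0
```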
Flash Optimizer
TBD.
Muon
TBD.
Lion
TBD.
SOAP
TBD.
Xue J. Zhao © 2026