Training LLMs in Low Precision
Modified on Jan 31
Consider how a real number $x$ can be represented on a computer. If $x \neq 0$ then $x = s|x|$ where $s \in \{+1, -1\}$ is the sign of $x$ and can be represented with one bit. For the positive number $|x|$ there exists a smallest integer $e$ such that $2^e \le |x| < 2^{e+1}$. Now we can divide this interval into two equal halves. By definition of $e$ we know that $|x| \in [2^e, 2^{e+1})$. We can again divide the interval into two equal halves and $|x|$ lies in either the left half or the right half (to break the tie, if $|x|$ lies in the middle, then we say it lies in the right half). As we make the sequence of interval partitions, the positive real number $|x|$ lies in either the left or the right half. Associating left with $0$ and right with $1$, we obtain a binary sequence $b_1 b_2 b_3 \cdots$. It is clear that after the $k$-th partition the situation is $|x| = 2^e\bigl(1 + \sum_{i=1}^{k} b_i 2^{-i}\bigr) + \epsilon_k$, where $\epsilon_k$ is less than the length $2^{e-k}$ of the half interval in the $k$-th partition. So given the positive real number $|x|$ and a desired precision $\delta > 0$, we can choose an integer $m$ such that $2^{e-m} < \delta$, so that the number $2^e\bigl(1 + \sum_{i=1}^{m} b_i 2^{-i}\bigr)$, which is representable by a binary sequence of length $m$, approximates $|x|$ to within $\delta$, namely $\bigl||x| - 2^e\bigl(1 + \sum_{i=1}^{m} b_i 2^{-i}\bigr)\bigr| < \delta$. Since $x = s|x|$ we can write $x \approx s \cdot 2^e\bigl(1 + \sum_{i=1}^{m} b_i 2^{-i}\bigr)$. Hardware approximates $x$ by storing one bit for the sign $s$, another $m$ bits $b_1 \cdots b_m$ called the mantissa of $x$, the bit length of which determines an upper bound on the error of approximating $|x|$, together with a representation of the exponent $e$.
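The bisection above can be carried out mechanically. The following is a minimal sketch; the helper names `mantissa_bits` and `value` are made up for illustration:

```python
def mantissa_bits(x, m):
    """Approximate positive real x by 2**e * (1 + sum b_i 2**-i) via bisection.

    Returns (e, bits) where e is the unique integer with 2**e <= x < 2**(e+1)
    and bits are the first m binary digits obtained by halving the interval.
    """
    assert x > 0
    e = 0
    while 2 ** (e + 1) <= x:
        e += 1
    while 2 ** e > x:
        e -= 1
    t = x / 2 ** e          # position of x inside [1, 2) after dividing out 2**e
    bits = []
    lo, hi = 1.0, 2.0
    for _ in range(m):
        mid = (lo + hi) / 2
        if t >= mid:        # ties go to the right half, as in the text
            bits.append(1)
            lo = mid
        else:
            bits.append(0)
            hi = mid
    return e, bits

def value(e, bits):
    """The representable number 2**e * (1 + sum b_i 2**-i)."""
    return 2 ** e * (1 + sum(b * 2 ** -(i + 1) for i, b in enumerate(bits)))
```

For example $5.5 = 2^2 \cdot 1.375$ is captured exactly by four mantissa bits, and the approximation error is always below $2^{e-m}$.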
Now we need a scheme for representing the integer $e$ in bits. Suppose that $k$ bits are designated to store the exponent; then there are $2^k$ configurations, meaning that the real numbers the float can approximate span $2^k$ orders of magnitude. Naturally we want to choose orders of magnitude about the multiplicative unit $1 = 2^0$. Customarily we designate $2^{k-1}$ of the orders of magnitude to be at or below the unit, and another $2^{k-1}$ orders of magnitude to be above the unit. The hardware implementation is to store the integer exponent $e$ with a non-negative integer $E = e + (2^{k-1} - 1)$; this way the smallest order of magnitude (with respect to this dtype), $e_{\min} = -(2^{k-1} - 1)$, is stored as $E = 0$, whereas the largest order of magnitude, $2^{k-1}$, is stored as $E = 2^k - 1$. This scheme is nothing but a bijection from $\{-(2^{k-1}-1), \ldots, 2^{k-1}\}$ to $\{0, 1, \ldots, 2^k - 1\}$. The quantity $2^{k-1} - 1$ is sometimes given the nondescript name bias just to be confusing! The mantissa, which we have until now defined as the bit string $b_1 \cdots b_m$, should really be identified with the real number $\sum_{i=1}^{m} b_i 2^{-i}$ that it represents, and together with $E$ and the sign bit $S$ represents the real number $x$ (approximately) on hardware as $x \approx (-1)^S \, 2^{E - (2^{k-1} - 1)} \bigl(1 + \sum_{i=1}^{m} b_i 2^{-i}\bigr).$
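A minimal sketch of this biased-exponent decoding, normals only (`decode` is a hypothetical helper, not a standard API):

```python
def decode(sign_bit, E, bits, k):
    """Decode a normal float from its stored fields: sign bit, biased
    exponent E (k bits), and list of mantissa bits, with bias = 2**(k-1) - 1."""
    bias = 2 ** (k - 1) - 1
    frac = 1 + sum(b * 2 ** -(i + 1) for i, b in enumerate(bits))
    return (-1) ** sign_bit * 2 ** (E - bias) * frac
```

With $k = 4$ the bias is $7$, so the stored $E = 7$ with a zero mantissa decodes to the unit $1.0$, and $E = 8$ with mantissa $100$ decodes to $2 \times 1.5 = 3.0$.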
The above analysis applies to $x \neq 0$. Since $0$ is a limit point of the set $\{2^e : e \in \mathbb{Z}\}$, there does not exist a smallest integer $e$ such that $2^e \le 0 < 2^{e+1}$. Suppose that a float dtype has $m$ bits to store the mantissa; it is natural to approximate zero with the smallest element in the set of positive representable numbers, the minimum value being $2^{e_{\min}}$, which corresponds to $E = 0$ and $b_1 = \cdots = b_m = 0$. Equivalently, zero can be approximated by the maximum, $-2^{e_{\min}}$, of the set consisting of the negative representable numbers. Therefore we have two representations of zero by the dtype. Now this argument is true to first order, and needs slight modification when we introduce the so-called subnormal numbers in practical floating point implementations. The idea is that $2^{e_{\min}}$ is not the smallest order of magnitude we can represent if in our implementation we customarily agree that when $E = 0$ we replace $2^{e_{\min}}\bigl(1 + \sum_{i=1}^{m} b_i 2^{-i}\bigr)$ by $2^{e_{\min}+1}\bigl(0 + \sum_{i=1}^{m} b_i 2^{-i}\bigr)$, where observe that $e_{\min}+1$ is the exponent stored as $E = 1$, whereby $2^{e_{\min}+1}$ is the smallest normal order of magnitude. If the mantissa has $m$ bits, then the smallest subnormal number is $2^{e_{\min}+1} \cdot 2^{-m}$ and the largest subnormal is $2^{e_{\min}+1}(1 - 2^{-m})$, corresponding with mantissa $0\cdots01$ and $1\cdots1$ respectively. The case with $E = 0$ and mantissa $0\cdots0$ is reserved to represent zero. Observe that magnitudes smaller than $2^{e_{\min}+1}$ are only expressible near zero, with subnormal numbers. That is, for a general real number $x$ with exponent $e$, the upper bound on the error of approximation is still $2^{e-m}$, even though for numbers below the smallest normal the error of approximation has bound $2^{e_{\min}+1-m}$.
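To make the subnormal bounds concrete, here is a small sketch assuming the IEEE-style convention above, where the subnormal exponent is $1 - \mathrm{bias}$ (the helper name is made up):

```python
def subnormal_range(k, m):
    """Smallest and largest positive subnormals for a dtype with k exponent
    bits and m mantissa bits: the subnormal exponent is 1 - bias and the
    implicit leading 1 is dropped."""
    bias = 2 ** (k - 1) - 1
    e_sub = 1 - bias
    smallest = 2 ** e_sub * 2 ** -m         # mantissa 0...01
    largest = 2 ** e_sub * (1 - 2 ** -m)    # mantissa 1...1
    return smallest, largest
```

For the FP16 layout ($k = 5$, $m = 10$) this yields the familiar smallest subnormal $2^{-24}$, just below the smallest normal $2^{-14}$.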
Let us build some intuition. (1) Observe that every mantissa bit of a normal number is significant, but for a subnormal number this need not be the case. For instance for a normal number with mantissa $00\cdots0$, every bit is significant thanks to the implicit leading $1$. For a subnormal number with mantissa $1b_2\cdots b_m$ every bit is significant, but a subnormal with mantissa $00\cdots01$ only has 1 significant bit. The leading zeros are placeholders and not significant. (2) A hand-wavy way to look at gaps between ints and floats. Let $n$ be the bit width: the gaps between consecutive Int-$n$ numbers are all the same, whereas for Float-$n$ numbers with an $m$-bit mantissa, the representable magnitudes are evenly spaced on a log scale. This is meant in the sense that for a float $2^e\bigl(1 + \sum_{i=1}^{m} b_i 2^{-i}\bigr)$ the nearby elements are spaced $2^{e-m}$ apart. In particular these gaps are larger by a factor of $2^e$ depending on the exponent of the float.
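Python's `math.ulp` exposes exactly this gap for float64 ($m = 52$), so intuition (2) can be checked directly:

```python
import math

# The gap between consecutive float64 values near 2**e * (1 + ...) is 2**(e - 52):
# it doubles every time the magnitude crosses a power of two.
gaps = {x: math.ulp(x) for x in [1.0, 2.0, 4.0, 1024.0]}
```

Printing `gaps` shows the spacing $2^{-52}, 2^{-51}, 2^{-50}, 2^{-42}$: constant within a binade, scaling with the exponent across binades.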
Let $x, y$ be non-negative integers. We say that a floating point data type has bit structure ExMy if every float is represented (in hardware or software) as 1 sign bit together with $x$ exponent bits and $y$ mantissa bits. It is sometimes the convention that if $y = 0$ then the number has no sign bit (as with the E8M0 scale type below).
Since any hardware floating point number consists of a finite number of bits, the float has a limited range of expressible orders of magnitude. For numbers outside this range, a data type sometimes designates bit patterns for the sign, mantissa, and exponent to represent $\pm\infty$. However this is optional; for example, in the so-called microscaled (MX) classes of floats, infinities are excluded in E4M3 MXFP8 but are included in E5M2 MXFP8. Another special value for a float is NaN, or not-a-number, which designates the result of undefined operations such as $0/0$. How to implement NaN for a given float is a choice. In some designs NaN is omitted, in others only one bit pattern (up to equivalence in sign) denotes NaN, and in yet others multiple such patterns are treated as NaN (compare E2M3, E4M3, E5M2 microscaled floats).
The aforementioned microscaled floats work as follows: instead of representing one real number at a time, MX floats represent collections of real numbers simultaneously. Suppose $d \in \mathbb{N}$, with a scale dtype $S$ and an element dtype $B$. Let $\mathcal{S}$ be the set of all $S$ floats, and let $\mathcal{B}$ be the set of all $B$ floats. The set $\mathcal{S} \times \mathcal{B}^d$ is called the $d$-dimensional microscaling (MX) float with scalers of type $S$ and block of type $B$. An element $(s, b_1, \ldots, b_d)$ is associated with the vector $(s b_1, \ldots, s b_d)$. There are dtype-specific rules, like if $s$ is NaN then $s b_i$ is NaN for all $i$, and implementation-defined rules for situations such as when some $b_i$ is $\pm\infty$. Within microscaled data types there are floats like MXFP4, where scales are of type E8M0, block elements are of type E2M1, and the dimension is $d = 32$, together with MX integers like MXINT8. Encodings for the mantissa are somewhat different than our discussion so far, in the sense that the mantissa has $2^{-m}$ factored out ($m$ being the number of mantissa bits), so that the unsigned integer $M = \sum_{i=1}^{m} b_i 2^{m-i}$ is stored as opposed to the bit string $b_1 \cdots b_m$ directly. In this scheme a normal number has $E \ge 1$ and value $(-1)^S \, 2^{E - \mathrm{bias}}\bigl(1 + M 2^{-m}\bigr)$, and a subnormal has $E = 0$ and value $(-1)^S \, 2^{1 - \mathrm{bias}} \cdot M 2^{-m}.$
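The association $(s, b_1, \ldots, b_d) \mapsto (s b_1, \ldots, s b_d)$ is easy to sketch; here the E8M0 scale is modeled as its stored integer exponent, and the function name is hypothetical:

```python
def mx_decode(scale_exp, elements):
    """Decode one MX block: an E8M0 scale (a pure power of two, modeled here
    by its integer exponent) times each low-precision block element."""
    s = 2.0 ** scale_exp
    return [s * b for b in elements]
```

For instance a block with scale exponent $3$ and E2M1 elements $(0.5, 1.5, -1.0)$ decodes to $(4, 12, -8)$.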
The dot product of two MX floats of the same dtype corresponds to the dot product of the vectors they represent: for every $(s, b_1, \ldots, b_d)$ and every $(t, c_1, \ldots, c_d)$ we define their dot product to be $st \sum_{i=1}^{d} b_i c_i$. The product between an element of $\mathcal{S}$ and an element of $\mathcal{B}$, and likewise the product between elements of $\mathcal{B}$ and elements of $\mathcal{B}$, are specified in the implementation. Specification of the output type depends on the situation.
Let $n = qd$. More generally a dot product can be defined between every vector $u \in (\mathcal{S} \times \mathcal{B}^d)^q$ and every $v \in (\mathcal{S} \times \mathcal{B}^d)^q$ by $\sum_{j=1}^{q} s_j t_j \sum_{i=1}^{d} b_{j,i} c_{j,i}$. This dot product requires the length of the vectors to be a multiple of the number $d$ of elements per MX block. This can be relaxed to any length $n$ by zero-padding to the nearest multiple of $d$ greater than or equal to $n$ and truncating the result back to length $n$.
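The blockwise dot product $\sum_j s_j t_j \sum_i b_{j,i} c_{j,i}$ with zero-padding can be sketched as follows (unit scales are used for simplicity, and the helper names are made up):

```python
def mx_dot(u, v):
    """Dot product of two lists of MX blocks, each block a (scale, elements)
    pair: per block it computes s * t * sum_i b_i * c_i, then sums over blocks."""
    total = 0.0
    for (s, bs), (t, cs) in zip(u, v):
        assert len(bs) == len(cs)
        total += s * t * sum(b * c for b, c in zip(bs, cs))
    return total

def pad_blocks(xs, d):
    """Zero-pad a raw vector to a multiple of d and split into unit-scale blocks."""
    xs = xs + [0.0] * (-len(xs) % d)
    return [(1.0, xs[i:i + d]) for i in range(0, len(xs), d)]
```

Padding with zeros leaves the dot product unchanged, which is why the length restriction is harmless.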
To represent a vector $(v_1, \ldots, v_d)$ in MX with maximal absolute value $a = \max_i |v_i|$, the corresponding scale factor is determined by $s = 2^{\lfloor \log_2 a \rfloor - \lfloor \log_2 b_{\max} \rfloor}$. In other words, the scale factor depends on $a$ and the largest normal number $b_{\max}$ for the block data type: it is the ratio of the largest power of two less than or equal to $a$ to the largest power of two less than or equal to $b_{\max}$. Observe that the scaling factor of MX is an integer power of $2$ (hence in MXFP8 the block scales are of type E8M0, which stores just an exponent).
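The power-of-two scale rule can be written directly; here $b_{\max} = 6$ would correspond to an E2M1 block dtype (the function name is made up):

```python
import math

def mx_scale(a, b_max):
    """MX scale factor: the ratio of the largest power of two <= a (the block
    amax) to the largest power of two <= b_max (largest normal of the block
    dtype); always an integer power of two."""
    return 2.0 ** (math.floor(math.log2(a)) - math.floor(math.log2(b_max)))
```

For $a = 100$ and $b_{\max} = 6$ this gives $2^{6-2} = 16$, so the block stores values around $100/16 \approx 6$.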
Let $Q$ be the map that quantizes real numbers to the block dtype with the scheme in the beginning of the blog, and which assigns values exceeding the maximal value to $\pm b_{\max}$ and values smaller in magnitude than the minimal normal value to the nearest subnormal or zero. The corresponding block values are then $Q(v_1/s), \ldots, Q(v_d/s)$.
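A round-to-nearest sketch of such a clamped quantizer over the non-negative E2M1 values (signs and tie-breaking rules are glossed over; the grid lists E2M1's subnormal $0.5$ and normals up to $6$):

```python
# Non-negative E2M1 values: zero, the one subnormal, and the normals.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_to_grid(x, grid):
    """Round x to the nearest value in a finite grid of representable numbers;
    values beyond the endpoints clamp to the endpoints."""
    return min(grid, key=lambda g: abs(g - x))
```

Note that clamping falls out for free: any $x > 6$ is nearest to the endpoint $6$.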
Mixed Precision Training
While reducing the number of bits to represent a float leads to higher compute throughput and less memory, for a fixed number of bits one faces a trade-off between range, as governed by the exponent bits, and precision, as governed by the mantissa bits. In practice the choice of this trade-off is informed by the sensitivity of the numerics to either factor during different stages of model training. To be concrete, it is empirically observed that computations in the forward pass are more sensitive to the precision of the data type being used, while backward pass gradients demand more range. Given 8 bits per float, E4M3 is often the choice for the forward pass while E5M2 is used for backward gradient computation. In general, the use of different data types for different stages of neural network training is known as mixed precision training.
The idea behind block scaled data types like the MX format above and NVFP4, which we discuss subsequently, is grounded in the observation that the range of a local patch of a tensor can be factored out as the scale, one number per block that requires a greater number of bits for range, while the other factor can be stored in a low-bit data type. This is analogous (to make a very crude analogy!) to the idea that a vector can be stored as a magnitude and a direction: a block of data can be stored as a scale and some leaner data type.
Using block scaled data types introduces many nuances. In a general matrix multiplication (GEMM) that involves granular scaled blocks (with the purpose of localizing outliers, say), the scaling now needs to take place during the GEMM mainloop, as opposed to in the epilogue as for per-tensor scaling. Hardware support for this only arrived with the Blackwell GPU architecture. From the algorithmic point of view, the neural network backward pass typically requires multiplying the activation gradient from a previous layer by the transpose of the activations input to the current layer, and due to the noncommutativity of transposition with block scaled quantization when 1D blocks are used (i.e. blocks along the GEMM reduction, or equivalently dot product, dimension), the chain rule breaks 🙀.
NVFP4
The datatype NVFP4 is a block scaled floating point type that continues the spirit of MXFP8 with a lower average bit count per float, more granular blocks, and two-level scaling. The block size is 16, the scale type is the E4M3 8-bit float, while the block storage type is the E2M1 4-bit float. In addition there is a tensor-level scale stored in FP32 that ensures the largest value for each block is representable in the range of what can be stored as an E4M3 at the block level. Thus on average NVFP4 uses $(16 \times 4 + 8)/16 = 4.5$ bits per value, plus the amortized cost of the tensor-level scale.
NVFP4 Pre-training Recipe
Software emulation of scaling, quantization, and dequantization. Random Hadamard transform; stochastic rounding for bias reduction, traded off against more variance; 1D/2D block scaling; selective precision. Change of basis to transform the distribution to be more Gaussian-like. Complexity. Empirically found recommendation: apply the random Hadamard transform to the weight gradient as opposed to the forward and activation gradients. Gradients are sensitive to quantization bias. (TBD)
Definition
The following comes from an earlier draft of this blog
Hardware represents floating point numbers in the form $(-1)^s \, 2^{p} (1 + f)$, where $s \in \{0, 1\}$ determines the sign of the number and is stored with one bit. Depending on the datatype, a choice of $m$ bits is devoted to the mantissa $f$, and $e$ bits are devoted to the exponent $p$. The datatype thus determined requires $1 + e + m$ bits to represent a float, and is denoted EeMm. For instance, FP32 is E8M23, FP16 is E5M10, whereas BF16 is E8M7.
Quantization is the operation that maps numbers represented in a given datatype to numbers in another datatype requiring fewer bits. Thus quantization is a compression mechanism that reduces storage and communication footprint, and increases compute throughput. The efficiency gained via compression is traded off against degradation in accuracy in one form or another.
In practice one quantizes a set of numbers together. For example, to quantize real numbers $x_1, \ldots, x_n$ into $b$-bit integers, we can choose the mapping $x \mapsto \mathrm{round}(x/s)$ with $s = \max_j |x_j| / (2^{b-1} - 1)$. In particular the scale factor $s$ is chosen so that the maximum element of $\{|x_1|, \ldots, |x_n|\}$ gets mapped to $2^{b-1} - 1$. Since the mapping is many-to-one we can only hope to dequantize approximately, with error. More precisely, given an integer $q$ we can map it to $sq$, and this introduces an error $|x - s \cdot \mathrm{round}(x/s)| \le s/2$.
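The scheme above can be sketched for symmetric $b$-bit integer quantization (hypothetical helper names):

```python
def int_quantize(xs, b=8):
    """Symmetric b-bit integer quantization: the scale is chosen so the
    max-|x| element maps to 2**(b-1) - 1; results are clamped to the int range."""
    qmax = 2 ** (b - 1) - 1
    s = max(abs(x) for x in xs) / qmax
    q = [max(-qmax - 1, min(qmax, round(x / s))) for x in xs]
    return s, q

def int_dequantize(s, q):
    """Approximate inverse: map each integer back to s * q."""
    return [s * qi for qi in q]
```

After a round trip each element is recovered to within half a quantization step, $s/2$.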
In general, given a pair of quantization and dequantization maps $(Q, D)$, one can measure the error $|x - D(Q(x))|$.
Training in NVFP4
Based on the papers and documentation I can find, here is a sketch of training LLMs in NVFP4.
Like FP4 and MXFP4, NVFP4 has the bit structure E2M1. The distinction lies in how NVFP4 represents a set of numbers, which we refer to as a tensor. In particular, NVFP4 partitions the tensor into subsets of $16$ numbers each, called blocks. Each block is associated with an 8-bit E4M3 number $s$ called the block scale factor, such that each one of the 4-bit numbers $b_i$ belonging to the same block is reconstructed as $s b_i$. Additionally, an FP32 number is associated with the tensor itself, called the tensor scale factor.
By contrast, each MXFP4 partition consists of $32$ numbers, and its block scale factor is E8M0 (i.e. rounds to the nearest power of two). It can be shown that the expected square error with E8M0 is larger than that of E4M3, with the trade-off being that E8M0 has less overhead. The said partition and scaling in NVFP4 is handled by specialized tensor core hardware.
Given that E2M1 and E4M3 can represent numbers with maximum absolute values of $6$ and $448$ respectively, for a tensor $(x_i)_{i \in I}$ indexed by a set $I$, the tensor dequantization scale is $s_{\mathrm{tensor}} = \frac{\max_{i \in I} |x_i|}{448 \times 6}$ and is stored in FP32. Let $J \subseteq I$ be an indexing set for a block in the tensor; the corresponding block scale factor is $s_J = \frac{\max_{j \in J} |x_j|}{6}$. In fact, the block dequantization scale factor is stored in FP8 on the tensor core as $\hat{s}_J = Q_{\mathrm{E4M3}}\!\left(s_J / s_{\mathrm{tensor}}\right)$.
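A software sketch of the two-level scaling. Real hardware would round the block scale to E4M3; here it is kept exact, and the function name is made up:

```python
def nvfp4_scales(tensor, block=16):
    """Two-level NVFP4 scaling sketch: an FP32 tensor scale amax/(448*6),
    plus a per-block scale amax_block/6 divided by the tensor scale
    (which the E4M3 storage step, omitted here, would then round)."""
    amax = max(abs(x) for x in tensor)
    s_tensor = amax / (448 * 6)
    blocks = [tensor[i:i + block] for i in range(0, len(tensor), block)]
    s_blocks = [max(abs(x) for x in b) / 6 / s_tensor for b in blocks]
    return s_tensor, s_blocks
```

By construction the block holding the global maximum gets scale exactly $448$, the E4M3 maximum, which is the point of the tensor-level scale.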
Each block gets quantized as $q_j = Q_{\mathrm{E2M1}}\!\left(x_j / (\hat{s}_J \, s_{\mathrm{tensor}})\right)$ for $j \in J$, and the partial dot product during the GEMM is computed as $\hat{s}_J \, \hat{s}'_K \sum_j q_j q'_j$, where $(q'_j)$ and $\hat{s}'_K$ are the quantized block values and block scale of the other operand. After the GEMM, the tensor dequantization scales are applied.
There are experiments showing that NVFP4 should be used in the earlier layers of a transformer (in the forward pass direction), while keeping the later layers in higher precision.
Random Hadamard Transform
Hadamard matrices $H$ of dimension $2^r$ for an integer $r$ satisfy $H H^\top = 2^r I$ and $H_{ij} \in \{+1, -1\}$. We shall consider a randomized Hadamard matrix $HD$, where $D$ is a diagonal matrix of values $\pm 1$ chosen uniformly at random. In training, instead of operating on tensors directly, one applies the above NVFP4 quantization to the random-Hadamard-transformed tiles of the tensor. In some experiments $HD$ is applied to the inputs of the weight gradient GEMM.
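A sketch of the Sylvester construction and the $HD$ transform (unnormalized, so lengths grow by $2^{r/2}$; the seed handling and helper names are arbitrary):

```python
import random

def hadamard(r):
    """Sylvester construction of the 2**r Hadamard matrix:
    H_{2n} = [[H_n, H_n], [H_n, -H_n]], entries +/-1, H H^T = 2**r * I."""
    H = [[1]]
    for _ in range(r):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def random_hadamard_apply(x, seed=0):
    """Apply H D to vector x, where D is a random +/-1 diagonal."""
    r = (len(x) - 1).bit_length()
    assert len(x) == 2 ** r, "length must be a power of two"
    rng = random.Random(seed)
    d = [rng.choice([-1, 1]) for _ in range(len(x))]
    xd = [di * xi for di, xi in zip(d, x)]
    H = hadamard(r)
    return [sum(hij * xj for hij, xj in zip(row, xd)) for row in H]
```

Since both $H$ and $D$ are orthogonal up to scaling, $\|HDx\|^2 = 2^r \|x\|^2$ regardless of the random signs: the transform mixes coordinates (spreading outliers across the block) without changing the energy.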
Xue J. Zhao © 2026