Consider how a real number $x$ can be represented on a computer. If $x \neq 0$ then $x = s|x|$ where $s \in \{+1, -1\}$ is the sign of $x$ and can be represented with one bit. For the positive number $|x|$ there exists a unique integer $e$ such that $2^e \leq |x| < 2^{e+1}$. Now we can divide this interval into two equal halves. By definition of $e$ we know that $|x|$ lies in either the left half or the right half (to break the tie, if $|x|$ lies in the middle, then we say it lies in the right half). We can again divide that half into two equal halves, and so on: at each stage of the sequence of interval partitions, the positive real number $|x|$ lies in either the left or the right half. Associating left with $0$ and right with $1$, we obtain a binary sequence $b_1 b_2 b_3 \ldots$. It is clear that after the $n$-th partition the situation is $|x| = 2^e\bigl(1 + \sum_{i=1}^{n} b_i 2^{-i}\bigr) + \epsilon_n$, where $\epsilon_n$ is less than the length $2^{e-n}$ of the half interval in the $n$-th partition. So given the positive real number $|x|$ and a desired precision $\delta > 0$, we can choose an integer $n$ such that $2^{e-n} < \delta$, so that the number $2^e\bigl(1 + \sum_{i=1}^{n} b_i 2^{-i}\bigr)$, which is representable by a binary sequence of length $n$, approximates $|x|$ to within $\delta$, namely $\bigl| |x| - 2^e\bigl(1 + \sum_{i=1}^{n} b_i 2^{-i}\bigr) \bigr| < \delta$. Since $x = s|x|$ we can write $x \approx s \cdot 2^e\bigl(1 + \sum_{i=1}^{n} b_i 2^{-i}\bigr)$. Hardware approximates $x$ by storing one bit for the sign of $x$, another $m$ bits $b_1 \ldots b_m$ called the mantissa of $x$, the bit length $m$ of which determines an upper bound on the error of approximating $|x|$, together with a representation of the exponent $e$.
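The halving scheme above can be sketched in a few lines of Python. This is a minimal illustration, not how hardware works; the function name and the use of `math.log2` to locate the exponent are my own choices:

```python
import math

def approximate_positive(x: float, n: int):
    """Approximate a positive real x by 2**e * (1 + sum_i b_i * 2**-i),
    refining the interval [2**e, 2**(e+1)) by n successive halvings."""
    assert x > 0
    e = math.floor(math.log2(x))           # unique e with 2**e <= x < 2**(e+1)
    lo, hi = 2.0**e, 2.0**(e + 1)
    bits = []
    for _ in range(n):
        mid = (lo + hi) / 2
        if x >= mid:                        # ties go to the right half
            bits.append(1)
            lo = mid
        else:
            bits.append(0)
            hi = mid
    return e, bits, lo                      # lo == 2**e * (1 + sum b_i 2**-i)

e, bits, approx = approximate_positive(math.pi, 10)
print(e, bits, abs(math.pi - approx))       # error is below 2**(e - 10)
```

After $n$ halvings the lower endpoint of the interval is exactly the truncated expansion, so the error is below the interval length $2^{e-n}$.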
Now we need a scheme for representing the integer $e$ in bits. Suppose that $k$ bits are designated to store the exponent; then there are $2^k$ configurations, meaning that the real numbers the float can approximate span $2^k$ orders of magnitude. Naturally we want to choose orders of magnitude about the multiplicative unit $2^0 = 1$. Customarily we designate $2^{k-1}$ of the orders of magnitude to be at or below the unit, and another $2^{k-1}$ orders of magnitude to be above the unit. The hardware implementation is to store the integer exponent $e$ with a non-negative integer $E = e + (2^{k-1} - 1)$; this way the smallest order of magnitude (with respect to this dtype) is stored as $E = 0$ whereas the largest order of magnitude is stored as $E = 2^k - 1$. This scheme is nothing but a bijection from $\{0, 1, \ldots, 2^k - 1\}$ to $\{1 - 2^{k-1}, \ldots, 2^{k-1}\}$. The quantity $2^{k-1} - 1$ is sometimes given the nondescript name bias just to be confusing! The mantissa $M$, which we have until now defined as the bit string $b_1 b_2 \ldots b_m$, should really be identified with the real number $M = \sum_{i=1}^{m} b_i 2^{-i}$ that it represents, and together with $E$ and the sign $s$ it represents the real number $x$ (approximately) on hardware as $x \approx s \cdot 2^{E - (2^{k-1} - 1)} (1 + M)$.
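A small Python sketch of decoding a sign bit, biased exponent, and mantissa bit string back into a real number, under the scheme just described (subnormals and special values are ignored here; the function name is my own):

```python
def decode(sign_bit: int, E: int, mantissa_bits: list, k: int) -> float:
    """Decode (sign bit, biased exponent E, mantissa bits) with a k-bit
    exponent field into s * 2**(E - bias) * (1 + M)."""
    bias = 2**(k - 1) - 1
    e = E - bias
    M = sum(b * 2.0**-(i + 1) for i, b in enumerate(mantissa_bits))
    s = -1.0 if sign_bit else 1.0
    return s * 2.0**e * (1.0 + M)

# 0.15625 = 2**-3 * 1.25: sign 0, e = -3 so E = bias - 3, mantissa 010
print(decode(0, (2**7 - 1) - 3, [0, 1, 0], 8))   # with a k = 8 exponent field
```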
The above analysis applies to $x \neq 0$. Since $0$ is a limit point of the set $\{2^e : e \in \mathbb{Z}\}$, there does not exist a smallest integer $e$ such that $2^e \leq 0$. Suppose that a float dtype has $m$ bits to store the mantissa; it is natural to approximate zero with the smallest element in the set of representable positive numbers, the minimum value being $2^{e_{\min}}$, which corresponds to $E = 0$ and $M = 00\ldots0$. Equivalently zero can be approximated by the maximum of the set consisting of the negative representable numbers, namely $-2^{e_{\min}}$. Therefore we have two representations of zero by the dtype. Now this argument is true to first order, and needs slight modification when we introduce the so-called subnormal numbers in practical floating point implementations. The idea is that $2^{e_{\min}}$ is not the smallest order of magnitude we can represent if in our implementation we customarily agree that when $E = 0$ we replace $2^{e_{\min}}(1 + M)$ by $2^{e_{\min}}(0 + M)$, where $e_{\min} = 1 - (2^{k-1} - 1)$; observe that $2^{e_{\min}}$ is the smallest normal order of magnitude. If the mantissa has $m$ bits, then the smallest subnormal number is $2^{e_{\min}} \cdot 2^{-m}$ and the largest subnormal is $2^{e_{\min}}(1 - 2^{-m})$, corresponding with mantissa $00\ldots01$ and $11\ldots1$ respectively. The case with $M = 00\ldots0$ is reserved to represent zero. Observe that such smaller-than-$2^{e_{\min}}$ magnitudes are only expressible near zero, with subnormal numbers. That is, for a general real number $x$ in the normal range, the upper bound on the relative error of approximation is still $2^{-m}$, even though for numbers below the smallest normal $2^{e_{\min}}$ the error of approximation has absolute bound $2^{e_{\min} - m}$.
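The boundary values above are easy to compute for a concrete layout; a short sketch (helper name is mine), checked against the E5M2-like layout with $k = 5$ and $m = 2$:

```python
def smallest_values(k: int, m: int):
    """Smallest positive normal and the extreme subnormal magnitudes for a
    dtype with k exponent bits and m mantissa bits (E = 0 means subnormal)."""
    bias = 2**(k - 1) - 1
    e_min = 1 - bias                                  # smallest normal exponent
    smallest_normal = 2.0**e_min                      # E = 1, mantissa 00...0
    smallest_subnormal = 2.0**e_min * 2.0**-m         # E = 0, mantissa 00...01
    largest_subnormal = 2.0**e_min * (1 - 2.0**-m)    # E = 0, mantissa 11...1
    return smallest_normal, smallest_subnormal, largest_subnormal

print(smallest_values(5, 2))  # (2**-14, 2**-16, 0.75 * 2**-14)
```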
Let us build some intuition. (1) Observe that every mantissa bit of a normal number is significant, but for a subnormal number this need not be the case. For instance for a normal number $2^e(1 + M)$ with $M = b_1 b_2 \ldots b_m$, every bit is significant. For the subnormal number with mantissa $11\ldots1$ every bit is significant, but a subnormal with mantissa $00\ldots01$ only has 1 significant bit. The leading zeros are placeholders and not significant. (2) A hand-wavy way to look at gaps between ints and floats. Let $N$ be a bit width; the gaps between consecutive Int-$N$ numbers are all the same, whereas the Float-$N$ numbers with an $m$-bit mantissa are roughly evenly spaced on a log scale, with gap sizes growing in proportion to the magnitude. This is meant in the sense that for a float $2^e(1 + M)$ the nearby representable elements are spaced $2^{e-m}$ apart. In particular these gaps are larger by a factor of $2^e$, depending on the exponent of the float.
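The gap structure can be observed directly in Python: for float64 ($m = 52$ mantissa bits), `math.ulp(x)` reports the distance from a positive `x` to the next larger representable float, which is $2^{e-52}$ for $x = 2^e(1 + M)$:

```python
import math

# Gap to the next representable float64 after x = 2**e * (1 + M) is 2**(e - 52).
for x in [1.0, 2.0, 1024.0]:
    e = math.floor(math.log2(x))
    print(x, math.ulp(x), math.ulp(x) == 2.0**(e - 52))

# Gaps grow by a factor of 2**e: around 1024 = 2**10 they are 2**10 times wider.
print(math.ulp(1024.0) / math.ulp(1.0))
```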
Let $x, y$ be non-negative integers. We say that a floating point data type has bit structure ExMy if every float is represented (in hardware or software) as 1 sign bit together with $x$ exponent bits and $y$ mantissa bits. It is sometimes the convention that if $y = 0$ then the number has no sign bit.
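A tiny parser for this naming scheme (the function name is mine, and the sign-bit rule below encodes the stated convention, which is an assumption rather than a universal standard):

```python
import re

def parse_bit_structure(name: str):
    """Parse an 'ExMy' dtype name into (exponent bits, mantissa bits,
    sign bits), assuming the convention that y = 0 implies no sign bit."""
    x, y = map(int, re.fullmatch(r"E(\d+)M(\d+)", name).groups())
    sign_bits = 0 if y == 0 else 1
    return x, y, sign_bits

print(parse_bit_structure("E4M3"))  # (4, 3, 1) -- 8 bits total
print(parse_bit_structure("E8M0"))  # (8, 0, 0) -- 8 bits, no sign
```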
Since any hardware floating point number consists of a finite number of bits, the float has a limited range of expressible orders of magnitude. For numbers outside this range, a data type sometimes designates bit patterns for the sign, mantissa, and exponent to represent $\pm\infty$. However this is optional; for example in the so-called microscaled (MX) classes of floats, infinities are excluded in E4M3 MXFP8 but are included in E5M2 MXFP8. Another special value for a float is NaN, or not-a-number, which designates the result of undefined operations such as $0/0$. How to implement NaN for a given float is a choice. In some designs, NaN is omitted, while in others only one bit pattern (up to equivalence in sign) denotes NaN, and in yet others, multiple such patterns are treated as NaN (compare E2M3, E4M3, E5M2 microscaled floats).
The aforementioned microscaled floats work as follows: instead of representing one real number at a time, MX floats represent collections of real numbers simultaneously. Suppose $d \in \mathbb{N}$, with $d > 1$. Let $X$ be the set of all scale floats (for instance E8M0), and let $F$ be the set of all element floats (for instance E4M3). The set $X \times F^d$ is called the $d$-dimensional microscaling (MX) float with scales of type $X$ and blocks of type $F$. An element $(s, (f_1, \ldots, f_d)) \in X \times F^d$ is associated with the vector $(s f_1, \ldots, s f_d)$. There are dtype specific rules, such as that $s f_i$ is NaN for all $i$ when $s$ is NaN, and implementation defined rules for situations such as when $s f_i$ falls outside the representable range. Within microscaled data types there are floats like MXFP4, where scales are of type E8M0 and elements of type E2M1 and the dimension is $d = 32$, together with MX integers like MXINT8. Encodings for the mantissa are somewhat different than our discussion so far, in the sense that the mantissa has $2^{-m}$ factored out ($m$ being the number of mantissa bits), so that $M = \sum_{i=1}^{m} b_i 2^{-i} = 2^{-m} \sum_{i=1}^{m} b_i 2^{m-i}$, and the unsigned integer $\sum_{i=1}^{m} b_i 2^{m-i}$ is stored as opposed to $\sum_{i=1}^{m} b_i 2^{-i}$. In this scheme a normal number has $E > 0$ and value $2^{E - \mathrm{bias}}\bigl(1 + 2^{-m} M\bigr)$ and a subnormal has $E = 0$ and value $2^{1 - \mathrm{bias}} \cdot 2^{-m} M$.
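The integer-mantissa encoding can be sketched for the unsigned magnitude of an element (function name is mine); with $k = 2$ and $m = 1$ it enumerates exactly the positive magnitudes of E2M1, the MXFP4 element type:

```python
def element_value(E: int, M: int, k: int, m: int) -> float:
    """Magnitude of an element with biased exponent E and mantissa stored
    as the unsigned integer M, for k exponent bits and m mantissa bits."""
    bias = 2**(k - 1) - 1
    if E > 0:                                        # normal: implicit leading 1
        return 2.0**(E - bias) * (1 + M * 2.0**-m)
    return 2.0**(1 - bias) * (M * 2.0**-m)           # subnormal (E == 0)

# E2M1: k = 2, m = 1 gives the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6
vals = [element_value(E, M, 2, 1) for E in range(4) for M in range(2)]
print(vals)
```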
The dot product of two MX floats of the same dtype corresponds to the dot product of the vectors they represent: for every $(s, f) \in X \times F^d$ and every $(t, g) \in X \times F^d$ we define their dot product to be $(s, f) \cdot (t, g) = s t \sum_{i=1}^{d} f_i g_i$. The product between an element of $X$ and an element of $F$, and likewise the product between elements of $F$ and elements of $F$, are specified in the implementation. Specification of the output type depends on the situation.
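A minimal sketch of this block dot product, with plain Python floats standing in for the scale and element dtypes (so the dtype-specific product and output-type rules are elided):

```python
def mx_dot(a, b):
    """Dot product of two MX blocks (scale, elements):
    (s, f) . (t, g) = s * t * sum_i f_i * g_i."""
    (s, f), (t, g) = a, b
    assert len(f) == len(g)
    return s * t * sum(fi * gi for fi, gi in zip(f, g))

u = (2.0, [1.0, 0.5, 1.5])
v = (0.5, [2.0, 2.0, 1.0])
print(mx_dot(u, v))  # 2.0 * 0.5 * (1*2 + 0.5*2 + 1.5*1) = 4.5
```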
Let $k \in \mathbb{N}$. More generally the dot product can be defined between every vector $u \in (X \times F^d)^k$ and every $v \in (X \times F^d)^k$ by $u \cdot v = \sum_{j=1}^{k} u_j \cdot v_j$. This dot product requires the length of the vectors to be a multiple of the number $d$ of elements per MX block. This can be relaxed to any length $n$ by padding to the nearest multiple of $d$ greater than $n$ and truncating the result back to length $n$.
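A sketch of the padding idea on plain real vectors (scales omitted for brevity; for a dot product the zero padding contributes nothing, so no truncation step is visible here):

```python
def mx_dot_padded(u: list, v: list, d: int) -> float:
    """Dot two equal-length real vectors by zero-padding to the next
    multiple of the block size d and accumulating block by block."""
    assert len(u) == len(v)
    pad = (-len(u)) % d                    # bits of padding needed
    u, v = u + [0.0] * pad, v + [0.0] * pad
    total = 0.0
    for j in range(0, len(u), d):          # one partial sum per block
        total += sum(ui * vi for ui, vi in zip(u[j:j + d], v[j:j + d]))
    return total

print(mx_dot_padded([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 2))  # 4 + 10 + 18 = 32.0
```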
To represent a vector $v \in \mathbb{R}^d$ in MX with maximal absolute value $a = \max_i |v_i|$, the corresponding scale factor is determined by $s = 2^{\lfloor \log_2 a \rfloor - \lfloor \log_2 F_{\max} \rfloor}$. In other words, the scale factor depends on $a$ and the largest normal number $F_{\max}$ for the block data type. The scale factor is the ratio of the largest power of two less than or equal to $a$ to the largest power of two less than or equal to $F_{\max}$. Observe that the scaling factor of MX is an integer power of $2$ (hence in MXFP8 the block scales are of type E8M0, which encodes exactly the integer powers of two).
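This scale-factor formula in Python (function name is mine; 448 is the largest normal of E4M3):

```python
import math

def mx_scale(v: list, f_max: float) -> float:
    """Power-of-two scale factor: ratio of the largest power of two <= max|v_i|
    to the largest power of two <= f_max, the block dtype's largest normal."""
    a = max(abs(x) for x in v)
    return 2.0**(math.floor(math.log2(a)) - math.floor(math.log2(f_max)))

# A block peaking at 100.0 with E4M3 elements: floor(log2 100) = 6,
# floor(log2 448) = 8, so the scale is 2**(6 - 8) = 0.25.
print(mx_scale([100.0, -3.0, 0.25], 448.0))
```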
Let $Q : \mathbb{R} \to F$ be the map that quantizes real numbers to $F$ with the scheme in the beginning of the blog, and which assigns values exceeding the maximal value to $F_{\max}$ and those values below the minimal value $-F_{\max}$ to $-F_{\max}$. The corresponding block values are then $f_i = Q(v_i / s)$.
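A sketch of such a quantizer as round-to-nearest with saturation (the rounding-mode and tie-breaking details are implementation choices, not taken from the source); here `e2m1` lists the non-negative magnitudes of the E2M1 element type:

```python
def quantize(x: float, values: list, v_max: float) -> float:
    """Round x to the nearest representable value, saturating beyond
    +-v_max; `values` are the non-negative representable magnitudes."""
    if x > v_max:
        return v_max
    if x < -v_max:
        return -v_max
    sign = -1.0 if x < 0 else 1.0
    return sign * min(values, key=lambda v: abs(v - abs(x)))

e2m1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
scale = 0.25                                       # a hypothetical block scale
block = [quantize(x / scale, e2m1, 6.0) for x in [0.3, -1.1, 0.05]]
print(block)  # the block values f_i = Q(v_i / s)
```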