Intro Blackwell GEMM
Let be a matrix and be a matrix. We are interested in computing the matrix product . Now hardware instruction does not compute arbitrary shaped matrix multiplication at once, so we are really interested in breaking the product into smaller chunks.
Block Matrices
We can partition into block matrices along its column direction where has dimension and has dimension with . Likewise we can partition along its row direction for a block matrix with dimension and a block matrix with dimension such that . Thus in general a matrix can be partitioned in both directions, and as for the product we can partition each matrix in both direction a suitable number of times so that is partitioned into a block matrix becomes a block matrix, and is a matrix. The shapes of these matrices are as follows such that The matrix product, in terms of block matrices, is For folks working in numerical computing, the block matrices are called matrix tiles, and say that has problem size . For convenience one usually chooses and says that the tiler size is . When realizing the block matrix multiplication on hardware, each block of is computed as a summation, which in numerical methods lingo is called a main loop where each iteration which functions to accumulate the partial products to a buffer. Each accumulation step is called a stage.
Cutlass and CuTe
The philosophy of computing matrix product on a GPU is to apply block multiplication to leverage concurrency as fully as possible with respect to the parallel units of execution supported by hardware such as the CTA, the warp, the MMA instruction, and partial products are accumulated at different memory state spaces like tensor memory, shared memory, and register memory. Eventaully, results are stored back to global memory.
Example
Consider a matrix multiplication with problem size . For the CTA level of concurrency we consider a block matrix multiplication with tile sizes . This corresponds which means is divided into blocks, each computed a sum over partial products.
To schedule this onto the GPU, we launch thread blocks. Each thread block initiates a tensor memory of size to accumulate the partial products of this same. For each partial product we load a block matrix of of size and a block matrix of of size into the shared memory (smem).
Suppose each MMA matrix multiply instruction can deal with size of , then we need to partition the smem block of and into four equal parts along the reduction dimension so to match the size of the MMA instruction.
Specifying the Problem
To compute the matrix product we first need to fix the IO data type for each of the matrices. We then need to specify the accumulator data type, this means setting the datatype for summation (math terminology) or reduction (numerics lingo) both of which refer to the operation where are entries, or blocked matrices in . A common accumulator dtype is FP32 to avoid overflow in the summation process. When the accumulator dtype differs from 's IO type then conversion is required. To produce the block matrix data we use the tensor memory accelerator to load some number of blocks to shared memory, this number is determined by the size of the smem and is sometimes called the number of TMA pipeline stages. The block matrix product is accumulated on tensor memory, and there are some number of accumulation stages, which is one in our example.
Having specified the data and IO, we turn to compute specifications. We provide the shape for the matrix multiplication instruction for 5th generation TensorCore and specify the block matrix size (aka tiler size in cutlass world) that either one or two CTAs will process at once (the number of CTAs is called the issue granularity of the instruction). Further we decide the number of threads in each CTA: to ensure full SM usage we use at least the number of threads in a warp group (128).
Host Code
We now describe the host code, which runs on the CPU, and is used to lanuch the GEMM kernel on the GPU.
The following specifies the matrix multiply and accumulate instruction to use
op = tcgen05.MmaF16BF16Op(
io_dtype,
acc_dtype,
mma_inst_shape_mnk,
tcgen05.CtaGroup.ONE, # issue granularity
tcgen05.OperandSource.SMEM,
tcgen05.OperandMajorMode.K, # matrix A reduction dimension
tcgen05.OperandMajorMode.K, # matrix B reduction dimension
)
tiled_mma = cute.make_tiled_mma(op)
 
To specify shared memory layout for the matrix . This layout has at least modes, where the first mode is the shape of the block matrix for a single MMA instruction. The next two modes describe how many times the MMA is repeated traversing the row and columns of . This layout is swizzled to avoid smem bank conflict.
a_smem_layout = cutlass.utils.blackwell_helpers.make_smem_layout_a(
tiled_mma,
mma_tiler_mnk,
a.element_type,
ab_stages,
)
a_smem_layout_one_stage = cute.select(a_smem_layout, mode=[0, 1, 2])
 
Tensor memory accelerator to load from global to shared memory state space is configured as follows
op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(tcgen05.CtaGroup.ONE)
a_tma_atom, a_tma_tensor = cute.nvgpu.make_tiled_tma_atom_A(
op,
a, # the matrix A
a_smem_layout_one_stage,
mma_tiler_mnk,
tiled_mma,
)
 
Then the kernel is lanuched, specifying the thread block and grid dimensions as appropriate.
GEMM Stages
Before delving into device code let us give a conceptual overview of what happens during the GEMM computation.
The prologue is what happens before the matrix multiply instruction occurs. The most important function is to load the data via TMA. This is done by performing the necessary indexing: block, thread, warp ID, which is useful for locating which block matrix the MMA will calculate, which data it should load, the TMA and MMA tensor view. The relevant shared and tensor memory needs to be allocated. Setting up pipelines PipelineTmaUmma for consumer-producer between data loading and MMA, PipelineUmmaAsync for signally accumulation completion.
The mainloop iteratively fetches data, computes MMA and accumulates across the reduction dimension.
The epilogue loads data from tensor memory to register, fuse operation on matrix and performs necessary data conversion. This stages deallocates tensor memory and stores results back to global memory.
Device Code
Xue J. Zhao © 2026