Before delving into device code, let us give a conceptual overview of what happens during the GEMM computation.
The prologue is everything that happens before the matrix multiply instructions execute. Its most important job is to load data via TMA. This requires computing the necessary indices — block, thread, and warp IDs — which determine which output tile the MMA will compute and which data should be loaded, and constructing the TMA and MMA tensor views. The relevant shared memory and tensor memory must be allocated, and the pipelines set up: PipelineTmaUmma for the producer-consumer handoff between data loading and the MMA, and PipelineUmmaAsync for signaling accumulation completion.
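As a rough sketch, the prologue can be pictured as below. This is CUDA-flavored pseudocode: the function names (`alloc_tmem`, `make_tma_view`, and the pipeline constructor arguments) are illustrative placeholders, not the exact CUTLASS API.

```cuda
// Pseudocode sketch of the prologue (illustrative names, not exact CUTLASS API).
__device__ void prologue() {
    // 1. Indexing: which output tile does this CTA own, and what role
    //    does this warp play (TMA producer vs. MMA consumer)?
    int cta_m   = blockIdx.x;          // tile row of the output
    int cta_n   = blockIdx.y;          // tile column of the output
    int warp_id = threadIdx.x / 32;    // selects producer / consumer role

    // 2. Allocate on-chip storage: shared memory stage buffers for the
    //    A/B tiles, and tensor memory (tmem) for the accumulator.
    extern __shared__ char smem[];     // staged A/B tiles live here
    auto tmem_acc = alloc_tmem();      // hypothetical tmem allocation

    // 3. Build the tensor views the TMA and MMA will operate on,
    //    offset to this CTA's (cta_m, cta_n) tile.
    auto gA = make_tma_view(/* A, cta_m */);
    auto gB = make_tma_view(/* B, cta_n */);

    // 4. Set up the pipelines:
    //    - PipelineTmaUmma: producer-consumer between TMA loads and MMA
    //    - PipelineUmmaAsync: signals when accumulation has completed
    PipelineTmaUmma  load_pipe(/* num_stages, barriers in smem */);
    PipelineUmmaAsync acc_pipe(/* barrier in smem */);
}
```

The key point is that all of this is bookkeeping: nothing here touches the math itself, but every later stage depends on these views, buffers, and barriers being in place.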
The mainloop iteratively fetches data, issues MMAs, and accumulates across the reduction (K) dimension.
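The iteration over the reduction dimension can be sketched as follows; again this is simplified pseudocode (the pipeline method names are modeled on CUTLASS's acquire/wait/release pattern, and `tma_copy`/`umma` stand in for the real copy and MMA invocations):

```cuda
// Pseudocode: producer warp streams tiles in via TMA while the
// MMA consumer accumulates them, overlapped through a staged pipeline.
for (int k = 0; k < num_k_tiles; ++k) {
    int stage = k % num_stages;

    // Producer side: wait until stage buffer is free, then issue the
    // async TMA copies of the k-th A and B tiles into shared memory.
    load_pipe.producer_acquire(stage);
    tma_copy(gA_tile(k), smem_A[stage]);   // arrival tracked by barrier
    tma_copy(gB_tile(k), smem_B[stage]);

    // Consumer side: wait until the data has landed, then issue the
    // MMA, accumulating into tensor memory: acc += A_k * B_k.
    load_pipe.consumer_wait(stage);
    umma(smem_A[stage], smem_B[stage], tmem_acc);
    load_pipe.consumer_release(stage);     // stage may be reused
}
// Signal (via PipelineUmmaAsync) that accumulation is complete.
acc_pipe.producer_commit();
```

With multiple stages, the TMA loads for iteration k+1 overlap the MMA for iteration k, which is the whole purpose of the producer-consumer pipeline.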
The epilogue loads data from tensor memory into registers, applies any fused operations on the result matrix, and performs the necessary data conversions. This stage then deallocates the tensor memory and stores the results back to global memory.
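The epilogue steps above can be sketched like so (pseudocode; `tmem_load`, `tmem_free`, and the linear-combination epilogue are illustrative assumptions, not the exact API):

```cuda
// Pseudocode sketch of the epilogue.
// 1. Wait on PipelineUmmaAsync for the accumulation to finish.
acc_pipe.consumer_wait();

// 2. Move the accumulator from tensor memory into registers.
auto acc_frag = tmem_load(tmem_acc);            // tmem -> register fragment

// 3. Fused operation and data conversion, e.g. a classic
//    D = alpha * acc + beta * C epilogue, narrowing to the output type.
auto d_frag = convert_to<OutputT>(alpha * acc_frag + beta * c_frag);

// 4. Release tensor memory and write results back to global memory
//    (possibly staged through shared memory for coalesced stores).
tmem_free(tmem_acc);
store_global(D_tile(cta_m, cta_n), d_frag);
```

Freeing tensor memory promptly matters because it is a scarce per-SM resource shared among resident CTAs.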