Programming Blackwell Tensor Cores with CUTLASS and the CuTe DSL
The CuTe domain-specific language (DSL) provides a Python interface for implementing efficient matrix multiplication GPU kernels using a tile-based abstraction. To gain a clear understanding of how the DSL maps onto highly efficient kernels, we examine hardware mechanisms, memory structures, CUDA programming patterns, and the DSL abstractions. In particular, the DSL offers Blackwell-specific support in the cutlass.cute.nvgpu.tcgen05 module. Since our discussion of this module will not shy away from hardware, we will also take excursions into Blackwell tensor core programming at the more fundamental parallel thread execution (PTX) virtual instruction set level.
Fundamentals
A cooperative thread array (CTA) is a set of threads that execute a kernel concurrently or in parallel. A warp is a maximal subset of the threads of a CTA that can execute the same instruction simultaneously; on common hardware, a warp consists of 32 threads. Threads in a CTA can be organized into a 1D, 2D, or 3D block of dimensions (nx, ny, nz), with each thread indexed by a tuple (tx, ty, tz) where 0 <= tx < nx, 0 <= ty < ny, and 0 <= tz < nz. If ny = nz = 1, then the CTA is a 1D block, and so forth. One can enumerate the threads in a CTA by the linear thread ID tx + ty*nx + tz*nx*ny. Every four consecutive warps whose beginning thread has a thread ID that is a multiple of 128 is called a warp group. The warps in a warp group are assigned a warp rank from 0 to 3.
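The enumeration above can be sketched in plain Python. These helpers are illustrative, not part of any CUDA API:

```python
# Illustrative helpers (not a CUDA API): map a 3D thread index in a CTA of
# shape (nx, ny, nz) to its linear thread ID, then to warp and warp-group ranks.
WARP_SIZE = 32          # threads per warp on common hardware
WARP_GROUP_SIZE = 128   # four consecutive warps form a warp group

def thread_id(tx, ty, tz, nx, ny):
    # Threads are linearized x-fastest: tid = tx + ty*nx + tz*nx*ny
    return tx + ty * nx + tz * nx * ny

def warp_id(tid):
    return tid // WARP_SIZE

def warp_rank_in_group(tid):
    # Warp groups start at thread IDs that are multiples of 128;
    # within a group the four warps have ranks 0..3.
    return (tid % WARP_GROUP_SIZE) // WARP_SIZE

# Example: thread (5, 2, 0) in a (32, 4, 1) CTA
tid = thread_id(5, 2, 0, 32, 4)   # 5 + 2*32 = 69
print(tid, warp_id(tid), warp_rank_in_group(tid))  # 69 2 2
```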
The occupancy of a kernel is the ratio of active warps to the maximum warps supported by a streaming multiprocessor (SM).
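As a quick numeric sketch of this definition (the per-SM limits below are assumed for illustration; query the real limits at runtime, e.g. via the CUDA occupancy API):

```python
# Occupancy = active warps per SM / maximum warps supported by the SM.
max_warps_per_sm = 64          # hardware limit (assumed for this sketch)
threads_per_block = 256
blocks_per_sm = 4              # resident blocks, limited by registers / shared memory
active_warps = blocks_per_sm * threads_per_block // 32
occupancy = active_warps / max_warps_per_sm
print(occupancy)  # 32 / 64 = 0.5
```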
Threads in a CTA have access to a memory space called shared memory, which is structured into banks to facilitate parallel access by a warp. When different addresses in the same shared memory bank are accessed by multiple threads of a warp, the accesses can no longer proceed in parallel as a single access and must be serialized; this situation is described as a bank conflict.
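A small Python model makes the bank-conflict rule concrete. The bank geometry below (32 banks, 4-byte words) is the common configuration on NVIDIA GPUs; treat it as an assumption for this sketch:

```python
# Model of shared memory banking: 32 banks, each 4 bytes wide (assumed).
# Two threads of a warp conflict when they touch *different* 4-byte words
# that map to the same bank.
NUM_BANKS = 32
WORD_BYTES = 4

def bank_of(byte_addr):
    return (byte_addr // WORD_BYTES) % NUM_BANKS

def max_conflict_degree(byte_addrs):
    # Count distinct words per bank across a warp's accesses;
    # the serialization factor is set by the worst bank.
    words_per_bank = {}
    for a in byte_addrs:
        words_per_bank.setdefault(bank_of(a), set()).add(a // WORD_BYTES)
    return max(len(w) for w in words_per_bank.values())

# Stride-1 float access: one word per bank, fully parallel.
print(max_conflict_degree([4 * t for t in range(32)]))   # 1
# Stride-2 float access: two words per bank, 2-way conflict.
print(max_conflict_degree([8 * t for t in range(32)]))   # 2
```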
Global memory is accessed in units of 32-byte sectors. Thus, to achieve a ratio of bytes used to bytes accessed close to 1, we want to coalesce global memory accesses so that the threads of a warp touch contiguous addresses.
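This efficiency ratio can be computed directly. The 32-byte sector size below is typical of recent NVIDIA GPUs and is an assumption of this sketch:

```python
# Efficiency = bytes the warp actually uses / bytes in the sectors it touches.
# Coalesced access drives this ratio toward 1.
SECTOR_BYTES = 32  # assumed sector size

def efficiency(byte_addrs, bytes_per_access):
    sectors = {a // SECTOR_BYTES for a in byte_addrs}
    used = len(byte_addrs) * bytes_per_access
    return used / (len(sectors) * SECTOR_BYTES)

# 32 threads reading consecutive 4-byte floats: 128 bytes over 4 sectors -> 1.0
print(efficiency([4 * t for t in range(32)], 4))
# Same threads with a stride of 32 floats: one sector per thread -> 0.125
print(efficiency([128 * t for t in range(32)], 4))
```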
Async Pipeline
When solving a problem like matrix multiplication on a GPU, the program first loads a small matrix block (or a tile of data, as it is often referred to in CUTLASS terminology) from global memory to shared memory, then performs a block matrix multiply on the loaded data. This follows a so-called producer-consumer pattern, and an important performance optimization is to leverage asynchrony to overlap load, multiply, and store. The CuTe DSL exposes this pattern in PipelineAsync, PipelineProducer, and PipelineConsumer.
The producer must wait (a blocking operation) for an opportunity to write data into memory that the consumer can then use, and must signal to the consumer that the data is ready. The consumer, in turn, must wait (a blocking operation) for the producer's data before doing its work, and afterwards release the memory for the producer to reuse.
In the above pattern, the producer waits for its data to be consumed before doing more work. This is a serialization point that leads to suboptimal performance. Additional parallelism can be introduced with a multi-stage async pipeline that juggles multiple buffers, each with an associated memory barrier. Some relevant APIs are
# advance producer write index to the next buffer
PipelineProducer.advance()
# current buffer slot index
PipelineProducer.ImmutableResourceHandle.index
# advance consumer read index to the next buffer
PipelineConsumer.advance()
# current buffer slot index
PipelineConsumer.ImmutableResourceHandle.index
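The index juggling behind these APIs can be modeled in pure Python. This is a sketch of the mechanism, not the CuTe DSL implementation: each side keeps a stage index plus a phase bit that flips on wrap-around, which is how the barriers distinguish a buffer filled in the current round from one filled in the previous round.

```python
# Pure-Python model of multi-stage pipeline state (illustrative, not the DSL).
class PipelineState:
    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.index = 0   # current buffer slot
        self.phase = 0   # flips each time we wrap past the last stage

    def advance(self):
        self.index += 1
        if self.index == self.num_stages:
            self.index = 0
            self.phase ^= 1

producer = PipelineState(num_stages=3)
visited = []
for _ in range(5):
    visited.append((producer.index, producer.phase))
    producer.advance()
print(visited)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1)]
```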
The stage specification works as follows
pipeline = cutlass.pipeline.PipelineAsync.create(
    num_stages=stages,  # multi-stage
    producer_group=producer_group,
    consumer_group=consumer_group,
    ...
)
Tensor Memory
For each cooperative thread array (CTA), tensor memory is structured as a matrix of 128 x 512 cells of 32 bits each. Each of the 128 rows of tensor memory is also referred to as a lane, which is 2 KB in size. Thus the lanes are in bijective correspondence with the 128 threads in a warp group. In fact, the lanes are divided between the warps, so that the warp of rank r can only access lanes 32r through 32r + 31 using the tcgen05 ld (load), st (store), and cp (copy) instructions. Each warp can access all columns within its accessible lanes. Tensor memory is allocated and deallocated with the alloc and dealloc instructions in units of columns, and the number of allocated columns must be a power of 2. When a column is allocated, each of its lanes is allocated.
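The geometry just described can be written down as a small Python model. The 32-column allocation minimum is an assumption based on current hardware; the rest restates the layout above:

```python
# Model of tensor memory geometry: 128 lanes x 512 columns of 32-bit cells
# per CTA, with lanes partitioned among the four warps of a warp group.
LANES = 128
COLUMNS = 512
CELL_BYTES = 4
LANES_PER_WARP = LANES // 4  # 32

def lanes_for_warp_rank(r):
    # Warp rank r accesses lanes [32r, 32r + 31].
    return range(LANES_PER_WARP * r, LANES_PER_WARP * (r + 1))

def alloc_columns(requested):
    # Allocation is in columns; the count must be a power of two
    # (minimum 32 columns assumed here).
    cols = 32
    while cols < requested:
        cols *= 2
    assert cols <= COLUMNS
    return cols

print(CELL_BYTES * COLUMNS)              # bytes per lane: 2048 (2 KB)
print(list(lanes_for_warp_rank(1))[:2])  # [32, 33]
print(alloc_columns(40))                 # rounds up to 64
```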
Tensor Memory Accelerator
The instruction cp.async.bulk.tensor copies tensors asynchronously between different kinds of memory (e.g., from global to shared memory and back). This is carried out by a hardware unit called the tensor memory accelerator (TMA). CUDA exposes this instruction via cuda::memcpy_async, which can be issued by a single thread within a warp, with cuda::barrier as the synchronization mechanism. The PTX instruction also allows copying from global memory to the shared memory of multiple CTAs in a cluster; this so-called multicasting is requested with the instruction modifier .multicast::cluster. One can even copy to distributed shared memory.
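The multicast variant takes a 16-bit CTA mask selecting which CTAs of the cluster receive the tile. The decoding below is a plain-Python illustration of how such a mask is read, not a driver or PTX API:

```python
# Illustrative decoding of a 16-bit multicast CTA mask: bit i set means
# CTA rank i in the cluster receives a copy of the loaded tile.
def multicast_targets(cta_mask):
    return [rank for rank in range(16) if (cta_mask >> rank) & 1]

# Broadcast a tile to CTA ranks 0, 1, and 3 of the cluster.
print(multicast_targets(0b1011))  # [0, 1, 3]
```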
Matrix Multiplication Instructions
Matrix multiply and accumulate (MMA) operations are executed by the tcgen05.mma instruction, which depends on a number of launch configurations: the data types of the input pair of matrices and their shapes, the data type of the accumulation or output matrix, support for warp specialization, whether one or two CTAs cooperate, and whether the matmul is dense or sparse. For instance, a 2-CTA FP16 dense MMA can have a largest shape of M x N x K = 256 x 256 x 16, while an NVFP4 dense MMA has a largest shape of 256 x 256 x 64.
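A quick back-of-the-envelope on what one such instruction performs. The tile shapes used here are typical large-tile examples (consult the PTX ISA for the exact shapes each kind supports); a dense M x N x K multiply-accumulate does 2*M*N*K floating point operations:

```python
# FLOPs per dense MMA tile: each of the M*N outputs accumulates K
# multiply-add pairs, i.e. 2*M*N*K operations.
def mma_flops(m, n, k):
    return 2 * m * n * k

print(mma_flops(256, 256, 16))  # example FP16 2-CTA tile: 2097152 FLOPs
print(mma_flops(256, 256, 64))  # example NVFP4 tile: 8388608 FLOPs
```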
MMA configurations are stored in registers as a 32-bit instruction descriptor. For example, bits 13 to 16 describe whether the input matrices need to be negated or transposed. Other data stored includes sparsity and the input and output data types.
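Bit fields like these are extracted with ordinary shifts and masks. The exact bit assignment below (13: negate A, 14: negate B, 15: transpose A, 16: transpose B) is an assumption of this sketch; consult the PTX ISA for the authoritative encoding:

```python
# Decode the negate/transpose flags from a 32-bit instruction descriptor.
# Bit positions are assumed for illustration -- check the PTX ISA.
def decode_flags(desc):
    return {
        "negate_a":    bool((desc >> 13) & 1),
        "negate_b":    bool((desc >> 14) & 1),
        "transpose_a": bool((desc >> 15) & 1),
        "transpose_b": bool((desc >> 16) & 1),
    }

desc = (1 << 15) | (1 << 13)  # negate A and transpose A
print(decode_flags(desc))
```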
MMA supports block-scaled matrix multiplication for data types such as mxf4 and nvf4.
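What "block scaled" means can be shown in scalar Python: each block of K consecutive elements along the reduction dimension shares one scale factor, so a dot product becomes a sum of scaled block dot products. This is a sketch of the arithmetic, not of the hardware's quantized formats:

```python
# Block-scaled dot product: a, b hold quantized values; sa, sb hold one
# scale per block of `block` consecutive elements.
def block_scaled_dot(a, sa, b, sb, block):
    total = 0.0
    for i in range(0, len(a), block):
        j = i // block
        part = sum(x * y for x, y in zip(a[i:i+block], b[i:i+block]))
        total += sa[j] * sb[j] * part
    return total

# Two blocks of 4, with scales 0.5 and 2.0 on a, unit scales on b.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1] * 8
print(block_scaled_dot(a, [0.5, 2.0], b, [1.0, 1.0], 4))  # 0.5*10 + 2.0*26 = 57.0
```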
Granularity and Synchronization
Tensor core generation 5 (tcgen05) instructions such as mma and cp can be issued by a single thread within a CTA or CTA pair. Instructions such as alloc and dealloc can be issued from a single warp, or from a pair of warps, one in each CTA of a CTA pair. Each warp can access a quarter of tensor memory via ld and st, so a warp group is required for complete tensor memory access. These rules characterize the instruction issuing granularity.
Instructions can be synchronous (alloc, dealloc, fence) or asynchronous (cp, mma, ld, st). Special pairs of asynchronous instructions, called pipelined instructions, are guaranteed to execute in the order they are issued; an example is tcgen05.cp.cta_group::N followed by tcgen05.mma.cta_group::N for the same N.
CuTe DSL
# useful methods in cutlass.cute.nvgpu
from cutlass.cute.nvgpu import (
    CopyUniversalOp, MmaUniversalOp,
    make_tiled_tma_atom_A, make_tiled_tma_atom_B,
)
# warp-level operations
from cutlass.cute.nvgpu.warp import (
    MmaF16BF16Op, MmaMXF4Op, MmaMXF4NVF4Op,  # MMA
    LdMatrix8x8x16bOp, StMatrix8x8x16bOp,  # load & store (& more for other shapes)
)
# warp group
from cutlass.cute.nvgpu.warpgroup import (
    MmaF16BF16Op,
    make_smem_layout_atom,
    fence, commit, wait_group,
)
# TMA
from cutlass.cute.nvgpu.cpasync import (
    CopyBulkTensorTileG2SOp, CopyBulkTensorTileS2GOp,  # G2S / S2G: global-to-shared and shared-to-global
    CopyBulkTensorTileG2SMulticastOp,  # global-to-shared copy with multicast
    make_tiled_tma_atom,
)
# pipelines
from cutlass.pipeline import (
    PipelineProducer, PipelineConsumer,
    PipelineAsync,  # async producer-consumer pattern
    PipelineTmaAsync,
    PipelineTmaUmma,
    PipelineAsyncUmma,
    PipelineUmmaAsync,
)
# useful utilities
from cutlass.utils import (
    SmemAllocator,
    TmemAllocator,
    HardwareInfo,
    print_latex,
    print_latex_tv,
)
# also things like cutlass.utils.sm100.make_smem_layout_a