The first strategy is a kernel fusion of the input gather (performed during token routing) with the input load from global memory (HBM) into shared memory (SMEM). In this phase the token indices are fetched first, and the activations at those indices are then copied via the cp.async PTX instruction. The authors of SonicMoE note that on the Blackwell architecture, the 2-CTA GEMM requires the leader CTA to wait for the gather to complete in both CTAs, which leads to the following pipeline structure in each of the two CTAs: 1 warp fetches token indices, 4 warps perform the gather, and 1 warp relays the completion signal and issues the MMA.
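Semantically, the fused gather-and-load produces the same result as an explicit index-then-copy. The NumPy sketch below (all shapes and names are illustrative, not from the kernel) shows the two logical steps that the fused kernel overlaps with cp.async:

```python
import numpy as np

# Hypothetical sizes: 8 routed tokens out of 32, hidden dim 16.
rng = np.random.default_rng(0)
activations = rng.standard_normal((32, 16)).astype(np.float32)  # resides in HBM
token_indices = rng.choice(32, size=8, replace=False)           # routing result

# Unfused view: step 1 fetches the token indices, step 2 copies the
# selected rows into an SMEM tile (modeled here as a fresh buffer).
# The fused kernel overlaps these steps via cp.async instead of
# materializing a separate gathered buffer in HBM first.
smem_tile = activations[token_indices]  # gather + load in one expression

assert smem_tile.shape == (8, 16)
```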
The second fusion occurs in the epilogue of the MoE layer's GEMMs. Specifically, the SwiGLU activation is fused with the up-projection in the forward pass, while the SwiGLU backward is fused with the down-projection gradient computation.
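As a reference for what gets fused into these epilogues, here is a plain NumPy version of the SwiGLU forward and backward; the function names and the gate/up split convention are assumptions for illustration, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swiglu_fwd(gate, up):
    # SwiGLU: silu(gate) * up, applied elementwise to the two halves
    # of the up-projection output (fused into its GEMM epilogue).
    return gate * sigmoid(gate) * up

def swiglu_bwd(gate, up, d_out):
    # Gradients of SwiGLU w.r.t. its inputs, as fused with the
    # down-projection gradient computation in the backward pass.
    s = sigmoid(gate)
    d_gate = d_out * up * s * (1.0 + gate * (1.0 - s))  # d silu(gate)/d gate
    d_up = d_out * gate * s                              # silu(gate)
    return d_gate, d_up

# Finite-difference check of the gate gradient.
rng = np.random.default_rng(1)
gate, up = rng.standard_normal(5), rng.standard_normal(5)
d_gate, _ = swiglu_bwd(gate, up, np.ones(5))
eps = 1e-6
numeric = (swiglu_fwd(gate + eps, up) - swiglu_fwd(gate - eps, up)) / (2 * eps)
assert np.allclose(d_gate, numeric, atol=1e-4)
```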
The third fusion involves the operations associated with the routing scores and the down-projection.
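A common form of this fusion in MoE kernels is to multiply each down-projection output row by its routing score while scattering it back into the token-ordered output buffer; the sketch below assumes that pairing (every name and shape here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
num_tokens, hidden = 6, 4
expert_out = rng.standard_normal((6, hidden))  # down-projection output, expert order
scores = rng.random(6)                         # routing score per routed row
token_indices = np.array([3, 0, 5, 0, 2, 1])   # destination token for each row

out = np.zeros((num_tokens, hidden))
# Fused scale-and-scatter: weight each row by its score and
# scatter-add it back to token order in a single pass, rather than
# launching separate scale and scatter kernels.
np.add.at(out, token_indices, scores[:, None] * expert_out)
```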