Mathematics of Diffusion Models
discrete diffusion
score matching
flow matching
Training Infrastructure
Megatron Core
MoE training
Systems
Insights on Rotation Based Position Embedding
context extension
YaRN
RoPE
Asynchronous RL
off-policy RL
post-training
reasoning
6D Parallelism for Distributed Training
parallelism
distributed training
infra
Backward Pass Through LLM
The math behind LLM training, and a prerequisite for designing optimized training kernels.
theory
The Fokker Planck Equation
Switching lenses between the SDE and operator views of the Fokker-Planck equation in diffusion models.
ml-theory
diffusion model
old-blog
Optimal Transportation and Diffusion Models
Switching lenses between the SDE and operator views of the Fokker-Planck equation in diffusion models.
ml-theory
diffusion model
old-blog
Low Precision LLM Pre-training with NVFP4
mixed-precision
quantization
engineering
LLM Inference Optimizations
inference
systems
optimization
Optimizers
optimizers
memory efficiency
training infra
Mamba
architecture
Programming Blackwell GPU
gpu kernels
cutlass
cute dsl
Engram and LLM Memory
emerging architecture
scaling law for memory
paper reading
Multimodal-Pretraining
multimodal foundation model
pretraining
paper reading
Blackwell GEMM
gpu kernels
cutlass
cute dsl
Time Reversal SDE in Diffusion Models
Heuristic for reversing time in a diffusion process.
ml-theory
diffusion model
sde
Matrix Calculus
Matrix derivative, Laplacian, polar body, convexity theorems
math
Operator Identities
Concerning extreme eigenvalues of some linear operators between Euclidean spaces.
math
Primal Dual Langevin Monte Carlo Algorithm
ml-theory
optimization
old-blog
Xue J. Zhao © 2026