Mathematics of Diffusion Models
Mar 24
Let us review some basic mathematical tools and ideas, so that our diffusion model and dLLM discussions can be made precise.
Discrete Time Markov Chain
Let $I$ be a countable set, called the state space, and call each $i \in I$ a state in $I$. We say that $\lambda = (\lambda_i : i \in I)$ is a measure on $I$ if $0 \le \lambda_i < \infty$ for all $i \in I$. A measure $\lambda$ on $I$ with total mass $\sum_{i \in I} \lambda_i = 1$ is called a distribution on $I$. A matrix $P = (p_{ij} : i, j \in I)$ is called stochastic if every row $(p_{ij} : j \in I)$ is a distribution.
We work with a fixed probability space $(\Omega, \mathcal{F}, \mathbb{P})$, which we will shorthand as $\Omega$ from now on. A random variable $X$ with values in $I$ is a function $X : \Omega \to I$. Put $\lambda_i = \mathbb{P}(X = i)$; then $\lambda$ defines a distribution on $I$, and we can think of $X$ as modelling a system whose probability of being in state $i$ equals $\lambda_i$.
A sequence $(X_n)_{n \ge 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix $P$ if $X_0$ has distribution $\lambda$, namely $\mathbb{P}(X_0 = i) = \lambda_i$, and for $n \ge 0$, conditioned on $X_n = i$, the distribution of $X_{n+1}$ is $(p_{ij} : j \in I)$ and is independent of $X_0, \dots, X_{n-1}$, namely $$\mathbb{P}(X_{n+1} = j \mid X_0 = i_0, \dots, X_{n-1} = i_{n-1}, X_n = i) = p_{ij}.$$ Both the above chain as well as a finite chain $(X_n)_{0 \le n \le N}$ will be denoted $\mathrm{Markov}(\lambda, P)$.
One can prove that a discrete time random process $(X_n)_{0 \le n \le N}$ is $\mathrm{Markov}(\lambda, P)$ if and only if for all $i_0, \dots, i_N \in I$, $$\mathbb{P}(X_0 = i_0, X_1 = i_1, \dots, X_N = i_N) = \lambda_{i_0} p_{i_0 i_1} \cdots p_{i_{N-1} i_N}.$$ Let $\delta_i$ be the unit mass at $i$, where $(\delta_i)_j = 1$ if $j = i$ and $(\delta_i)_j = 0$ whenever $j \ne i$. The Markov property is the observation that if $(X_n)_{n \ge 0}$ is $\mathrm{Markov}(\lambda, P)$ then conditioned on $X_m = i$, the process $(X_{m+n})_{n \ge 0}$ is $\mathrm{Markov}(\delta_i, P)$ and is independent of $X_0, \dots, X_m$.
Whenever $\mathbb{P}(X_0 = i) > 0$ we write $\mathbb{P}_i = \mathbb{P}(\cdot \mid X_0 = i)$, and by the Markov property, at time $m$ the chain $(X_{m+n})_{n \ge 0}$ is $\mathrm{Markov}(\delta_i, P)$ under $\mathbb{P}(\cdot \mid X_m = i)$, so the behaviour of the chain after time $m$ depends only on $X_m$ and is independent of $X_0, \dots, X_{m-1}$.
A random variable $T : \Omega \to \{0, 1, 2, \dots\} \cup \{\infty\}$ is called a stopping time if for all integers $n$ the event $\{T = n\}$ only depends on $X_0, \dots, X_n$. The strong Markov property is the assertion that the Markov property holds for stopping times; more precisely, if $(X_n)_{n \ge 0}$ is $\mathrm{Markov}(\lambda, P)$ and $T$ is a stopping time of the chain, then conditioned on $T < \infty$ and $X_T = i$, the process $(X_{T+n})_{n \ge 0}$ is $\mathrm{Markov}(\delta_i, P)$ and is independent of $X_0, \dots, X_T$. An example of a stopping time is the first-passage time $T_i = \inf\{n \ge 1 : X_n = i\}$, which appears in the ergodic theorem below.
Whereas $P$ specified one-step transition probabilities, we want to know the $n$-step transition probabilities. For this we first need to define countably-infinite matrix-matrix products, in the obvious way: regard $\lambda$ as a row vector, and define $$(\lambda P)_j = \sum_{i \in I} \lambda_i p_{ij}, \qquad (PQ)_{ik} = \sum_{j \in I} p_{ij} q_{jk}.$$ Define $P^0$ to be the identity matrix with $(P^0)_{ij} = \delta_{ij}$. Likewise we can define higher powers $P^n = P^{n-1} P$, whose components are denoted by $p_{ij}^{(n)}$.
Let $(X_n)_{n \ge 0}$ be $\mathrm{Markov}(\lambda, P)$. Then for all $m, n \ge 0$, $$\mathbb{P}(X_n = j) = (\lambda P^n)_j, \qquad \mathbb{P}(X_{m+n} = j \mid X_m = i) = p_{ij}^{(n)}.$$
In light of this we call $p_{ij}^{(n)}$ the $n$-step transition probability from $i$ to $j$.
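A quick numerical illustration (a minimal sketch with a made-up three-state transition matrix): simulating the chain and comparing the empirical law of $X_n$ against $\lambda P^n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up stochastic matrix P on the state space {0, 1, 2}.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
lam = np.array([1.0, 0.0, 0.0])  # initial distribution: unit mass at state 0

def simulate_chain(P, lam, n_steps, rng):
    """Sample one path of Markov(lam, P)."""
    x = rng.choice(len(lam), p=lam)
    path = [x]
    for _ in range(n_steps):
        x = rng.choice(P.shape[1], p=P[x])
        path.append(x)
    return path

# Empirical distribution of X_n versus the exact lam @ P^n.
n = 5
samples = [simulate_chain(P, lam, n, rng)[-1] for _ in range(20000)]
empirical = np.bincount(samples, minlength=3) / len(samples)
exact = lam @ np.linalg.matrix_power(P, n)
print(empirical, exact)  # the two vectors should be close
```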
A measure $\lambda$ is invariant with respect to $P$ if $\lambda P = \lambda$. Suppose $\lambda$ is invariant for $P$ and let $(X_n)_{n \ge 0}$ be $\mathrm{Markov}(\lambda, P)$. Then for all integers $m \ge 0$, $(X_{m+n})_{n \ge 0}$ is also $\mathrm{Markov}(\lambda, P)$. Let $I$ be finite. Suppose that there exists $i \in I$ such that $p_{ij}^{(n)} \to \pi_j$ as $n \to \infty$ for all $j \in I$. Then $\pi = (\pi_j : j \in I)$ is an invariant distribution.
We say that state $i$ leads to state $j$, and write $i \to j$, if there exists $n \ge 0$ such that $p_{ij}^{(n)} > 0$. We say that $i$ communicates with $j$, and write $i \leftrightarrow j$, if $i \to j$ and $j \to i$. Communication classes are the equivalence classes of $\leftrightarrow$. A chain or a transition matrix is said to be irreducible if its state space consists of a single communication class. Let $P$ be irreducible and have an invariant distribution $\pi$. Suppose $(X_n)_{0 \le n \le N}$ is $\mathrm{Markov}(\pi, P)$, and put $Y_n = X_{N-n}$. Then $(Y_n)_{0 \le n \le N}$ is $\mathrm{Markov}(\pi, \hat{P})$, where $\hat{P} = (\hat{p}_{ij})$ is given by $\pi_j \hat{p}_{ji} = \pi_i p_{ij}$, and $\hat{P}$ is irreducible with invariant distribution $\pi$. $(Y_n)_{0 \le n \le N}$ is called the time-reversal chain of $(X_n)_{0 \le n \le N}$.
The number of visits to a state $i$ in $n$ steps is defined by $V_i(n) = \sum_{m=0}^{n-1} \mathbf{1}_{\{X_m = i\}}$. Recall the first passage time $T_i$ above, and define the expected return time $m_i = \mathbb{E}_i[T_i]$. A state $i$ is recurrent if $\mathbb{P}_i(T_i < \infty) = 1$. If additionally the expected return time $m_i$ is finite then we say that $i$ is positive recurrent. It can be proved that for an irreducible transition matrix the existence of a single positive recurrent state is equivalent to the positive recurrence of all states. Using these we can state the following ergodic theorem.
Let $P$ be irreducible and let $\lambda$ be any distribution. If $(X_n)_{n \ge 0}$ is $\mathrm{Markov}(\lambda, P)$ then $$\mathbb{P}\Big( \frac{V_i(n)}{n} \to \frac{1}{m_i} \text{ as } n \to \infty \Big) = 1.$$ If the chain is positively recurrent, then for any bounded function $f : I \to \mathbb{R}$ we have $$\mathbb{P}\Big( \frac{1}{n} \sum_{k=0}^{n-1} f(X_k) \to \bar{f} \text{ as } n \to \infty \Big) = 1, \quad \text{where } \bar{f} = \sum_{i \in I} \pi_i f_i$$ and $\pi$ is the unique invariant distribution.
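The ergodic theorem can be seen empirically (a sketch with a made-up two-state chain whose invariant distribution is computed by hand): the long-run fraction of time spent in each state approaches $\pi$.

```python
import numpy as np

rng = np.random.default_rng(1)

# An irreducible chain on {0, 1} (made-up example); its invariant
# distribution solves pi P = pi, here pi = (0.75, 0.25).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])

n = 200_000
x, visits = 0, np.zeros(2)
for _ in range(n):
    visits[x] += 1
    x = rng.choice(2, p=P[x])

print(visits / n)  # long-run fraction of time in each state, close to pi
```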
Martingales
Let $(X_n)_{n \ge 0}$ be a Markov chain. Let $\mathcal{F}_n$ be the collection of all measurable sets in the probability space which depend only on $X_0, \dots, X_n$. The sequence $(\mathcal{F}_n)_{n \ge 0}$ is called the filtration of the Markov chain $(X_n)_{n \ge 0}$. A process $(M_n)_{n \ge 0}$ is called adapted if $M_n$ depends only on $X_0, \dots, X_n$. A process is said to be integrable if $\mathbb{E}|M_n| < \infty$ for all $n$. An adapted integrable process is called a martingale if $\mathbb{E}[M_{n+1} - M_n \mid A] = 0$ for all $n$ and all $A \in \mathcal{F}_n$ with $\mathbb{P}(A) > 0$. In the case of the Markov chain $(X_n)_{n \ge 0}$, $\mathcal{F}_n$ consists of countable unions of elementary events $\{X_0 = i_0, \dots, X_n = i_n\}$, so the martingale property is equivalent to $$\mathbb{E}[M_{n+1} - M_n \mid X_0 = i_0, \dots, X_n = i_n] = 0$$ for all states $i_0, \dots, i_n$ and all $n$.
In terms of the filtration, a random variable $T$ is a stopping time if $\{T = n\} \in \mathcal{F}_n$ for all $n$.
Let $(M_n)_{n \ge 0}$ be a martingale and let $T$ be a stopping time. If there exists $N$ such that $T \le N$, or if $\mathbb{E}[T] < \infty$ and there exists $C$ such that $|M_{n+1} - M_n| \le C$ whenever $n < T$, then $\mathbb{E}[M_T] = \mathbb{E}[M_0]$. This result is called the optional stopping theorem.
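A classic consequence, sketched numerically: a simple symmetric random walk is a martingale with bounded increments, so stopping it at the first exit from an interval (a made-up gambler's-ruin setup) leaves its mean unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simple symmetric random walk M_n started at 3 is a martingale; stop at
# T = first hit of 0 or 10 (E[T] < inf, increments bounded by 1).
# Optional stopping gives E[M_T] = M_0 = 3, i.e. P(hit 10 before 0) = 0.3.
def run(start=3, low=0, high=10):
    x = start
    while low < x < high:
        x += rng.choice([-1, 1])
    return x

hits = np.array([run() for _ in range(20000)])
print(hits.mean())  # close to 3
```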
Brownian Motion and Markov Process
A real-valued process $(X_t)_{t \ge 0}$, where $X_t : \Omega \to \mathbb{R}$, is called continuous if $$\mathbb{P}(\{\omega : t \mapsto X_t(\omega) \text{ is continuous}\}) = 1.$$ A continuous real-valued process $(B_t)_{t \ge 0}$ is called a Brownian motion if $B_0 = 0$ and for all $0 = t_0 < t_1 < \dots < t_n$ the increments $B_{t_1} - B_{t_0}, \dots, B_{t_n} - B_{t_{n-1}}$ are independent Gaussian random variables of mean $0$ and variances $t_1 - t_0, \dots, t_n - t_{n-1}$. Brownian motion started from $x$ is any process $(x + B_t)_{t \ge 0}$ where $(B_t)_{t \ge 0}$ is a Brownian motion. The Brownian motion on $\mathbb{R}^d$ is a process $(B^1_t, \dots, B^d_t)_{t \ge 0}$ whose components are independent Brownian motions in 1D.
A theorem of Wiener says that Brownian motion exists. Using the central limit theorem, it can be shown that Brownian motion shows up as a universal scaling limit of random walks; the precise statement is as follows:
Let $(S_n)_{n \ge 0}$ be a discrete-time, real-valued random walk with steps of mean $0$ and variance $\sigma^2$. For $N \ge 1$ put $$S^{[N]}_t = \frac{1}{\sigma \sqrt{N}} S_{Nt},$$ whereby at non-integer $Nt$, the value of $S_{Nt}$ is the linear interpolation between the values at the two nearest integer times. Then for all bounded continuous functions $f$ and all $t \ge 0$ we have $$\mathbb{E}\big[f(S^{[N]}_t)\big] \to \mathbb{E}[f(B_t)] \quad \text{as } N \to \infty.$$
The central limit theorem used in this proof says the following: let $(X_n)_{n \ge 1}$ be a sequence of independent and identically distributed random variables with mean $0$ and variance $\sigma^2$. Then for all bounded continuous functions $f$, $$\mathbb{E}\Big[f\Big(\frac{X_1 + \dots + X_n}{\sigma \sqrt{n}}\Big)\Big] \to \int_{\mathbb{R}} f(x) \phi(x) \, dx,$$ where $\phi$ is the Gaussian density with mean $0$ and variance $1$.
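The scaling limit can be checked numerically (a sketch with $\pm 1$ coin-flip steps and the bounded test function $f = \cos$, for which $\mathbb{E}[\cos(B_1)] = e^{-1/2}$ is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(3)

# Rescaled random walk S^{[N]}_1 = S_N / (sigma sqrt(N)) with +-1 steps
# (mean 0, variance 1); by the CLT its law approaches that of B_1 ~ N(0, 1).
N, trials = 500, 20000
steps = rng.choice([-1.0, 1.0], size=(trials, N))
S1 = steps.sum(axis=1) / np.sqrt(N)

# Compare E[f(S^{[N]}_1)] with E[f(B_1)] for f = cos.
print(np.cos(S1).mean(), np.exp(-0.5))  # close
```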
For any bounded measurable function $f$, $$\mathbb{E}_x[f(B_t)] = \int_{\mathbb{R}^d} p(t, x, y) f(y) \, dy,$$ where $p(t, x, y) = (2\pi t)^{-d/2} e^{-|y - x|^2 / 2t}$ is the transition density. The transition semigroup $(P_t)_{t \ge 0}$ is given by $(P_t f)(x) = \mathbb{E}_x[f(B_t)]$; in particular it satisfies the semigroup property $P_{s+t} = P_s P_t$. Observe that for $Z \sim N(0, I)$, $$\mathbb{E}_x[f(B_t)] = \mathbb{E}\big[f(x + \sqrt{t}\, Z)\big].$$ If $f$ has bounded derivatives then $f(x + \sqrt{t}\, Z) = f(x) + \sqrt{t}\, Z \cdot \nabla f(x) + \frac{t}{2} Z^\top \nabla^2 f(x) Z + o(t)$. As $t \to 0$ the expectation approaches $f(x) + \frac{t}{2} \Delta f(x) + o(t)$, which indicates the generator for Brownian motion is $\frac{1}{2} \Delta$. One can see how this generator shows up in the martingale characterization of Brownian motion:
Let $(X_t)_{t \ge 0}$ be a continuous process taking values in $\mathbb{R}^d$ with filtration $(\mathcal{F}_t)_{t \ge 0}$. Then $(X_t)_{t \ge 0}$ is a Brownian motion if and only if for every bounded function $f$ which has bounded first and second derivatives, the following process is a martingale: $$M^f_t = f(X_t) - f(X_0) - \int_0^t \frac{1}{2} \Delta f(X_s) \, ds.$$
The invariant measure for Brownian motion is the Lebesgue measure $dx$. The convergence to equilibrium and ergodic theorems for Brownian motion assert the following:
Let $(B_t)_{t \ge 0}$ be a Brownian motion on $\mathbb{R}$ and let $f$ be a continuous periodic function with period $1$, with mean value $\bar{f} = \int_0^1 f(x) \, dx$. Then for all $x$, $\mathbb{E}_x[f(B_t)] \to \bar{f}$ as $t \to \infty$, and $$\mathbb{P}\Big( \frac{1}{t} \int_0^t f(B_s) \, ds \to \bar{f} \text{ as } t \to \infty \Big) = 1.$$
Continuous Time Markov Chain
Let $I$ be a countable set. A continuous time random process with values in $I$ is a family of random variables $(X_t)_{t \ge 0}$ with $X_t : \Omega \to I$. The process is called right-continuous if for all $\omega \in \Omega$ and $t \ge 0$ there exists $\varepsilon > 0$ such that $X_s(\omega) = X_t(\omega)$ for all $s \in [t, t + \varepsilon]$. The visual picture of a right-continuous process is a kind of multi-step function, where the path is constant for some time, and then jumps. The places where transitions are made are unsurprisingly called the jump times of the process: $J_0 = 0$ and $J_{n+1} = \inf\{t \ge J_n : X_t \ne X_{J_n}\}$, where we define $\inf \emptyset = \infty$. The durations of time the process stays constant are called the holding times: $S_n = J_n - J_{n-1}$ if $J_{n-1} < \infty$, else $S_n = \infty$. To such a continuous-time process with right-continuous paths we can associate a discrete-time process called a jump process, defined by $Y_n = X_{J_n}$ where $J_n < \infty$. If this process is a discrete-time Markov chain then we call it the jump chain. It turns out that each continuous time Markov chain (which we define rigorously below) has a jump process that is a discrete-time Markov chain, and holding times that are independent exponential random variables. Recall that a random variable $T$ has exponential distribution with parameter $\lambda$ if $\mathbb{P}(T > t) = e^{-\lambda t}$ for all $t \ge 0$.
It can be shown that the probability of any event dependent on a right-continuous process can be determined by the finite-dimensional distributions, i.e. by knowing $\mathbb{P}(X_{t_0} = i_0, \dots, X_{t_n} = i_n)$ for all $n \ge 0$, all times $0 \le t_0 < \dots < t_n$, and all states $i_0, \dots, i_n$. Let $I$ be a countable set. A $Q$-matrix is a matrix $Q = (q_{ij} : i, j \in I)$ such that
$0 \le -q_{ii} < \infty$ for all $i$
$q_{ij} \ge 0$ for all $i \ne j$
$\sum_{j \in I} q_{ij} = 0$ for all $i$
To any $Q$-matrix we can associate a jump matrix $\Pi = (\pi_{ij} : i, j \in I)$ where, writing $q_i = -q_{ii}$, the off-diagonal entries are $\pi_{ij} = q_{ij} / q_i$ if $q_i \ne 0$, else $\pi_{ij} = 0$. For the diagonal entries, $\pi_{ii} = 0$ if $q_i \ne 0$, else $\pi_{ii} = 1$. The jump matrix is stochastic, and as we'll see, the jump matrix plays the role of transition matrix for the jump chain of a continuous-time Markov chain.
Let $(X_t)_{t \ge 0}$ be a right continuous process with values in a finite set $I$. Let $Q$ be a $Q$-matrix. We say that $(X_t)_{t \ge 0}$ is a continuous-time Markov chain with generator matrix $Q$ if for all $n$, all times $0 \le t_0 < t_1 < \dots < t_{n+1}$, and all states $i_0, \dots, i_{n+1}$, $$\mathbb{P}(X_{t_{n+1}} = i_{n+1} \mid X_{t_0} = i_0, \dots, X_{t_n} = i_n) = p_{i_n i_{n+1}}(t_{n+1} - t_n),$$ where $P(t) = (p_{ij}(t) : i, j \in I)$ is the solution of the forward equation $P'(t) = P(t) Q$, $P(0) = I$. $(P(t))_{t \ge 0}$ is called the semigroup of the chain. If $X_0$ is distributed according to $\lambda$ then we write $\mathrm{Markov}(\lambda, Q)$ for the continuous time Markov chain. In particular the transition probability from state $i$ to $j$ in time $t$ is given by entries of the matrix $P(t) = e^{tQ}$. The generator governs the infinitesimal transitions in the sense of $p_{ij}(t) = \delta_{ij} + q_{ij} t + o(t)$ as $t \downarrow 0$.
For additional precision, let's first make sense of what it means to raise a matrix to a non-integral exponent. For any matrix $A$ the following series converges componentwise: $$e^A = \sum_{k=0}^{\infty} \frac{A^k}{k!}.$$ Thus for a matrix $P$, we can define non-integral powers $P^t$ by first finding a $Q$ such that $P = e^Q$ and then defining $P^t = e^{tQ}$ using the matrix exponential. Let $Q$ be a matrix on a finite set $I$, and put $P(t) = e^{tQ}$. Then $(P(t))_{t \ge 0}$ satisfies
is the unique solution to the forward equation $P'(t) = P(t) Q$, $P(0) = I$
is the unique solution to the backward equation $P'(t) = Q P(t)$, $P(0) = I$
$\frac{d^k}{dt^k} P(t) \big|_{t = 0} = Q^k$ for $k = 0, 1, 2, \dots$
One can prove that a matrix $Q$ on a finite set $I$ is a $Q$-matrix if and only if $P(t) = e^{tQ}$ is a stochastic matrix for all $t \ge 0$.
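These properties can be checked numerically (a sketch with a made-up three-state $Q$-matrix; the power-series `expm` below is a minimal stand-in for a library matrix exponential, adequate for small matrices):

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential e^A via its power series (fine for small matrices)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out += term
    return out

# A made-up Q-matrix on a 3-state space: nonnegative off-diagonal entries,
# rows summing to zero.
Q = np.array([[-2.0,  1.0,  1.0],
              [ 0.5, -1.0,  0.5],
              [ 1.0,  1.0, -2.0]])

t = 0.7
P = expm(t * Q)            # the semigroup P(t) = e^{tQ}
print(P.sum(axis=1))       # each row sums to 1, so P(t) is stochastic
print(P.min() >= 0)        # with nonnegative entries

# Forward equation P'(t) = P(t) Q, checked with a central finite difference.
h = 1e-5
dP = (expm((t + h) * Q) - expm((t - h) * Q)) / (2 * h)
print(np.abs(dP - P @ Q).max())  # ~ 0
```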
In a picturesque formulation in terms of the jump chain and holding times, we can say that a continuous-time Markov chain $\mathrm{Markov}(\lambda, Q)$ is a right-continuous process $(X_t)_{t \ge 0}$ with values in a finite set $I$, whose jump chain $(Y_n)_{n \ge 0}$ is a discrete time Markov chain $\mathrm{Markov}(\lambda, \Pi)$, where $\Pi$ is the jump matrix associated to $Q$; moreover, conditioned on $Y_0 = i_0, \dots, Y_{n-1} = i_{n-1}$, the holding times $S_1, \dots, S_n$ are independent exponential random variables with parameters $q_{i_0}, \dots, q_{i_{n-1}}$, where $q_i = -q_{ii}$ in the matrix $Q$.
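This formulation doubles as a simulation algorithm. A sketch with a made-up two-state generator: run the jump-chain/holding-time construction up to time $t$ and compare the empirical law of $X_t$ against the corresponding row of $e^{tQ}$ (the power-series `expm` is only for the comparison).

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential e^A via its power series (fine for small matrices)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out += term
    return out

rng = np.random.default_rng(4)

Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])   # made-up 2-state Q-matrix
q = -np.diag(Q)                # holding rates q_i (all positive here)
Pi = Q / q[:, None]            # jump matrix: off-diagonal q_ij / q_i ...
np.fill_diagonal(Pi, 0.0)      # ... and zero diagonal

def sample_X_t(i, t):
    """Jump-chain / holding-time construction of X_t started from i."""
    clock, state = 0.0, i
    while True:
        clock += rng.exponential(1.0 / q[state])  # holding time ~ Exp(q_i)
        if clock > t:
            return state
        state = rng.choice(2, p=Pi[state])        # jump according to Pi

t = 0.5
samples = [sample_X_t(0, t) for _ in range(20000)]
empirical = np.bincount(samples, minlength=2) / len(samples)
print(empirical, expm(t * Q)[0])  # empirical law of X_t vs first row of e^{tQ}
```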
A $Q$-matrix is called irreducible if its associated jump chain is irreducible. $Q$ is explosive if there exists a state $i$ such that $\mathbb{P}_i(\sup_n J_n < \infty) > 0$, i.e. with positive probability the chain makes infinitely many jumps in finite time. We say that $\lambda$ is an invariant measure of $Q$ if $\lambda Q = 0$, which is equivalent to $\mu \Pi = \mu$ for $\mu_i = \lambda_i q_i$.
Time reversal for continuous-time Markov chains works as follows. Suppose $Q$ is irreducible and non-explosive and has an invariant distribution $\lambda$. Fix $T > 0$ and let $(X_t)_{0 \le t \le T}$ be $\mathrm{Markov}(\lambda, Q)$. Put $\hat{X}_t = X_{T - t}$. Then $(\hat{X}_t)_{0 \le t \le T}$ is $\mathrm{Markov}(\lambda, \hat{Q})$, where $\hat{Q} = (\hat{q}_{ij})$ satisfies $\lambda_j \hat{q}_{ji} = \lambda_i q_{ij}$. $\hat{Q}$ is irreducible, non-explosive, and has invariant measure $\lambda$.
Diffusion Models
For comparison with usual diffusion models, where the state space is not countable, let's sketch some relevant concepts, albeit in a less rigorous fashion than the above.
Let $(X_t)_{t \in [0, T]}$ be a stochastic process on $\mathbb{R}^d$ associated with the equation $$dX_t = b(X_t, t) \, dt + \sigma(X_t, t) \, dB_t,$$ where $(B_t)_{t \ge 0}$ is the $d$-dimensional Brownian motion and $\sigma$ is a matrix valued function. Fix $T > 0$, put $\bar{X}_t = X_{T - t}$, and we are interested in the process $(\bar{X}_t)_{t \in [0, T]}$. Under nice analytical conditions, this time-reversed process satisfies an equation $$d\bar{X}_t = \bar{b}(\bar{X}_t, t) \, dt + \bar{\sigma}(\bar{X}_t, t) \, d\bar{B}_t,$$ whose coefficients can be computed from $b, \sigma$ and the probability density functions $p_t$ of the random variables $X_t$; for $\sigma$ independent of $x$ one has $\bar{b}(x, t) = -b(x, T - t) + \sigma \sigma^\top(T - t) \nabla_x \log p_{T - t}(x)$ and $\bar{\sigma}(x, t) = \sigma(T - t)$. The main point of diffusion modelling is to model $\nabla_x \log p_t$ with neural networks.
Let us consider some processes. The Ornstein-Uhlenbeck process is defined by an equation with $b(x, t) = -\theta x$ and $\sigma(x, t) = \sigma I_d$, where $\theta > 0$ and $\sigma > 0$ are constants and $I_d$ is the identity matrix. Another example is the overdamped Langevin process with $b(x, t) = -\nabla U(x)$ for a potential $U$ and $\sigma(x, t) = \sqrt{2}\, I_d$, which reduces to the OU process with $U(x) = \frac{\theta}{2} |x|^2$.
Another process is defined by the variance exploding equation: let $\sigma(t)$ be an increasing function, and put $b = 0$ and $\sigma(x, t) = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, I_d$ for $t \in [0, T]$. The process related to the Markov chain used in the denoising diffusion probabilistic model is given by the variance preserving SDE, defined for $t \in [0, T]$ by $b(x, t) = -\frac{1}{2} \beta(t) x$ and $\sigma(x, t) = \sqrt{\beta(t)}\, I_d$, where $\beta(t) > 0$ for some schedule $\beta$. For the same $\beta$, the sub-variance preserving SDE has $b(x, t) = -\frac{1}{2} \beta(t) x$ and $\sigma(x, t) = \sqrt{\beta(t) \big(1 - e^{-2 \int_0^t \beta(s) \, ds}\big)}\, I_d$. Finally the contractive variance preserving process is associated to an equation with the same drift $b(x, t) = -\frac{1}{2} \beta(t) x$ and coefficients chosen so that the dynamics are contractive.
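The "variance preserving" name can be checked by simulation. A minimal Euler-Maruyama sketch, assuming a constant $\beta$ for simplicity (in practice $\beta$ is a schedule depending on $t$): started from the stationary law $N(0, 1)$, the marginal variance stays at $1$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Euler-Maruyama for the 1D VP SDE dX = -0.5*beta*X dt + sqrt(beta) dB,
# with a constant beta (a simplifying assumption).
beta, T, n_steps, n_paths = 1.0, 2.0, 400, 50000
dt = T / n_steps

x = rng.standard_normal(n_paths)  # start at the stationary law N(0, 1)
for _ in range(n_steps):
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(n_paths)

print(x.var())  # stays ~ 1: the process preserves variance
```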
The point of diffusion modelling is to approximate $\nabla_x \log p_t(x)$, called the score function, with a neural network $s_\theta(x, t)$, so for a given $t$ we would like to minimize the distance, with respect to the probability measure $p_t$, between the score function and the function implemented by the neural network, $$J_t(\theta) = \frac{1}{2} \mathbb{E}_{x \sim p_t} \big[ \| s_\theta(x, t) - \nabla_x \log p_t(x) \|^2 \big],$$ over a set of possible parameters $\theta$. Expanding and applying integration by parts we may transfer the derivative from the score to the neural net. In particular we assume the integral over the boundary vanishes in the limit. We can deduce $$J_t(\theta) = \mathbb{E}_{x \sim p_t} \Big[ \frac{1}{2} \| s_\theta(x, t) \|^2 + \nabla_x \cdot s_\theta(x, t) \Big] + \text{const},$$ so the equivalent problem is to minimize the expectation on the right. The divergence is taken with respect to $x$. This objective function can be computed by sampling $x_0 \sim p_0$ then sampling $x_t$ from $p_t(\cdot \mid x_0)$. In particular, for the example stochastic processes listed above this conditional density is a Gaussian, so sampling is easy. We slightly abuse notation and let $J_t$ be the objective function without the constant. A variant of this loss is time-weighted, namely for a weight $w(t) > 0$ we minimize $\int_0^T w(t) J_t(\theta) \, dt$, which is equivalent to minimizing $\mathbb{E}_{t \sim U[0, T]}[w(t) J_t(\theta)]$.
Since $s_\theta(\cdot, t) : \mathbb{R}^d \to \mathbb{R}^d$, to compute the divergence we must differentiate each of the $d$ components with respect to $x_1, \dots, x_d$, in other words $d$ backpropagation steps; this is computationally prohibitive. Let $\nabla_x s_\theta$ denote the Jacobian matrix of partial derivatives; a trace estimation theorem (Hutchinson's estimator) says for a matrix $A$, $$\mathrm{tr}(A) = \mathbb{E}_v [v^\top A v] \quad \text{for } v \text{ with } \mathbb{E}[v v^\top] = I,$$ so that $\nabla_x \cdot s_\theta = \mathrm{tr}(\nabla_x s_\theta) = \mathbb{E}_v[v^\top \nabla_x s_\theta \, v]$, and now for a fixed $v$ we only run a single back propagation from the dot product $v^\top s_\theta$. Some works have shown that often the expectation can be estimated with a single sample.
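The estimator itself is easy to demonstrate on a fixed matrix (a sketch with an arbitrary made-up matrix standing in for the Jacobian, and Rademacher probe vectors, which satisfy $\mathbb{E}[v v^\top] = I$):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hutchinson's estimator: tr(A) = E[v^T A v] for probe vectors v with
# E[v v^T] = I, here Rademacher (+-1) vectors.
d = 50
A = rng.standard_normal((d, d))     # arbitrary fixed matrix (stand-in for a Jacobian)

v = rng.choice([-1.0, 1.0], size=(10000, d))
estimates = np.einsum('ni,ij,nj->n', v, A, v)  # one quadratic form per probe
print(estimates.mean(), np.trace(A))  # close
```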
We finally come to denoising score matching. First observe $$\nabla_x \log p_t(x) = \frac{\nabla_x p_t(x)}{p_t(x)} = \frac{1}{p_t(x)} \int \nabla_x p_t(x \mid x_0) \, p_0(x_0) \, dx_0.$$ From this observation, $$J_t(\theta) = \frac{1}{2} \mathbb{E}_{x_0 \sim p_0} \mathbb{E}_{x_t \sim p_t(\cdot \mid x_0)} \big[ \| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \|^2 \big] + C,$$ where $C$ is independent of $\theta$. We slightly abuse notation and let $J_t$ be the objective function without the constant. Thus we can optimize for the expectation above. In practice we can choose a process so that $x_t$ conditioned on $x_0$ is a Gaussian random variable, so that $\nabla_{x_t} \log p_t(x_t \mid x_0)$ has a known closed form.
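A one-dimensional sanity check (a sketch with made-up choices: data $p_0 = N(0,1)$, additive Gaussian noise, and a linear "network" $s_\theta(x) = \theta x$ so the DSM minimizer has a closed form): the minimizer recovers the true score of $p_t = N(0, 1 + \sigma^2)$, whose coefficient is $-1/(1 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Denoising score matching in 1D: x_t = x_0 + sigma*eps gives the conditional
# score grad_x log p(x_t | x_0) = -(x_t - x_0) / sigma^2.
sigma, n = 0.5, 200000
x0 = rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = x0 + sigma * eps
target = -(xt - x0) / sigma**2          # conditional score values

# Minimize E[(theta*xt - target)^2] over theta in closed form (least squares).
theta = (xt * target).mean() / (xt**2).mean()
print(theta, -1 / (1 + sigma**2))  # close: DSM recovers the marginal score
```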
For a stochastic process defined by an SDE $dX_t = b(X_t, t) \, dt + \sigma(t) \, dB_t$ with densities $p_t$, there exists a vector field $$v(x, t) = b(x, t) - \frac{1}{2} \sigma(t) \sigma(t)^\top \nabla_x \log p_t(x),$$ whose closed form can be computed from the coefficients of the SDE and the densities; for the ODE $\frac{d}{dt} Y_t = v(Y_t, t)$ with $Y_0$ distributed as $X_0$, then $Y_t$ and $X_t$ have the same distribution for every $t$. The ODE is called the probability flow ODE.
TBD reparametrization trick, consistency model.
Flow Matching
Let $(\mathbb{R}^d, \mathcal{B})$ be a measurable space, on which $p_0, p_1$ are probability measures. Let $u : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$, $(x, t) \mapsto u_t(x)$, be a sufficiently smooth time-varying vector field. Let $\psi_t(x)$ satisfy $$\frac{d}{dt} \psi_t(x) = u_t(\psi_t(x)), \qquad \psi_0(x) = x.$$ For each $x$, this initial value problem can be thought of as giving rise to a mapping $\psi_t : \mathbb{R}^d \to \mathbb{R}^d$. We want to choose $u$ such that $(\psi_t)_* p_0 = p_t$ for every $t$, for a path of measures $(p_t)_{t \in [0, 1]}$ interpolating between $p_0$ and $p_1$. The notation $\psi_* \mu$ stands for a measure such that for every measurable set $A$, its measure is $(\psi_* \mu)(A) = \mu(\psi^{-1}(A))$. The machine learning problem asks how we can parametrize the vector field by training a neural network $u_\theta$, given that we can sample from $p_0$ and $p_1$. We might want to minimize $\mathbb{E}\, \| u_\theta(X_t, t) - u_t(X_t) \|^2$, but the flow matching loss is weighted by a choice of distribution over $t$ and by $X_t \sim p_t$.
Consider a first example. Suppose that these measures are absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$, giving rise to probability density functions $p_t(x)$ for $t \in [0, 1]$. We can express $p_t$ in terms of the conditional density of $X_t$ given $X_1$, that is, for all $x$, $$p_t(x) = \int p_{t \mid 1}(x \mid x_1) \, p_1(x_1) \, dx_1.$$ Let us review some notation and terminology in this equation. If $X, Y$ are real-valued random variables on $\Omega$, their joint density function (suppose it exists) is a function $p_{X, Y}$ such that for all events $A \times B$, $$\mathbb{P}(X \in A, Y \in B) = \int_A \int_B p_{X, Y}(x, y) \, dy \, dx.$$ The marginal density of $X$ with respect to the joint density is $p_X(x) = \int p_{X, Y}(x, y) \, dy$. The conditional density of $X$ given $Y$ is a function $$p_{X \mid Y}(x \mid y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}.$$ The more common notation is $p(x \mid y)$. We shorthand $p_{X_t \mid X_1}$ as $p_{t \mid 1}$. Returning to the earlier discussion of $p_t$, suppose we choose $p_0$ to be the normal density $N(0, I_d)$, and choose $X_t = (1 - t) X_0 + t X_1$. Indeed, if $X_0 \sim N(0, I_d)$ then conditioned on $X_1 = x_1$, $X_t$ is a normal variable with mean $t x_1$ and variance $(1 - t)^2 I_d$. Thus the above convex combination of $X_0$ and $X_1$ has the desired conditional density $p_{t \mid 1}(x \mid x_1) = N(x; t x_1, (1 - t)^2 I_d)$.
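The claimed conditional law is easy to verify by sampling (a 1D sketch with made-up values $t = 0.3$, $x_1 = 2$):

```python
import numpy as np

rng = np.random.default_rng(8)

# With X_0 ~ N(0,1) and X_t = (1-t) X_0 + t X_1, conditioning on X_1 = x1
# gives X_t ~ N(t*x1, (1-t)^2): check mean and standard deviation by sampling.
t, x1, n = 0.3, 2.0, 100000
x0 = rng.standard_normal(n)
xt = (1 - t) * x0 + t * x1

print(xt.mean(), t * x1)   # ~ 0.6
print(xt.std(), 1 - t)     # ~ 0.7
```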
The machine learning problem is the following. $X_0$ is a standard normal random variable, $X_1$ is distributed according to a data distribution. For $t \in [0, 1]$ we choose $X_t = (1 - t) X_0 + t X_1$, giving rise to measures $p_t$. Suppose $X_t = \psi_t(X_0)$ for a flow map originating from the ODE as we have seen above; how can we train a model to predict the vector field $u$? In this particular example, suppose we fix the boundary value $X_1 = x_1$; then it is clear the corresponding vector field is $$u_t(x \mid x_1) = \frac{x_1 - x}{1 - t}.$$ This is a particular example of a conditional vector field. More generally one can optimize for the conditional flow matching loss $$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, x_1 \sim p_1,\, x \sim p_{t \mid 1}(\cdot \mid x_1)} \big[ \| u_\theta(x, t) - u_t(x \mid x_1) \|^2 \big].$$ It turns out that $\nabla_\theta \mathcal{L}_{\mathrm{CFM}} = \nabla_\theta \mathcal{L}_{\mathrm{FM}}$.
In our example, we train our neural net by sampling $t \sim U[0, 1]$, $x_0 \sim N(0, I_d)$, and $x_1 \sim p_{\mathrm{data}}$, and putting $x_t = (1 - t) x_0 + t x_1$. The loss term is $$\| u_\theta(x_t, t) - (x_1 - x_0) \|^2,$$ since $u_t(x_t \mid x_1) = \frac{x_1 - x_t}{1 - t} = x_1 - x_0$.
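One training example can be sketched directly (the shifted Gaussian standing in for a data sample is a made-up assumption; in practice $x_1$ comes from the dataset). Note how the conditional target $(x_1 - x_t)/(1 - t)$ simplifies to $x_1 - x_0$:

```python
import numpy as np

rng = np.random.default_rng(9)

# One conditional-flow-matching training example: sample t, x0, x1, form
# x_t = (1-t) x0 + t x1, and regress u_theta(x_t, t) onto the target.
t = rng.uniform()
x0 = rng.standard_normal(2)          # noise sample
x1 = rng.standard_normal(2) + 5.0    # stand-in for a data sample (assumption)
xt = (1 - t) * x0 + t * x1

target = (x1 - xt) / (1 - t)
print(np.allclose(target, x1 - x0))  # True: the target is simply x1 - x0

def loss(u_pred):
    """Squared-error CFM loss term for a model prediction at (xt, t)."""
    return np.sum((u_pred - target) ** 2)
```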
Xue J. Zhao © 2026