In this post, I’ll try to explain the core idea behind discrete diffusion models. Let’s define our setting:

  • Data (\(x_0\)): A sequence of tokens \(x_0 = (x_0^{(1)}, x_0^{(2)}, \dots, x_0^{(L)})\).
  • Vocabulary (\(V\)): Each token comes from a discrete vocabulary of size \(K\). For text, this is \(V = \{\text{"a"}, \text{"aardvark"}, \dots, \text{"zygote"}, \text{"[MASK]"}\}\).
  • Representation: We can represent a single token \(x^{(i)}\) as a one-hot vector of size \(K\).
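For concreteness, here is a tiny sketch of that representation (my own toy example in NumPy, with an assumed vocabulary size):

```python
import numpy as np

K = 6                             # toy vocabulary size (assumption for this example)
x0_ids = np.array([2, 0, 5, 3])   # a length-L sequence of token indices

# One-hot encode: shape (L, K); row i is the one-hot vector for token x_0^(i)
x0_onehot = np.eye(K)[x0_ids]
print(x0_onehot.shape)            # (4, 6)
```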

The core idea is to replace the addition of Gaussian noise with multiplication by a transition matrix.


The forward process \(q(x_t \vert x_{t-1})\) defines the probability of transitioning from a token at step \(t-1\) to a token at step \(t\). This is defined by a \(K \times K\) stochastic matrix \(Q_t\). The element \([Q_t]_{ij}\) represents the probability of a token \(i\) turning into a token \(j\) in one step.

\begin{equation} q(x_t^{(n)} = j \vert x_{t-1}^{(n)} = i) = [Q_t]_{ij} \end{equation}

If we use one-hot vectors for \(x_{t-1}\) (where it’s a row vector), this transition is a simple matrix multiplication:

\begin{equation} q(x_t \vert x_{t-1}) = \text{Cat}(x_t ; p = x_{t-1} Q_t) \end{equation}

This means the probability distribution for \(x_t\) is a Categorical distribution, with probabilities given by the vector \(x_{t-1} Q_t\). This matrix \(Q_t\) is not learned; it is a fixed schedule, just like the \(\beta_t\) schedule in continuous models. The key is in how we design \(Q_t\). Here are the two most common designs for text:
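Before getting to those designs, here is a minimal sketch (my own, in NumPy) of what one noising step looks like mechanically, assuming `Q_t` is some K × K row-stochastic matrix and `x_prev_onehot` is a one-hot row vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev_onehot, Q_t, rng=rng):
    """One noising step: sample x_t ~ Cat(x_t; p = x_{t-1} Q_t)."""
    probs = x_prev_onehot @ Q_t                # the row of Q_t selected by x_{t-1}
    next_id = rng.choice(len(probs), p=probs)  # categorical sample
    return np.eye(len(probs))[next_id]         # return x_t as a one-hot row vector
```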

Design 1: Uniform (“Token-Switching”) Diffusion

This matrix says: at step \(t\), with probability \(1-\beta_t\) keep the token the same, and with probability \(\beta_t\), switch it to any token in the vocabulary (including itself) with uniform probability \(1/K\).

Let \(\beta_t\) be the “noise” schedule. The matrix \(Q_t\) is:

\begin{equation} Q_t = (1 - \beta_t) \cdot I + \beta_t \cdot \frac{1}{K} \mathbb{1}\mathbb{1}^T \end{equation}

  • \(I\) is the \(K \times K\) identity matrix.
  • \(\mathbb{1}\mathbb{1}^T\) is a \(K \times K\) matrix of all ones.
  • \([Q_t]_{ij} = \beta_t/K\) if \(i \neq j\), and \([Q_t]_{ii} = (1 - \beta_t) + \beta_t/K\) on the diagonal (the extra \(\beta_t/K\) comes from the uniform re-draw landing on the same token).
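As a sanity check, here is a minimal NumPy sketch (my own) of this construction; the assertion confirms every row is a valid categorical distribution:

```python
import numpy as np

def uniform_Q(beta_t, K):
    """Uniform ("token-switching") matrix: (1 - beta_t) * I + (beta_t / K) * 11^T."""
    return (1.0 - beta_t) * np.eye(K) + (beta_t / K) * np.ones((K, K))

Q = uniform_Q(beta_t=0.1, K=5)
assert np.allclose(Q.sum(axis=1), 1.0)   # each row sums to 1
```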

Design 2: Absorbing (“Masking”) Diffusion

This is the most common for text, as it’s analogous to BERT. We add a special [MASK] token to the vocabulary (let’s say it’s at index \(m\)).

This matrix says: at step \(t\), with probability \(1-\beta_t\) keep the token the same, and with probability \(\beta_t\), replace it with the [MASK] token. Once a token becomes [MASK], it can never change back.

The matrix \(Q_t\) looks like this:

\begin{equation} [Q_t]_{ij} = \left\{ \begin{array}{ll} 1 & \text{if } i = j = m \text{ ([MASK] stays [MASK])} \\ 1 - \beta_t & \text{if } i = j \neq m \text{ (Token stays itself)} \\ \beta_t & \text{if } j = m,\, i \neq m \text{ (Token becomes [MASK])} \\ 0 & \text{otherwise ([MASK] never becomes a token)} \end{array} \right. \end{equation}

This is an “absorbing state,” which is why it’s called absorbing diffusion.
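Here is the analogous sketch (again my own, in NumPy), assuming for illustration that [MASK] sits at the last index of a toy vocabulary:

```python
import numpy as np

def absorbing_Q(beta_t, K, m):
    """Absorbing ("masking") matrix with the [MASK] token at index m."""
    Q = (1.0 - beta_t) * np.eye(K)   # tokens keep their identity with prob. 1 - beta_t
    Q[:, m] += beta_t                # ...or jump to [MASK] with prob. beta_t
    Q[m, :] = 0.0
    Q[m, m] = 1.0                    # [MASK] is absorbing: it never transitions away
    return Q

Q = absorbing_Q(beta_t=0.1, K=5, m=4)
assert np.allclose(Q.sum(axis=1), 1.0)
```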

Just like in continuous models, we want to jump from \(x_0\) to \(x_t\) in one step. Thanks to the Markov property, this is just matrix multiplication:

\begin{equation} \bar{Q}_t = Q_1 Q_2 \dots Q_t \end{equation}

Then, the probability of \(x_t\) given \(x_0\) is:

\begin{equation} q(x_t \vert x_0) = \text{Cat}(x_t ; p = x_0 \bar{Q}_t) \end{equation}

This \(\bar{Q}_t\) matrix is pre-calculated and fixed. The element \([\bar{Q}_t]_{ij}\) gives the total probability that an initial token \(i\) will have become token \(j\) after \(t\) steps of noising.
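A minimal sketch (mine, with an assumed linear \(\beta_t\) schedule and the uniform \(Q_t\) for brevity) of precomputing \(\bar{Q}_t\) and sampling \(x_t\) directly from \(x_0\):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 100
betas = np.linspace(1e-3, 5e-2, T)   # assumed linear noise schedule

# Precompute Q_bar_t = Q_1 @ Q_2 @ ... @ Q_t for every t.
Q_bars, Q_bar = [], np.eye(K)
for beta in betas:
    Q_t = (1 - beta) * np.eye(K) + (beta / K) * np.ones((K, K))
    Q_bar = Q_bar @ Q_t
    Q_bars.append(Q_bar)

def sample_xt(x0_onehot, t):
    """Sample x_t ~ q(x_t | x_0) = Cat(x_t; p = x_0 Q_bar_t) in a single step."""
    probs = x0_onehot @ Q_bars[t - 1]
    return np.eye(K)[rng.choice(K, p=probs)]
```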

Now we turn to the reverse process. This is the “denoising” step where the neural network comes in: we need to learn a model \(p_\theta\) that approximates the “ground truth” posterior \(q(x_{t-1} \vert x_t, x_0)\).

Using Bayes’ theorem, we can derive this ground truth posterior:

\begin{equation} q(x_{t-1} \vert x_t, x_0) \propto q(x_t \vert x_{t-1}, x_0) \cdot q(x_{t-1} \vert x_0) \end{equation}

Because the forward process is Markovian, \(q(x_t \vert x_{t-1}, x_0) = q(x_t \vert x_{t-1})\), so:

\begin{equation} q(x_{t-1} \vert x_t, x_0) \propto \underbrace{q(x_t \vert x_{t-1})}_{\text{from } Q_t} \cdot \underbrace{q(x_{t-1} \vert x_0)}_{\text{from } \bar{Q}_{t-1}} \end{equation}

This gives us a (very messy) but fully computable categorical distribution for \(x_{t-1}\). In one-hot vector notation (with \(\odot\) as element-wise multiplication):

\begin{equation} p = \underbrace{x_t (Q_t)^T}_{\text{Prob. of } x_t \text{ from } x_{t-1}} \odot \underbrace{x_0 \bar{Q}_{t-1}}_{\text{Prob. of } x_{t-1} \text{ from } x_0} \end{equation}

\begin{equation} q(x_{t-1} \vert x_t, x_0) = \text{Cat}\left(x_{t-1} ; p' = \frac{p}{\sum p}\right) \end{equation}

This formula tells us, “Given the corrupted \(x_t\) and the original \(x_0\), what is the exact probability distribution for what \(x_{t-1}\) was?”
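In code, the posterior is just two vector-matrix products, an element-wise product, and a normalization (a sketch under the same one-hot conventions as above; `Q_bar_prev` stands for \(\bar{Q}_{t-1}\)):

```python
import numpy as np

def posterior(x_t, x_0, Q_t, Q_bar_prev):
    """q(x_{t-1} | x_t, x_0) as a length-K categorical; x_t and x_0 are one-hot row vectors."""
    fact1 = x_t @ Q_t.T        # prob. of reaching the observed x_t from each candidate x_{t-1}
    fact2 = x_0 @ Q_bar_prev   # prob. of each candidate x_{t-1} given x_0
    p = fact1 * fact2          # element-wise product
    return p / p.sum()         # normalize into a proper distribution
```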

The model \(p_\theta(x_{t-1} \vert x_t)\) is trained to match this ground truth distribution.

  1. Model \(\theta\): This is typically a Transformer. It takes \(x_t\) and the timestep \(t\) as input.
  2. Prediction: The Transformer’s job is to predict \(\hat{x}_0\) (the “clean” text).
    • \[\hat{p}_0 = \text{softmax}(\text{Transformer}(x_t, t))\]
    • \[\hat{x}_0 = \text{one-hot}(\text{argmax}(\hat{p}_0))\]
  3. Loss: The model \(p_\theta\) is formed by plugging this \(\hat{x}_0\) prediction into the Bayes formula derived above.

\begin{equation} p_\theta(x_{t-1} \vert x_t) \approx q(x_{t-1} \vert x_t, \hat{x}_0) \end{equation}

The training loss is then a Cross-Entropy loss (or KL divergence) that forces this \(p_\theta\) distribution to match the true \(q(x_{t-1} \vert x_t, x_0)\) distribution.
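Putting the pieces together, a per-token loss could look like the following sketch (my own; `transformer` is a placeholder for any network mapping \((x_t, t)\) to \(K\) logits, and `posterior` is the same helper sketched earlier):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def posterior(x_t, x_0, Q_t, Q_bar_prev):
    p = (x_t @ Q_t.T) * (x_0 @ Q_bar_prev)
    return p / p.sum()

def kl(q, p, eps=1e-12):
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def loss_one_token(x_t, x_0, t, Q_t, Q_bar_prev, transformer):
    # 1. The network predicts a distribution over the clean token x_0.
    p0_hat = softmax(transformer(x_t, t))
    x0_hat = np.eye(len(p0_hat))[np.argmax(p0_hat)]
    # 2. True and model posteriors over x_{t-1}, via the Bayes formula above.
    q_true = posterior(x_t, x_0, Q_t, Q_bar_prev)
    p_theta = posterior(x_t, x0_hat, Q_t, Q_bar_prev)
    # 3. Force p_theta to match the true posterior.
    return kl(q_true, p_theta)
```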

In simple terms: the Transformer (like BERT) is trained to look at a corrupted sentence (e.g., “The [MASK] brown fox”) and predict the original (“The quick brown fox”). This prediction is then used to calculate the “denoising” probability for the previous timestep.