To jointly model text and visual (image or video frame) tokens, the attention mechanism in a
recent paper uses the standard causal mask for text tokens, while patch tokens belonging to the same image attend to each other bidirectionally; patch tokens can also attend causally to preceding text tokens and to visual patch tokens in previous frames. This is a block-attention setup in which each image fills one block. While attention is shared between modalities, each modality has a separate feedforward layer, and the model has distinct final output heads for the two modalities.
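As a sketch, this combined mask can be built from a per-token type id. The function name, the `kinds` convention (-1 for text, a shared non-negative id per image), and the example are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def block_causal_mask(kinds):
    """Attention mask for a mixed text/image token sequence.

    kinds[i] is -1 for a text token, or a non-negative image id shared by
    all patch tokens of the same image. Returns a boolean matrix where
    mask[q, k] = True means query position q may attend to key position k.
    Text tokens attend causally; patch tokens of one image additionally
    attend to each other bidirectionally.
    """
    kinds = np.asarray(kinds)
    n = len(kinds)
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    causal = k <= q                       # standard causal mask
    same_image = (kinds[:, None] >= 0) & (kinds[:, None] == kinds[None, :])
    return causal | same_image            # bidirectional within each image block

# Example: two text tokens, a 3-patch image (id 0), one more text token.
mask = block_causal_mask([-1, -1, 0, 0, 0, -1])
assert mask[2, 4]          # patch sees a later patch of the same image
assert not mask[1, 2]      # text token cannot see the future image
assert mask[5, 3]          # later text token sees the earlier image causally
```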
For inference the model interleaves discrete autoregressive generation with continuous flow matching: when the autoregressive model predicts a beginning-of-image token, a sequence of noise tokens (corresponding to a single image) is inserted and denoised by flow matching, after which an end-of-image token is appended.
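The interleaved decoding loop can be sketched as follows. The stub `next_token` and `velocity` functions, the special-token names, and the patch and step counts are all illustrative placeholders, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

BOI, EOI, EOS = "<boi>", "<eoi>", "<eos>"   # special tokens (names assumed)
NUM_PATCHES = 4          # noise tokens inserted per image (illustrative)
FM_STEPS = 8             # Euler integration steps (illustrative)

def next_token(seq):
    """Stub autoregressive step: emit one image, then stop."""
    return BOI if BOI in seq else ...
    # (a real model would sample from the text head; here we hard-code)

def velocity(x, t, seq):
    """Stub velocity field conditioned on the sequence so far."""
    return -x  # drives latents toward zero, purely for illustration

def generate(prompt):
    seq = list(prompt)
    while True:
        tok = EOS if BOI in seq else BOI   # stub decision; see next_token
        if tok == EOS:
            seq.append(EOS)
            return seq
        # Discrete model predicted begin-of-image: insert noise tokens
        # and denoise them with Euler-integrated flow matching.
        seq.append(BOI)
        x = rng.normal(size=(NUM_PATCHES, 2))
        dt = 1.0 / FM_STEPS
        for i in range(FM_STEPS):
            x = x + dt * velocity(x, i * dt, seq)
        seq.append(("image", x))          # denoised latents for this image
        seq.append(EOI)                   # then append end-of-image

out = generate(["a", "cat"])
```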
The text head outputs probabilities over the vocabulary tokens, from which a cross-entropy (CE) loss can be computed; the image or video head outputs a velocity field, from which a flow-matching (FM) loss can be computed. The joint loss takes the form $\mathcal{L} = \lambda_{\mathrm{CE}}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}}$. The text part of the loss on a sequence $y$ is the usual next-token cross-entropy $\mathcal{L}_{\mathrm{CE}} = -\sum_j \log p_\theta(y_j \mid y_{<j})$. For the visual part, in standard flow-matching notation: given a clean image latent $x_1$, a noise sample $x_0 \sim \mathcal{N}(0, I)$, and a time sample $t \sim \mathcal{U}[0, 1]$, let $x_t = t\,x_1 + (1 - t)\,x_0$; the model predicts the velocity field $v_\theta(x_t, t)$ and computes $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}\big[\lVert v_\theta(x_t, t) - (x_1 - x_0)\rVert^2\big]$.
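In this standard formulation, a single FM training example can be sketched as below; the `predict_velocity` callable stands in for the model's visual head:

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_loss(x1, predict_velocity):
    """Flow-matching loss for one clean latent x1 (standard formulation):
    sample noise x0 ~ N(0, I) and time t ~ U[0, 1], build the linear
    interpolant x_t = t*x1 + (1-t)*x0, and regress the predicted velocity
    v(x_t, t) onto the target velocity x1 - x0."""
    x0 = rng.normal(size=x1.shape)
    t = rng.uniform()
    xt = t * x1 + (1 - t) * x0
    v = predict_velocity(xt, t)
    return np.mean((v - (x1 - x0)) ** 2)

x1 = rng.normal(size=(16,))
# A trivial stand-in model that always predicts zero velocity.
loss = fm_loss(x1, lambda xt, t: np.zeros_like(xt))
```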
An ablation on the loss comes from the observation that the CE loss may at times become much larger than the FM loss, or vice versa, which hinders learning of the smaller-loss modality. To address this, the scalars $\lambda_{\mathrm{CE}}$ and $\lambda_{\mathrm{FM}}$ are not fixed. Specifically, in one consistent reading of the paper's scheme: for a coefficient $\kappa$ that calibrates the relative importance of the visual and textual losses, and the CE and FM losses $\mathcal{L}_{\mathrm{CE}}^{(i)}$ and $\mathcal{L}_{\mathrm{FM}}^{(i)}$ at the current iteration $i$, one computes the weighted center $c_i = \kappa\,\mathcal{L}_{\mathrm{CE}}^{(i)} + (1 - \kappa)\,\mathcal{L}_{\mathrm{FM}}^{(i)}$. One then keeps an exponential moving average of $c_i$, namely $\bar{c}_i = \beta\,\bar{c}_{i-1} + (1 - \beta)\,c_i$ for a decay $\beta \in (0, 1)$, and the weights in the joint loss are $\lambda_{\mathrm{CE}} = \kappa\,\bar{c}_i / \mathcal{L}_{\mathrm{CE}}^{(i)}$ and $\lambda_{\mathrm{FM}} = (1 - \kappa)\,\bar{c}_i / \mathcal{L}_{\mathrm{FM}}^{(i)}$, so that $\lambda_{\mathrm{CE}}\,\mathcal{L}_{\mathrm{CE}}^{(i)} + \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}}^{(i)} = \bar{c}_i$ (the constant factor $\bar{c}_i$ only rescales the gradient and can be omitted, even though the paper does not explicitly mention this).
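The adaptive weighting described above can be sketched as a small stateful helper; the symbols `kappa` and `beta`, their defaults, and the class interface are assumptions for illustration, not taken from the paper:

```python
class LossBalancer:
    """EMA-based reweighting of the CE and FM losses.

    kappa trades off the two modalities; beta is the EMA decay.
    """
    def __init__(self, kappa=0.5, beta=0.99):
        self.kappa, self.beta = kappa, beta
        self.center = None  # EMA of the weighted center

    def weights(self, ce, fm):
        c = self.kappa * ce + (1 - self.kappa) * fm   # weighted center
        self.center = c if self.center is None else \
            self.beta * self.center + (1 - self.beta) * c
        # Rescale each loss so the weighted joint loss equals the EMA
        # center; that common factor only scales the gradient and could
        # be dropped.
        return (self.kappa * self.center / ce,
                (1 - self.kappa) * self.center / fm)

bal = LossBalancer()
w_ce, w_fm = bal.weights(ce=10.0, fm=0.1)
joint = w_ce * 10.0 + w_fm * 0.1   # equals the EMA center by construction
```

Note how the much smaller FM loss receives the much larger weight, so neither modality's gradient signal is drowned out.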
The training setup processes a fixed number of tokens per step (on the order of millions) with the AdamW optimizer. The learning-rate schedule consists of a warmup that linearly increases the LR from zero to its peak over a fixed number of steps, followed by a cosine schedule between the maximum and minimum LR. The total number of steps is determined by the total FLOPs budget divided by the FLOPs per step. The distributed setup is specified by the number of GPUs, the number of batches per GPU, and the number of tokens per batch; their product gives the tokens per step.
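A generic warmup-plus-cosine schedule of this shape can be written as follows; the step counts and LR values in the example are placeholders, since the paper's exact numbers are not reproduced here:

```python
import math

def lr_at(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Linear warmup from zero to lr_max over warmup_steps, then cosine
    decay from lr_max down to lr_min over the remaining steps."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# total_steps would be derived as total_FLOPs_budget // FLOPs_per_step.
peak = lr_at(1000, warmup_steps=1000, total_steps=10000, lr_max=3e-4)
```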
The visual denoiser uses an Euler sampler with a fixed number of steps, classifier-free guidance, and conditioning dropout during training. Ablation studies consider different types of visual encoders: variational autoencoders and representation encoders. Pretrained encoders are used, with weights frozen during multimodal pretraining. The encoders produce a fixed number of tokens per image (or video frame). The transformer has a given number of layers and model dimension; attention uses GQA, with more query heads than key-value heads, and the feedforward layer has a fixed expansion factor. MoE ablations consider from 32 to 1008 experts, each with its own hidden dimension.
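An Euler sampler with classifier-free guidance (made possible at inference by the conditioning dropout used in training) can be sketched as follows; the toy velocity fields, step count, and guidance scale are illustrative, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def euler_cfg_sample(v_cond, v_uncond, shape, steps=50, guidance=3.0):
    """Euler-integrate the learned velocity field from t=0 (noise) to
    t=1 (data), combining conditional and unconditional predictions via
    classifier-free guidance:
        v = v_uncond + guidance * (v_cond - v_uncond)."""
    x = rng.normal(size=shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = v_uncond(x, t) + guidance * (v_cond(x, t) - v_uncond(x, t))
        x = x + dt * v
    return x

# Toy velocity fields pulling toward a target latent (illustrative only).
target = np.ones(4)
x = euler_cfg_sample(lambda x, t: target - x,   # "conditional" field
                     lambda x, t: -x,           # "unconditional" field
                     shape=(4,), steps=100, guidance=1.0)
```

With `guidance=1.0` the unconditional term cancels and the sampler follows the conditional field alone; larger scales extrapolate away from the unconditional prediction.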