In this series we build a Diffusion Transformer (DiT) from scratch, covering every building block along the way. The tutorial is split into five parts that progressively assemble the full generative pipeline.
Table of Contents
Transformer Architecture — Multi-head self-attention, feed-forward blocks, layer normalization, and residual connections. The foundation that everything else builds on.
Autoencoder Architecture — Convolutional encoder and decoder that compress pixel-space images into a compact latent representation and reconstruct them. This separates “what the image looks like” from “how to generate it”.
The Diffusion Paradigm — Forward noising process, noise schedules (linear and cosine), the reparameterization trick, and DDPM reverse sampling. How we turn the generation problem into a denoising problem.
ViT: Vision Transformer — Patch embedding, positional encoding, and the full Vision Transformer. Adapting the transformer architecture to operate on images by treating them as sequences of patches.
DiT: Diffusion Transformer (current entry) — Adaptive Layer Norm (adaLN), timestep and label conditioning, DiT blocks, and the complete latent-space denoising network. Putting it all together to generate FashionMNIST images.
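As a preview of where the series ends up, the adaLN conditioning at the heart of the DiT block can be sketched in a few lines of NumPy. Everything here is illustrative, not the tutorial's actual implementation: the dimensions are toy-sized and the single linear layer stands in for the real modulation network. The key idea is that the layer norm's shift and scale are not learned constants but are predicted from the timestep-and-label conditioning embedding:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize over the feature dimension with no learned affine parameters:
    # adaLN supplies the shift and scale from the conditioning signal instead.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
D = 8                                  # toy feature dimension
x = rng.normal(size=(2, 4, D))         # (batch, tokens, features)
c = rng.normal(size=(2, D))            # conditioning embedding (timestep + label)

# Hypothetical modulation network: one linear layer mapping c to (shift, scale).
# Near-zero init keeps the modulation close to identity at the start of training.
W = rng.normal(size=(D, 2 * D)) * 0.02
b = np.zeros(2 * D)
shift, scale = np.split(c @ W + b, 2, axis=-1)

# adaLN: normalize the tokens, then modulate per-sample with the conditioning
out = layernorm(x) * (1 + scale[:, None, :]) + shift[:, None, :]
print(out.shape)  # (2, 4, 8)
```

Each sample in the batch gets its own shift and scale, broadcast across its tokens, which is how one shared network denoises differently at different timesteps and for different class labels.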