This is the fifth and final entry in our series building a Diffusion Transformer (DiT) from scratch. The prior posts assembled every piece we need:
- Transformer Architecture — multi-head self-attention, feed-forward blocks, layer norm, residual connections
- Autoencoder Architecture — compressing images into a compact latent space
- Class-Conditional VAE — adaLN conditioning so a decoder knows what to generate
- The Diffusion Paradigm — forward noising, noise schedules, DDPM reverse sampling
In this post we close the loop: we take the DDPM training loop from Part 4 and swap the U-Net denoiser for a full Diffusion Transformer (Peebles and Xie 2023). The result is a class-conditional generative model on FashionMNIST that produces noticeably sharper samples and scales more cleanly.
1 Why DiT?
1.1 The Limits of U-Net for Diffusion
TODO
1.2 Transformers Scale Better
TODO
1.3 DiT in Context
TODO
2 From Images to Tokens: Patch Embedding
2.1 Patchifying an Image
TODO
2.2 Positional Encoding for Images
TODO
3 Conditioning: Timestep & Class Label
3.1 Recap — adaLN from Part 3
TODO
3.2 Timestep Embedding
TODO
3.3 Label Embedding
TODO
3.4 Fusing Both Signals
TODO
4 The DiT Block
4.1 adaLN-Zero Explained
TODO
4.2 Block Architecture
TODO
4.3 Stacking Blocks
TODO
5 The Full DiT Architecture
5.1 End-to-End Forward Pass
TODO
5.2 Final Layer & Unpatchify
TODO
5.3 Code: Complete DiT Module
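As a placeholder for this section, here is a minimal sketch of how the pieces named in Sections 2 through 5 could fit together in PyTorch. It is not the final implementation: the names `DiT`, `DiTBlock`, `timestep_embedding`, and `modulate`, every hyperparameter default, the learned positional embedding, and the choice to run directly on 28x28 pixels rather than on latents are all assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn


def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps: (B,) -> (B, dim). Assumes even dim."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs.to(t.device)[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


def modulate(x, shift, scale):
    """adaLN modulation. x: (B, N, D); shift and scale: (B, D)."""
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero conditioning (hypothetical sizes)."""

    def __init__(self, hidden_dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim, elementwise_affine=False, eps=1e-6)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim, elementwise_affine=False, eps=1e-6)
        mlp_dim = int(hidden_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, hidden_dim)
        )
        # Regresses shift/scale/gate for both sub-layers from the conditioning vector.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 6 * hidden_dim))
        # The "Zero" in adaLN-Zero: gates start at 0, so each block starts as identity.
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x, c):
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), shift1, scale1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = modulate(self.norm2(x), shift2, scale2)
        return x + gate2.unsqueeze(1) * self.mlp(h)


class DiT(nn.Module):
    def __init__(self, img_size=28, patch_size=4, in_channels=1,
                 hidden_dim=128, depth=4, num_heads=4, num_classes=10):
        super().__init__()
        self.patch_size, self.in_channels, self.hidden_dim = patch_size, in_channels, hidden_dim
        self.grid = img_size // patch_size
        # A strided convolution patchifies and linearly embeds in one step.
        self.patch_embed = nn.Conv2d(in_channels, hidden_dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid ** 2, hidden_dim))
        nn.init.normal_(self.pos_embed, std=0.02)
        self.t_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.y_embed = nn.Embedding(num_classes, hidden_dim)
        self.blocks = nn.ModuleList(
            [DiTBlock(hidden_dim, num_heads) for _ in range(depth)]
        )
        self.final_norm = nn.LayerNorm(hidden_dim, elementwise_affine=False, eps=1e-6)
        self.final_ada = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 2 * hidden_dim))
        self.head = nn.Linear(hidden_dim, patch_size * patch_size * in_channels)
        for layer in (self.final_ada[1], self.head):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def unpatchify(self, tokens):
        """(B, grid*grid, p*p*C) -> (B, C, H, W)."""
        b, g, p, c = tokens.shape[0], self.grid, self.patch_size, self.in_channels
        x = tokens.reshape(b, g, g, p, p, c)
        x = x.permute(0, 5, 1, 3, 2, 4)  # (B, C, grid_h, p_h, grid_w, p_w)
        return x.reshape(b, c, g * p, g * p)

    def forward(self, x, t, y):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        # Fuse timestep and class label into one conditioning vector by summation.
        c = self.t_mlp(timestep_embedding(t, self.hidden_dim)) + self.y_embed(y)
        for block in self.blocks:
            tokens = block(tokens, c)
        shift, scale = self.final_ada(c).chunk(2, dim=-1)
        tokens = self.head(modulate(self.final_norm(tokens), shift, scale))
        return self.unpatchify(tokens)
```

Calling `DiT()(x, t, y)` with `x` of shape `(B, 1, 28, 28)`, integer timesteps `t` of shape `(B,)`, and integer labels `y` of shape `(B,)` returns a tensor with the same shape as `x`. Because the output head is zero-initialized in this sketch, the untrained model predicts all zeros.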
6 Training on FashionMNIST
6.1 Setup & Configuration
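One way to keep the setup in a single place is a small configuration object. The sketch below is only an illustration: the class name `DiTConfig`, its field names, and all model and optimisation values are hypothetical defaults, not the settings used for the results in this post. Only the data fields reflect FashionMNIST itself (28x28 grayscale images, 10 classes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTConfig:
    # Data: FashionMNIST is 28x28, single channel, 10 classes.
    img_size: int = 28
    in_channels: int = 1
    num_classes: int = 10
    # Model: illustrative sizes only (assumption).
    patch_size: int = 4
    hidden_dim: int = 128
    depth: int = 4
    num_heads: int = 4
    # Diffusion and optimisation: illustrative values only (assumption).
    num_timesteps: int = 1000
    batch_size: int = 128
    learning_rate: float = 1e-4
    epochs: int = 20

    def __post_init__(self):
        # Fail early on shapes that cannot be patchified or split across heads.
        if self.img_size % self.patch_size != 0:
            raise ValueError("img_size must be divisible by patch_size")
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")

    @property
    def num_tokens(self) -> int:
        """Sequence length seen by the transformer: one token per patch."""
        return (self.img_size // self.patch_size) ** 2


cfg = DiTConfig()
print(cfg.num_tokens)  # 49 tokens for a 28x28 image with 4x4 patches
```

The validation in `__post_init__` catches the two most common shape mistakes before any tensor is allocated: a patch size that does not tile the image, and a hidden width that does not split evenly across attention heads.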
6.2 Reusing the DDPM Training Loop
TODO
6.3 Training Dynamics
TODO
7 DDPM Sampling with DiT
7.1 Same Algorithm, Different Denoiser
TODO
7.2 Class-Conditional Generation
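Class-conditional generation amounts to running DDPM reverse sampling while passing the desired label to the denoiser at every step. The sketch below assumes a denoiser with the call signature `model(x_t, t, y)` that predicts the added noise, uses the variance choice sigma_t^2 = beta_t, and runs on CPU. The function name `sample_ddpm`, the stand-in `dummy_denoiser`, and the short 50-step schedule are hypothetical and exist only so the example runs on its own.

```python
import torch


@torch.no_grad()
def sample_ddpm(model, labels, betas, shape=(1, 28, 28), generator=None):
    """DDPM ancestral sampling; `model(x_t, t, y)` is assumed to predict the noise."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    batch = labels.shape[0]
    x = torch.randn(batch, *shape, generator=generator)  # start from pure noise
    for step in reversed(range(betas.shape[0])):
        t = torch.full((batch,), step, dtype=torch.long)
        eps = model(x, t, labels)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        coef = betas[step] / torch.sqrt(1.0 - alpha_bars[step])
        mean = (x - coef * eps) / torch.sqrt(alphas[step])
        if step > 0:
            noise = torch.randn(x.shape, generator=generator)
            x = mean + torch.sqrt(betas[step]) * noise
        else:
            x = mean  # no noise is added on the final step
    return x


def dummy_denoiser(x, t, y):
    """Stand-in for a trained model so the sketch is runnable (assumption)."""
    return torch.zeros_like(x)


betas = torch.linspace(1e-4, 0.02, 50)  # shortened linear schedule for the demo
labels = torch.arange(10)               # one sample per FashionMNIST class
samples = sample_ddpm(dummy_denoiser, labels, betas)
print(samples.shape)  # torch.Size([10, 1, 28, 28])
```

With a trained denoiser in place of `dummy_denoiser`, `labels = torch.arange(10)` requests one image for each FashionMNIST class (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot), and repeating a label in the tensor requests several samples of that class.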
8 DiT vs U-Net: Head-to-Head
TODO
9 Putting It All Together
TODO
What’s Next?
TODO
Acknowledgements
This tutorial was researched, written, and illustrated by Miguel Chitiva Diaz.
References
Peebles, William, and Saining Xie. 2023. “Scalable Diffusion Models with Transformers.” arXiv Preprint arXiv:2212.09748.