Diffusion Transformer (DiT)

Building a Diffusion Transformer from Scratch for Image Generation

Tags: diffusion, transformer, deep-learning, generative-models

Author: Miguel Chitiva Diaz

Published: February 24, 2026


This is the fifth and final entry in our series building a Diffusion Transformer (DiT) from scratch. The prior posts assembled every piece we need:

  1. Transformer Architecture — multi-head self-attention, feed-forward blocks, layer norm, residual connections
  2. Autoencoder Architecture — compressing images into a compact latent space
  3. Class-Conditional VAE — adaLN conditioning so a decoder knows what to generate
  4. The Diffusion Paradigm — forward noising, noise schedules, DDPM reverse sampling

In this post we close the loop: we take the DDPM training loop from Part 4 and swap the U-Net denoiser for a full Diffusion Transformer (Peebles and Xie 2023). The result is a class-conditional generative model on FashionMNIST that produces noticeably sharper samples than the U-Net baseline and scales more cleanly with model size.

1 Why DiT?

1.1 The Limits of U-Net for Diffusion

TODO

1.2 Transformers Scale Better

TODO

1.3 DiT in Context

TODO

2 From Images to Tokens: Patch Embedding

2.1 Patchifying an Image

TODO

2.2 Positional Encoding for Images

TODO

# TODO: PatchEmbed module
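Until the module above is filled in, here is a minimal sketch of what a patch embedding can look like. It assumes PyTorch, pixel-space FashionMNIST inputs (1×28×28), and illustrative hyperparameters (`patch_size=4`, `embed_dim=64`) that are not taken from this series. It also uses a learned positional embedding for brevity, whereas the DiT paper uses fixed sine-cosine embeddings.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each one to a token."""

    def __init__(self, img_size=28, patch_size=4, in_chans=1, embed_dim=64):
        super().__init__()
        assert img_size % patch_size == 0, "image size must be divisible by patch size"
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to flattening
        # each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Assumption: learned positional embedding (the DiT paper uses fixed sin-cos).
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):
        x = self.proj(x)                   # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)   # (B, N, D) with N = (H/p) * (W/p)
        return x + self.pos_embed


patch_embed = PatchEmbed()
tokens = patch_embed(torch.randn(8, 1, 28, 28))
print(tokens.shape)  # torch.Size([8, 49, 64])
```

With a 28×28 image and 4×4 patches, the grid is 7×7, so each image becomes a sequence of 49 tokens.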

3 Conditioning: Timestep & Class Label

3.1 Recap — adaLN from Part 3

TODO

3.2 Timestep Embedding

TODO

3.3 Label Embedding

TODO

3.4 Fusing Both Signals

TODO

# TODO: TimestepEmbedder, LabelEmbedder, fusion
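As a placeholder for the module above, the sketch below shows one common way to build the two embedders and fuse them. It assumes PyTorch; the class names mirror the TODO, but the layer sizes (`freq_dim=128`, `hidden_dim=64`) are illustrative choices, and fusion by simple addition follows the DiT paper rather than anything confirmed for this series. Label dropout for classifier-free guidance is left out.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps (B,) to sinusoidal features (B, dim); dim must be even."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32, device=t.device) / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class TimestepEmbedder(nn.Module):
    """Sinusoidal features followed by a small MLP."""

    def __init__(self, hidden_dim, freq_dim=128):
        super().__init__()
        self.freq_dim = freq_dim
        self.mlp = nn.Sequential(
            nn.Linear(freq_dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, t):
        return self.mlp(sinusoidal_embedding(t, self.freq_dim))


class LabelEmbedder(nn.Module):
    """Learned lookup table with one vector per class."""

    def __init__(self, num_classes, hidden_dim):
        super().__init__()
        self.table = nn.Embedding(num_classes, hidden_dim)

    def forward(self, y):
        return self.table(y)


hidden_dim = 64
t_embedder = TimestepEmbedder(hidden_dim)
y_embedder = LabelEmbedder(num_classes=10, hidden_dim=hidden_dim)

t = torch.randint(0, 1000, (8,))   # one timestep per sample
y = torch.randint(0, 10, (8,))     # one FashionMNIST class id per sample
c = t_embedder(t) + y_embedder(y)  # fuse by addition into one conditioning vector
print(c.shape)  # torch.Size([8, 64])
```

The fused vector `c` has one row per sample and is what each adaLN layer would consume.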

4 The DiT Block

4.1 adaLN-Zero Explained

TODO

4.2 Block Architecture

TODO

4.3 Stacking Blocks

TODO

# TODO: DiTBlock module
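Pending the full write-up, here is a minimal sketch of an adaLN-Zero block in the spirit of Peebles and Xie (2023). It assumes PyTorch and uses `nn.MultiheadAttention` for brevity; the sizes are illustrative, and the series' own block may differ in details. The conditioning vector `c` is regressed to six modulation tensors (shift, scale, gate for each of the two sub-layers), and the regression layer is zero-initialized so every block starts as the identity.

```python
import torch
import torch.nn as nn


def modulate(x, shift, scale):
    """Apply a per-sample affine transform to every token: x * (1 + scale) + shift."""
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DiTBlock(nn.Module):
    def __init__(self, hidden_dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        # LayerNorm without learned affine parameters: scale and shift come from c.
        self.norm1 = nn.LayerNorm(hidden_dim, elementwise_affine=False, eps=1e-6)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim, elementwise_affine=False, eps=1e-6)
        mlp_dim = int(hidden_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, hidden_dim)
        )
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 6 * hidden_dim))
        # The "Zero" in adaLN-Zero: all gates start at 0, so the block is an identity map.
        nn.init.zeros_(self.adaLN[-1].weight)
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x, c):
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), shift1, scale1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + gate1.unsqueeze(1) * attn_out
        h = modulate(self.norm2(x), shift2, scale2)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x


block = DiTBlock(hidden_dim=64, num_heads=4)
x = torch.randn(8, 49, 64)   # (batch, tokens, hidden_dim)
c = torch.randn(8, 64)       # fused timestep + label conditioning
out = block(x, c)
print(torch.allclose(out, x))  # True at initialization
```

Because the gates are zero at initialization, a freshly constructed stack of these blocks passes tokens through unchanged, which tends to make early training of deep stacks more stable.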

5 The Full DiT Architecture

5.1 End-to-End Forward Pass

TODO

5.2 Final Layer & Unpatchify

TODO

5.3 Code: Complete DiT Module

# TODO: full DiT module
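As a stand-in until the module above is written, the following compact sketch wires patch embedding, conditioning, adaLN-Zero blocks, a final layer, and unpatchify into one model. It assumes PyTorch and pixel-space 28×28 inputs; `TinyDiT`, `Block`, and every hyperparameter are hypothetical names and values chosen for illustration, not the configuration used in this series.

```python
import math
import torch
import torch.nn as nn


def modulate(x, shift, scale):
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class Block(nn.Module):
    """adaLN-Zero transformer block (attention + MLP, both gated by the conditioning)."""

    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x, c):
        s1, k1, g1, s2, k2, g2 = self.ada(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), s1, k1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        return x + g2.unsqueeze(1) * self.mlp(modulate(self.norm2(x), s2, k2))


class TinyDiT(nn.Module):
    def __init__(self, img=28, patch=4, chans=1, dim=64, depth=4, heads=4, classes=10):
        super().__init__()
        self.patch, self.chans, self.grid, self.dim = patch, chans, img // patch, dim
        self.embed = nn.Conv2d(chans, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))  # learned (assumption)
        self.t_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.y_emb = nn.Embedding(classes, dim)
        self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.out = nn.Linear(dim, patch * patch * chans)
        # Zero-init the final modulation and projection so the model predicts 0 at start.
        for layer in (self.ada[-1], self.out):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def t_features(self, t):
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([args.cos(), args.sin()], dim=-1)

    def unpatchify(self, x):
        # (B, N, p*p*C) -> (B, C, H, W): undo the patch grid layout.
        b, g, p, c = x.shape[0], self.grid, self.patch, self.chans
        x = x.reshape(b, g, g, p, p, c).permute(0, 5, 1, 3, 2, 4)
        return x.reshape(b, c, g * p, g * p)

    def forward(self, x, t, y):
        x = self.embed(x).flatten(2).transpose(1, 2) + self.pos   # tokens (B, N, D)
        c = self.t_mlp(self.t_features(t)) + self.y_emb(y)        # conditioning (B, D)
        for blk in self.blocks:
            x = blk(x, c)
        shift, scale = self.ada(c).chunk(2, dim=-1)
        x = self.out(modulate(self.norm(x), shift, scale))        # (B, N, p*p*C)
        return self.unpatchify(x)


model = TinyDiT()
x = torch.randn(2, 1, 28, 28)
eps_hat = model(x, torch.tensor([10, 500]), torch.tensor([3, 7]))
print(eps_hat.shape)  # torch.Size([2, 1, 28, 28])
```

The output has the same shape as the input, which is what the DDPM loss needs when the network predicts the added noise.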

6 Training on FashionMNIST

6.1 Setup & Configuration

Imports and hyperparameters
# TODO: imports, config

6.2 Reusing the DDPM Training Loop

TODO

# TODO: training loop
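Until the loop above is filled in, here is a minimal sketch of a noise-prediction DDPM training step. It assumes PyTorch, a linear beta schedule with `T = 1000`, and a denoiser with the interface `model(x_t, t, y)`; the `PlaceholderDenoiser` below is a hypothetical stand-in used only so the snippet runs on its own, and the random batch replaces real FashionMNIST data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # abar_t


def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise


class PlaceholderDenoiser(nn.Module):
    """Hypothetical stand-in with the (x, t, y) -> eps interface; swap in the DiT here."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.label_bias = nn.Embedding(num_classes, 1)
        self.time_scale = nn.Parameter(torch.zeros(1))

    def forward(self, x, t, y):
        t_term = self.time_scale * (t.float() / T).view(-1, 1, 1, 1)
        return self.conv(x) + self.label_bias(y).view(-1, 1, 1, 1) + t_term


def train_step(model, optimizer, x0, y):
    t = torch.randint(0, T, (x0.shape[0],))   # a random timestep per sample
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    loss = F.mse_loss(model(x_t, t, y), noise)  # predict the noise that was added
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = PlaceholderDenoiser()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x0 = torch.randn(16, 1, 28, 28)   # random stand-in for a normalized image batch
y = torch.randint(0, 10, (16,))
losses = [train_step(model, optimizer, x0, y) for _ in range(5)]
print(len(losses))
```

Nothing in `train_step` depends on the denoiser's architecture, which is why the loop can be reused unchanged when the U-Net is replaced by a transformer.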

6.3 Training Dynamics

TODO

7 DDPM Sampling with DiT

7.1 Same Algorithm, Different Denoiser

TODO

7.2 Class-Conditional Generation

# TODO: sampling loop + 10-class grid
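As a placeholder for the loop above, the sketch below shows DDPM ancestral sampling with one sample per class. It assumes PyTorch, a denoiser with the interface `model(x_t, t, y)` that predicts noise, and the variance choice sigma_t^2 = beta_t. The schedule is shortened to `T = 50` and the denoiser is a dummy that returns zeros, purely so the snippet runs quickly on its own; in practice the schedule should match the one used in training.

```python
import torch

T = 50  # shortened for the demo (assumption); use the training schedule in practice
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)


@torch.no_grad()
def ddpm_sample(model, y, shape=(1, 28, 28)):
    """Start from pure noise and denoise step by step, conditioned on labels y."""
    x = torch.randn(y.shape[0], *shape)
    for i in reversed(range(T)):
        t = torch.full((y.shape[0],), i, dtype=torch.long)
        eps = model(x, t, y)
        # Posterior mean: (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[i] / (1.0 - alphas_cumprod[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            x = mean + betas[i].sqrt() * torch.randn_like(x)
        else:
            x = mean  # no noise is added at the final step
    return x


def dummy_denoiser(x, t, y):
    """Hypothetical stand-in for a trained DiT; always predicts zero noise."""
    return torch.zeros_like(x)


torch.manual_seed(0)
labels = torch.arange(10)                      # one sample for each FashionMNIST class
samples = ddpm_sample(dummy_denoiser, labels)  # (10, 1, 28, 28)
grid = torch.cat(list(samples.squeeze(1)), dim=1)  # classes side by side: (28, 280)
print(samples.shape, grid.shape)
```

Passing `torch.arange(10)` as the labels yields one column per class, so the resulting grid can be shown directly with an image plotting call.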

8 DiT vs U-Net: Head-to-Head

TODO

9 Putting It All Together

TODO

What’s Next?

TODO

Acknowledgements

This tutorial was researched, written, and illustrated by Miguel Chitiva Diaz.

References

Peebles, William, and Saining Xie. 2023. “Scalable Diffusion Models with Transformers.” arXiv preprint arXiv:2212.09748.