In this series we build a Diffusion Transformer (DiT) from scratch, covering every building block along the way. The tutorial is split into five parts that progressively assemble the full generative pipeline.
Table of Contents
Transformer Architecture — Multi-head self-attention, feed-forward blocks, layer normalization, and residual connections. The foundation that everything else builds on.
Autoencoder Architecture — Convolutional encoder and decoder that compress pixel-space images into a compact latent representation and reconstruct them. This separates “what the image looks like” from “how to generate it”.
The Diffusion Paradigm — Forward noising process, noise schedules (linear and cosine), the reparameterization trick, and DDPM reverse sampling. How we turn the generation problem into a denoising problem.
ViT: Vision Transformer — Patch embedding, positional encoding, and the full Vision Transformer. Adapting the transformer architecture to operate on images by treating them as sequences of patches.
DiT: Diffusion Transformer (current entry) — Adaptive Layer Norm (adaLN), timestep and label conditioning, DiT blocks, and the complete latent-space denoising network. Putting it all together to generate FashionMNIST images.
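As a preview of where the series ends up, the adaLN conditioning at the heart of the DiT block can be sketched in a few lines of NumPy. Everything here is illustrative, not the tutorial's actual implementation: the dimensions are toy-sized and the single linear layer stands in for the real modulation network. The key idea is that the layer norm's shift and scale are not learned constants but are predicted from the timestep-and-label conditioning embedding:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize over the feature dimension with no learned affine parameters:
    # adaLN supplies the shift and scale from the conditioning signal instead.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
D = 8                                  # toy feature dimension
x = rng.normal(size=(2, 4, D))         # (batch, tokens, features)
c = rng.normal(size=(2, D))            # conditioning embedding (timestep + label)

# Hypothetical modulation network: one linear layer mapping c to (shift, scale).
# Near-zero init keeps the modulation close to identity at the start of training.
W = rng.normal(size=(D, 2 * D)) * 0.02
b = np.zeros(2 * D)
shift, scale = np.split(c @ W + b, 2, axis=-1)

# adaLN: normalize the tokens, then modulate per-sample with the conditioning
out = layernorm(x) * (1 + scale[:, None, :]) + shift[:, None, :]
print(out.shape)  # (2, 4, 8)
```

Each sample in the batch gets its own shift and scale, broadcast across its tokens, which is how one shared network denoises differently at different timesteps and for different class labels.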