Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38–2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Σ), achieving a 1.43× speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2× and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring.
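To make the grafting idea concrete, below is a minimal PyTorch sketch of editing one block. This is only an illustrative stand-in, not the paper's implementation: `ToyDiTBlock`, `WindowedAttention`, and `graft_attention` are hypothetical names, and the sketch shows just the first step (regressing a local-attention operator onto the pretrained softmax operator's activations before swapping it in); light fine-tuning of the edited model would follow.

```python
# Illustrative sketch only: ToyDiTBlock, WindowedAttention, and graft_attention are
# hypothetical stand-ins, not the paper's code. The idea shown: fit a cheaper
# operator (local attention) to the pretrained operator's activations, then swap
# it into the block; light fine-tuning would follow in practice.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WindowedAttention(nn.Module):
    """Softmax attention; window=None is global, otherwise each token only
    attends to neighbors within +/- `window` positions (local attention)."""

    def __init__(self, dim: int, num_heads: int, window: int | None = None):
        super().__init__()
        self.num_heads, self.window = num_heads, window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        H = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, H, D // H).transpose(1, 2) for t in (q, k, v))
        mask = None
        if self.window is not None:  # band mask restricts attention to a local window
            idx = torch.arange(N, device=x.device)
            mask = (idx[None, :] - idx[:, None]).abs() <= self.window
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))


class ToyDiTBlock(nn.Module):
    """Stand-in for one pretrained DiT block (attention + MLP with residuals)."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = WindowedAttention(dim, heads)  # "pretrained" softmax attention
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


def graft_attention(block: ToyDiTBlock, new_attn: nn.Module,
                    calib_x: torch.Tensor, steps: int = 200) -> None:
    """Fit the new operator to the old operator's outputs, then swap it in."""
    with torch.no_grad():
        target = block.attn(calib_x)  # activations of the pretrained operator
    opt = torch.optim.AdamW(new_attn.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = F.mse_loss(new_attn(calib_x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    block.attn = new_attn  # materialize the hybrid block


if __name__ == "__main__":
    torch.manual_seed(0)
    block = ToyDiTBlock()                      # pretend this is a pretrained block
    calib_x = torch.randn(8, 64, 384)          # calibration activations
    graft_attention(block, WindowedAttention(384, 6, window=8), calib_x)
    print(block(calib_x).shape)                # torch.Size([8, 64, 384])
```

The paper studies a wider set of such replacements (gated convolution, local and linear attention, MLP variants) and a depth-halving restructuring on DiT-XL/2 and PixArt-Σ; its initialization and fine-tuning recipe may differ from this sketch.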
@article{chandrasegaran2025grafting,
  title={Exploring Diffusion Transformer Designs via Grafting},
  author={Chandrasegaran, Keshigeyan and Poli, Michael and Fu, Daniel Y. and Kim, Dongjun and
          Hadzic, Lea M. and Li, Manling and Gupta, Agrim and Massaroli, Stefano and
          Mirhoseini, Azalia and Niebles, Juan Carlos and Ermon, Stefano and Li, Fei-Fei},
  journal={arXiv preprint arXiv:2506.05340},
  year={2025},
  url={https://arxiv.org/abs/2506.05340},
}
We thank Liquid AI for sponsoring compute for this project. We also thank Armin W. Thomas, Garyk Brixi, Kyle Sargent, Karthik Dharmarajan, Stephen Tian, Cristobal Eyzaguirre, and Aryaman Arora for their feedback on the manuscript.