Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency largely stems from the fixed tokenization process, which uses constant-sized patches throughout the entire denoising process, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps require only coarser patches to model global structure, while later timesteps demand finer (smaller) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation, substantially reducing cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52x and 3.2x speedups on FLUX-1.Dev and Wan 2.1, respectively, without compromising generation quality or prompt adherence.
Dynamic Patching and Tokenization
Revised patch-embedding layer to support patches of varied resolutions. We modify the standard patch-embedding layer, designed for a fixed patch size p, to additionally support new patch sizes p_new.
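A minimal sketch of what such a layer could look like, assuming the common design where patch embedding is a strided convolution and the base kernel is resized on the fly for a requested patch size (FlexiViT-style weight interpolation). The class name, resizing strategy, and hyperparameters below are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlexiblePatchEmbed(nn.Module):
    """Patch embedding with weights stored for a base patch size p,
    usable at other patch sizes p_new by resizing the projection kernel."""

    def __init__(self, in_channels: int = 4, embed_dim: int = 1024, base_patch_size: int = 2):
        super().__init__()
        self.base_patch_size = base_patch_size
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=base_patch_size, stride=base_patch_size)

    def forward(self, latent: torch.Tensor, patch_size: int) -> torch.Tensor:
        weight, bias = self.proj.weight, self.proj.bias
        if patch_size != self.base_patch_size:
            # Resize the p x p projection kernel to p_new x p_new (assumed strategy).
            weight = F.interpolate(weight, size=(patch_size, patch_size),
                                   mode="bilinear", align_corners=False)
        # Larger patches -> fewer tokens -> cheaper attention.
        x = F.conv2d(latent, weight, bias, stride=patch_size)  # (B, D, H/p_new, W/p_new)
        return x.flatten(2).transpose(1, 2)                    # (B, N, D) token sequence
```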
Dynamic Patch Scheduling
We propose a test-time Dynamic Patch Scheduler that automatically determines the optimal patch size at each timestep, adapting the computational load based on generation complexity and the input prompt. We measure the rate of change of the latent manifold over time. We hypothesize that this rate correlates with the level of detail being generated. If the underlying latent evolves slowly within a short timestep window, we posit that coarse-grained details are being generated. Consequently, we divide the latent into coarser patches and process them, saving computational resources. Conversely, if the underlying latent evolves rapidly, we infer that fine-grained details are being generated and fall back to using finer-grained latent patches.
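A minimal sketch of this idea, assuming the rate of change is estimated as the relative difference between latents at consecutive timesteps and compared against a threshold. The metric, threshold, and the two candidate patch sizes are placeholders, not the paper's exact rule.

```python
import torch


class DynamicPatchScheduler:
    """Pick a coarser patch size when the latent evolves slowly, a finer one otherwise."""

    def __init__(self, fine_patch: int = 2, coarse_patch: int = 4, threshold: float = 0.05):
        self.fine_patch = fine_patch
        self.coarse_patch = coarse_patch
        self.threshold = threshold
        self._prev_latent = None

    def select_patch_size(self, latent: torch.Tensor) -> int:
        if self._prev_latent is None:
            self._prev_latent = latent.detach()
            return self.coarse_patch  # early steps model global structure
        # Relative change of the latent between consecutive denoising steps.
        rate = (latent - self._prev_latent).abs().mean() / (self._prev_latent.abs().mean() + 1e-8)
        self._prev_latent = latent.detach()
        return self.coarse_patch if rate < self.threshold else self.fine_patch
```

In a denoising loop, `select_patch_size` would be queried at each timestep and the chosen patch size passed to a patch-embedding layer like the one sketched above.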
Results
Qualitative comparisons with the base model, TeaCache, TaylorSeer, and DDiT under similar speedups on DrawBench. DDiT effectively preserves fine-grained details, pose, spatial layout, and overall color distribution of the generated images.
Qualitative comparisons on PartiPrompts. The number next to the method name indicates the computational speedup. Notice how, at the same speedup, e.g., 2.2x, TeaCache loses the fine-grained texture for the prompt "A roast turkey." Similar observations hold for TaylorSeer, where the overall color distribution and the background are not preserved for the prompts "A pumpkin" and "a dolphin."
Qualitative comparison of text-to-video generation between DDiT and the baseline. DDiT produces videos with comparable visual quality to the baseline while achieving significant speedup.
BibTeX
@article{DDiT2026,
  title={DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers},
  author={Dahye Kim and Deepti Ghadiyaram and Raghudeep Gadde},
  journal={arXiv preprint},
  year={2026},
}