DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

1Boston University   2Amazon
*Work done during internship at Amazon.

Joint last authors.
Figure 1

DDiT dynamically selects the optimal patch size at each denoising step at inference, yielding significant computational gains with no loss of perceptual quality. Results are shown for FLUX-1.Dev for text-to-image and Wan-2.1 for text-to-video generation. The top panel denotes the baseline (original model), while the remaining panels illustrate outputs from DDiT at different acceleration rates. ImageReward, CLIP, and VBench scores are reported (higher is better).

Abstract

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps require only coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation, substantially reducing cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52x and 3.2x speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising generation quality or prompt adherence.

Main idea: dynamic tokenization during denoising.

Current methods use the same patch size for all denoising steps during inference. Instead, DDiT adapts the patch size at each timestep according to the latent complexity, allocating fewer tokens to timesteps that model coarse structure and more tokens to those that refine fine detail. Although DiTs divide VAE latents into patches, we illustrate with a real image in pixel space for clarity.

Figure 2

Dynamic Patching and Tokenization

Revised patch-embedding layer to support patches of varied resolutions. We modify the standard patch-embedding layer, designed for a fixed patch size p, to additionally support new patch sizes p_new.

Figure 3

Dynamic Patch Scheduling

We propose a test-time Dynamic Patch Scheduler that automatically determines the optimal patch size at each timestep, adapting the computational load based on generation complexity and the input prompt. We measure the rate of change of the latent manifold over time. We hypothesize that this rate correlates with the level of detail being generated. If the underlying latent evolves slowly within a short timestep window, we posit that coarse-grained details are being generated. Consequently, we divide the latent into coarser patches and process them, saving computational resources. Conversely, if the underlying latent evolves rapidly, we infer that fine-grained details are being generated and fall back to using finer-grained latent patches.
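The decision rule above can be sketched as a simple threshold on the latent's relative rate of change between consecutive steps. The function name, the norm-based metric, and the threshold `tau` below are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def select_patch_size(x_prev: torch.Tensor, x_curr: torch.Tensor,
                      coarse_p: int = 4, fine_p: int = 2,
                      tau: float = 0.05) -> int:
    """Pick a patch size from how fast the latent is evolving.
    Slow evolution -> coarse patches (fewer tokens, cheaper step);
    fast evolution -> fine patches (more tokens, finer detail).
    (Illustrative sketch; metric and threshold are assumptions.)"""
    rate = (x_curr - x_prev).norm() / x_prev.norm().clamp_min(1e-8)
    return coarse_p if rate.item() < tau else fine_p

torch.manual_seed(0)
x_prev = torch.randn(1, 4, 32, 32)
x_curr = x_prev + 0.01 * torch.randn_like(x_prev)  # latent barely changed
p = select_patch_size(x_prev, x_curr)              # -> coarse patches (4)
```

In a full denoising loop, this check would run once per step, feeding the chosen patch size to the dynamic patch-embedding layer before the transformer forward pass.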

Results

Qualitative comparisons with the base model, TeaCache, TaylorSeer, and DDiT under similar speedups on DrawBench. DDiT effectively preserves fine-grained details, pose, spatial layout, and overall color distribution of the generated images.

Zoom-in qualitative comparison

Qualitative comparisons on PartiPrompts. The number next to the method name indicates the amount of computational speedup. Notice how for the same speedup, e.g., 2.2x, TeaCache loses the fine-grained texture in the prompt "A roast turkey." Similar observations hold true for TaylorSeer, where the overall color distribution and the background are not preserved in the prompts "A pumpkin" and "a dolphin."

Additional qualitative comparison

Qualitative comparison of text-to-video generation between DDiT and the baseline. DDiT produces videos with comparable visual quality to the baseline while achieving significant speedup.

BibTeX

@article{DDiT2026,
  title={DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers},
  author={Dahye Kim and Deepti Ghadiyaram and Raghudeep Gadde},
  journal={arXiv preprint},
  year={2026},
}