Continuous-Time Distribution Matching
for Few-Step Diffusion Distillation

¹VCIP, College of Computer Science, Nankai University  ·  ²Alibaba Group  ·  ³College of Artificial Intelligence, Jilin University
*Project leader    Co-corresponding authors
CDM teaser - comparison with DMD2 on Longcat-Image at 4 NFE

CDM enables high-fidelity few-step text-to-image generation. We compare our Continuous-Time Distribution Matching (CDM) against DMD2, both distilled from Longcat-Image (1024×1024) and evaluated at 4 NFE. Without relying on any GAN or reward-model auxiliary objectives, CDM produces sharper textures, richer fine-grained details, and overall higher visual fidelity.

Abstract

Step distillation has become a leading technique for accelerating diffusion models, with Distribution Matching Distillation (DMD) and Consistency Distillation as its two representative paradigms. While consistency methods enforce self-consistency along the full probability-flow ODE (PF-ODE) trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation, combined with the mode-seeking nature of the reverse KL divergence, tends to produce visual artifacts and over-smoothed outputs, often necessitating complex auxiliary modules (such as GANs or reward models) to restore visual fidelity.

In this work, we introduce Continuous-Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous-time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous-time alignment objective that performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, demonstrate that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives.

Method Overview

CDM pipeline overview

Overview of Continuous-Time Distribution Matching (CDM). Top: Our approach employs a dynamic continuous time schedule during backward simulation, sampling intermediate anchors uniformly from (0, 1]. Bottom Left: CFG augmentation (CA) and distribution matching (DM) operate on this dynamic schedule to align text-image conditions and data distributions at on-trajectory anchors. Bottom Right: To address inter-anchor inconsistency, the proposed CDM objective explicitly extrapolates off-trajectory latents using the predicted velocity.
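The velocity-driven extrapolation in the bottom-right panel can be illustrated with a minimal sketch. It assumes the rectified-flow convention x_t = (1 - t) · x_0 + t · noise, which the overview does not state explicitly. The function name `extrapolate_latent` and the chosen times are hypothetical, and the ideal velocity stands in for the student's predicted velocity.

```python
import numpy as np

def extrapolate_latent(x_t, v_pred, t, s):
    """Move a latent from time t to time s along a given velocity.

    Assumes x_t = (1 - t) * x_0 + t * noise, so dx/dt = noise - x_0 and a
    single Euler step is exact whenever the velocity is exact.
    """
    return x_t + (s - t) * v_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))      # stand-in for a clean latent
noise = rng.normal(size=(4, 8))   # stand-in for Gaussian noise
t, s = 0.8, 0.55                  # on-trajectory anchor and off-anchor target (arbitrary)

x_t = (1 - t) * x0 + t * noise
v_true = noise - x0               # ideal velocity under the assumed convention
x_s = extrapolate_latent(x_t, v_true, t, s)
```

With an exact velocity the extrapolated latent lands back on the straight path between data and noise. With an imperfect student prediction it drifts off that path, which is the kind of off-trajectory latent the CDM objective is described as supervising.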

Our unified training objective combines three complementary losses:

  • CFG Augmentation Loss (L_CA): anchors structure and semantic text-image alignment;
  • Distribution Matching Loss (L_DM): provides on-trajectory distributional supervision, aligning the student to the teacher's CFG-free distribution;
  • CDM Loss (L_CDM): extends supervision to off-trajectory latents via velocity-driven extrapolation, mitigating numerical truncation errors during few-step inference.
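Written out, the combined objective might take the following form. The weights λ are hypothetical placeholders, since the list above does not say how the three terms are balanced; setting all of them to 1 gives a plain sum.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{CA}}\,\mathcal{L}_{\text{CA}}
  + \lambda_{\text{DM}}\,\mathcal{L}_{\text{DM}}
  + \lambda_{\text{CDM}}\,\mathcal{L}_{\text{CDM}}
```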

Key Insight: Schedule Decoupling

Empirical evidence of schedule decoupling. (a) Conventional distillation strictly anchors backward simulation to predefined discrete inference timesteps. In contrast, our dynamic scheduling optimizes over uniformly sampled continuous timesteps t ∈ (0, 1] at each iteration. (b) Visually, the dynamically scheduled model produces finer details and fewer artifacts than the strictly aligned baseline. (c) Quantitatively, it also attains a higher HPSv3 score, indicating that exact discrete alignment is not only unnecessary but in fact restrictive—motivating our continuous-time formulation.
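The contrast in panel (a) can be sketched in a few lines. This is an illustration rather than the released training code: the length bounds `min_len` and `max_len`, the function names, and the choice to always start at t = 1 are assumptions made for the example.

```python
import random

def fixed_schedule(num_steps):
    """Conventional choice: the same evenly spaced timesteps every iteration."""
    return [1.0 - i / num_steps for i in range(num_steps)]

def dynamic_schedule(rng, min_len=2, max_len=8):
    """Draw a fresh schedule of random length with anchors in (0, 1].

    Starts at t = 1 (assumed to be pure noise) and visits the sampled
    anchors in decreasing order. The length bounds are illustrative.
    """
    length = rng.randint(min_len, max_len)
    # 1 - random() lies in (0, 1] because random() lies in [0, 1).
    anchors = [1.0 - rng.random() for _ in range(length - 1)]
    return [1.0] + sorted(anchors, reverse=True)

rng = random.Random(0)
print(fixed_schedule(4))            # [1.0, 0.75, 0.5, 0.25]
for _ in range(3):
    print(dynamic_schedule(rng))    # length and values change on every draw
```

Because each iteration sees a different set of anchors, the student is never trained on exactly the timesteps it will use at inference, which is the decoupling this section argues for.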

Understanding the DM Loss

Visual evidence on the role of the DM loss. Samples from teacher models (SD3-Medium and Longcat-Image) with and without CFG, compared against student models distilled with the DM loss alone. Students distilled with the DM loss alone closely match their teachers' CFG-free samples, indicating that the DM loss is not a mere stabilizer but the key driver that aligns the student to the teacher's CFG-free distribution.
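For readers less familiar with the distinction, the sketch below shows the standard classifier-free guidance (CFG) combination and how a guidance scale of 1 reduces to the CFG-free prediction. The arrays are random stand-ins for network outputs, and the scale 4.5 is an arbitrary example value rather than a setting from this work.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance applied to a velocity prediction."""
    return v_uncond + scale * (v_cond - v_uncond)

rng = np.random.default_rng(0)
v_cond = rng.normal(size=(2, 4))    # stand-in for the text-conditioned output
v_uncond = rng.normal(size=(2, 4))  # stand-in for the unconditional output

v_free = cfg_velocity(v_cond, v_uncond, 1.0)     # scale 1: no guidance
v_guided = cfg_velocity(v_cond, v_uncond, 4.5)   # 4.5 is an arbitrary example

# The guided output is the CFG-free output plus a guidance offset.
offset = (4.5 - 1.0) * (v_cond - v_uncond)
```

Seen this way, a student that matches the teacher's CFG-free samples has learned the first term only, while the guidance offset must come from elsewhere in the objective.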

Quantitative Results

Comparison of different methods on SD3-Medium and Longcat-Image. At only 4 NFE, CDM achieves the best score among distilled models on every metric except CLIPScore, where it ranks second, without requiring any real images, GANs, or reward models.

SD3-Medium (1024×1024)

Method             NFE  Aesthetic ↑  DPGBench ↑  PickScore ↑  HPSv3 ↑  CLIPScore ↑  Image-Free  Continuous
Base (SD3-Medium)  100  5.885        85.04       21.73        8.189    28.60        -           -
Hyper-SD           8    5.180        80.43       20.82        6.054    27.93
Flash              4    5.968        80.47       21.69        8.282    28.18
TDM                4    6.013        83.12       21.61        8.468    27.63
DMD2               4    6.038        83.96       21.58        8.419    27.56
D-DMD              4    6.038        84.52       21.85        9.176    27.69
CDM (Ours)         4    6.075        85.26       21.95        9.561    27.98

Longcat-Image (1024×1024)

Method                NFE  Aesthetic ↑  DPGBench ↑  PickScore ↑  HPSv3 ↑  CLIPScore ↑  Image-Free  Continuous
Base (Longcat-Image)  100  5.926        87.08       21.65        9.450    26.78        -           -
DMD2                  4    5.800        87.12       21.07        8.803    26.99
D-DMD                 4    5.782        88.04       21.23        9.629    26.57
CDM (Ours)            4    5.919        88.35       21.53        10.65    26.72

Best and second-best results are highlighted in bold and underline, respectively. The base model serves as a reference and is excluded from the ranking.
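Since bold and underline do not survive in plain text, the ranking can be recomputed from the reported numbers. The snippet below is a small sketch using the SD3-Medium scores copied from the table above, with the base model excluded as stated. Letting tied methods share a rank is a choice made here, not something the source specifies.

```python
# Scores copied from the SD3-Medium table above (distilled models only).
METRICS = ["Aesthetic", "DPGBench", "PickScore", "HPSv3", "CLIPScore"]
SD3_MEDIUM = {
    "Hyper-SD": [5.180, 80.43, 20.82, 6.054, 27.93],
    "Flash":    [5.968, 80.47, 21.69, 8.282, 28.18],
    "TDM":      [6.013, 83.12, 21.61, 8.468, 27.63],
    "DMD2":     [6.038, 83.96, 21.58, 8.419, 27.56],
    "D-DMD":    [6.038, 84.52, 21.85, 9.176, 27.69],
    "CDM":      [6.075, 85.26, 21.95, 9.561, 27.98],
}

def top_two(table, column):
    """Return (best methods, second-best methods) for one column; higher is better.

    Methods with identical scores share a rank (an assumption of this sketch).
    """
    values = sorted({scores[column] for scores in table.values()}, reverse=True)
    best = [m for m, s in table.items() if s[column] == values[0]]
    second = [m for m, s in table.items() if s[column] == values[1]]
    return best, second

for i, name in enumerate(METRICS):
    print(name, *top_two(SD3_MEDIUM, i))
```

On these numbers CDM ranks first on Aesthetic, DPGBench, PickScore, and HPSv3, and second on CLIPScore behind Flash.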

Qualitative Comparison on SD3-Medium

Qualitative comparison on SD3-Medium. All results are generated from the same initial noise and random seed for a fair comparison. CDM produces more photorealistic results than competing methods, with sharper textures, finer details, and stronger semantic adherence to multi-entity compositional prompts.

More Qualitative Results

SD3-Medium (4 NFE)

More qualitative results on SD3-Medium

Longcat-Image (4 NFE)

More qualitative results on Longcat-Image

Ablation Study

Qualitative ablation of loss components across training steps. Left: Individual losses (CA, DM, CDM) in isolation. Right: Pairwise and full combinations. Partial combinations suffer from brightness collapse or degraded local fidelity at later stages, whereas our full objective (CA+DM+CDM) effectively preserves both global semantic coherence and local details.

Loss Components Ablation

Configuration          AES ↑   DPG ↑   PICK ↑  HPSv3 ↑  CLIP ↑
(a) Single Loss Ablation
   w/o L_CA            5.861   72.87   21.05   8.128    24.78
   w/o L_DM            6.016   84.57   21.75   8.954    27.66
   w/o L_CDM           6.067   85.12   21.85   9.153    27.91
(b) Dual Loss Ablation
   w/o L_DM & L_CDM    4.634   3.45    17.50   -10.15   14.60
   w/o L_CA & L_CDM    5.787   70.60   20.82   7.258    25.31
   w/o L_CA & L_DM     5.778   72.38   20.80   7.331    24.78
Full CDM               6.075   85.26   21.95   9.561    27.98
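To make the single-loss rows easier to compare, the short snippet below computes how much HPSv3 drops when each loss is removed from the full objective. The values are copied from the ablation above; the variable names are only for illustration.

```python
# HPSv3 values copied from the loss-component ablation above.
FULL_HPSV3 = 9.561
ABLATIONS = {
    "w/o L_CA": 8.128,
    "w/o L_DM": 8.954,
    "w/o L_CDM": 9.153,
}

# Drop relative to the full objective, rounded to the table's precision.
drops = {name: round(FULL_HPSV3 - score, 3) for name, score in ABLATIONS.items()}
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: -{drop:.3f}")
```

Removing the CA loss costs the most (1.433), followed by the DM loss (0.607) and the CDM loss (0.408), so each term contributes and none is redundant on this metric.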

Core Mechanism Design Ablation

Model Variant                  AES ↑   DPG ↑   PICK ↑  HPSv3 ↑  CLIP ↑
(a) Time Schedule
   w/ Fixed Schedule           6.051   83.84   21.89   9.482    27.75
(b) Off-trajectory Perturbation
   w/o Perturbation (on-traj)  6.027   84.43   21.94   9.374    27.90
   w/ Gaussian Perturbation    6.040   84.65   21.92   9.516    27.88
(c) Target Latent Construction
   w/ Full-trajectory target   6.026   85.14   21.92   9.346    27.97
Full CDM                       6.075   85.26   21.95   9.561    27.98

BibTeX

@misc{liu2026continuoustimedistributionmatchingfewstep,
      title={Continuous-Time Distribution Matching for Few-Step Diffusion Distillation}, 
      author={Tao Liu and Hao Yan and Mengting Chen and Taihang Hu and Zhengrong Yue and Zihao Pan and Jinsong Lan and Xiaoyong Zhu and Ming-Ming Cheng and Bo Zheng and Yaxing Wang},
      year={2026},
      eprint={2605.06376},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.06376}, 
}