Understanding Flatness in Generative Models: Its Role and Benefits

Ulsan National Institute of Science and Technology
ICCV 2025


Abstract

Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias -- where errors in noise estimation accumulate over iterations -- and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
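For intuition on how SAM explicitly controls flatness during training, below is a minimal sketch of a SAM-style update applied to a diffusion noise-prediction loss. The names `eps_model(x_t, t)`, `optimizer`, and the radius `rho` are illustrative placeholders, not the authors' implementation.

        # Minimal sketch of a SAM-style update for a diffusion noise-prediction loss.
        # `eps_model(x_t, t)`, `optimizer`, and `rho` are illustrative placeholders,
        # not the authors' implementation.
        import torch
        import torch.nn.functional as F

        def sam_step(eps_model, optimizer, x_t, t, noise, rho=0.05):
            # 1) Loss and gradients at the current weights.
            loss = F.mse_loss(eps_model(x_t, t), noise)
            loss.backward()

            # 2) Move to the (approximate) worst-case weights within an L2 ball of radius rho.
            grads = [p.grad for p in eps_model.parameters() if p.grad is not None]
            grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2) + 1e-12
            perturbations = []
            with torch.no_grad():
                for p in eps_model.parameters():
                    if p.grad is None:
                        continue
                    e = rho * p.grad / grad_norm
                    p.add_(e)                      # w -> w + e (ascent step)
                    perturbations.append((p, e))
            optimizer.zero_grad()

            # 3) Gradient at the perturbed weights, then restore and take the descent step.
            F.mse_loss(eps_model(x_t, t), noise).backward()
            with torch.no_grad():
                for p, e in perturbations:
                    p.sub_(e)                      # restore original weights
            optimizer.step()
            optimizer.zero_grad()
            return loss.item()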

Samples w/ and w/o 8-bit Quantization


Standalone +SAM Achieves Better FID Scores.

Dataset         CIFAR-10 (32×32)        LSUN Tower (64×64)      FFHQ (64×64)
T' (steps)      20        100           20        100           20        100
ADM             34.47     8.80          36.65     8.57          30.81     7.53
+EMA            10.63     4.06          7.87      2.49          19.03     6.19
+SWA            11.00     3.78          8.72      2.31          17.93     5.49
+IP             20.11     7.23          25.77     7.00          15.03     13.55
+IP+EMA         9.10      3.46          7.66      2.43          11.72     4.00
+IP+SWA         9.04      3.07          8.55      2.34          12.99     3.54
+SAM            9.01      3.83          16.02     4.79          11.59     5.29
+SAM+EMA        7.00      3.18          6.66      2.30          11.41     5.04
+SAM+SWA        7.27      2.96          6.50      2.27          12.15     4.17

Flatness Leads to Lower Exposure Bias.

MC flatness: loss plots under perturbation for CIFAR-10

ε norm: norm of the predicted noise for CIFAR-10
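The MC-flatness panel plots the loss under random weight perturbations. A minimal sketch of such a Monte-Carlo flatness measurement is given below, assuming an evaluation routine `eval_loss(model)` that returns a scalar loss and a perturbation radius `rho`; both are illustrative and not the paper's exact protocol.

        # Sketch of a Monte-Carlo flatness measurement: average loss increase under
        # random weight perturbations of L2 radius rho. `eval_loss(model)` (returning
        # a scalar loss) and the radius are illustrative, not the paper's exact protocol.
        import copy
        import torch

        @torch.no_grad()
        def mc_flatness(model, eval_loss, rho, n_samples=10):
            base = eval_loss(model)
            increases = []
            for _ in range(n_samples):
                perturbed = copy.deepcopy(model)
                # Draw a random direction in weight space and rescale it to norm rho.
                direction = [torch.randn_like(p) for p in perturbed.parameters()]
                norm = torch.sqrt(sum((d ** 2).sum() for d in direction))
                for p, d in zip(perturbed.parameters(), direction):
                    p.add_(rho * d / norm)
                increases.append(eval_loss(perturbed) - base)
            return sum(increases) / n_samples      # smaller average increase = flatter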

Theorem Overview

A conceptual illustration of the theoretical analysis: Theorem 1 (Corollary 1.1) translates a perturbation in parameter space into a set of perturbed distributions, and Theorem 2 (Corollary 2.1) shows that flat minima lead to robustness against the resulting distribution gap.

Theorem 1 (A perturbed distribution)

For a given prior distribution \( p(\mathbf{x}) \) and a \(\delta\)-perturbed minimum \(\boldsymbol{\theta}+\delta\), the following \(\hat{p}(\mathbf{x})\) satisfies the equality: $$ \hat{p}(\mathbf{x}) = e^{-I(\mathbf{x},\delta)} p(\mathbf{x}), $$ where $$ I(\mathbf{x},\delta):= \frac{1}{2}\mathbf{x}^{T}(\delta \mathbf{W}^{T})\mathbf{x} + \mathbf{x}^{T}\delta (\mathbf{U}^{T}\mathbf{e})+C, $$ and \( C \in \mathbb{R} \) is chosen so that $$ \int_{\mathbb{R}^{d}} e^{-I(\mathbf{x},\delta)}p(\mathbf{x})d\mathbf{x}=1. $$

Corollary 1.1 (Diffusion version of Theorem 1)

For a Gaussian prior over the noise, \( \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \), and a \(\delta\)-perturbed minimum \(\boldsymbol{\theta}+\delta\), the following equation is derived: $$ \hat{\epsilon} = e^{-I(\mathbf{x},\delta)} \: \epsilon = \mathcal{N}(\mu_{\delta}, \Sigma_{\delta}), $$ where $$ \Sigma_{\delta}:=\bigg(\mathbf{I}+\frac{\delta_w}{m}\bigg)^{-1}, \qquad \mu_{\delta}:=\frac{1}{m}\Sigma_{\delta}\delta_{u}. $$
The perturbed score model targets the score of a nearby distribution p̂(x) instead of the original p(x).
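As a concrete illustration of Corollary 1.1, the sketch below computes \(\Sigma_{\delta}\) and \(\mu_{\delta}\) from the stated formulas, treating \(m\) as a scalar and using small random placeholders for \(\delta_w\) and \(\delta_u\); the inputs are illustrative, not quantities from the paper.

        # Sketch: compute the perturbed Gaussian N(mu_delta, Sigma_delta) of Corollary 1.1,
        # treating m as a scalar and using small random placeholders for delta_w and delta_u
        # (illustrative inputs, not quantities from the paper).
        import numpy as np

        def perturbed_gaussian(delta_w, delta_u, m):
            d = delta_w.shape[0]
            Sigma_delta = np.linalg.inv(np.eye(d) + delta_w / m)   # (I + delta_w / m)^{-1}
            mu_delta = Sigma_delta @ delta_u / m                   # (1/m) Sigma_delta delta_u
            return mu_delta, Sigma_delta

        rng = np.random.default_rng(0)
        d, m = 4, 10.0
        A = 0.1 * rng.standard_normal((d, d))
        delta_w = (A + A.T) / 2                    # symmetric perturbation keeps Sigma_delta well-defined
        delta_u = 0.1 * rng.standard_normal(d)
        mu_delta, Sigma_delta = perturbed_gaussian(delta_w, delta_u, m)
        print(mu_delta, np.linalg.eigvalsh(Sigma_delta))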

Theorem 2 (A link from Δ-flatness to ε-gap robustness)

A \(\Delta\)-flat minimum achieves \(\mathcal{E}\)-distribution-gap robustness, such that \(\mathcal{E}\) is upper-bounded as follows:

$$ \mathcal{E}\;\le\;\max_{\hat p\sim\hat{\mathcal P}(\mathbf{x};p,\Delta)} D\bigl(p\;\|\;\hat p\bigr). $$

Corollary 2.1 (Diffusion version of Theorem 2)

For a diffusion model, a \(\Delta\)-flat minimum achieves \(\mathcal{E}\)-distribution-gap robustness, such that \(\mathcal{E}\) is upper-bounded as follows: $$ \mathcal{E} \;\le\; \max_{\|\delta\|_2\le\Delta} \frac12 \Bigl[ \log|\Sigma_{\delta}| - d + \operatorname{tr}\!\bigl(\Sigma_{\delta}^{-1}\bigr) + \boldsymbol\mu_{\delta}^{\!\top}\Sigma_{\delta}^{-1}\boldsymbol\mu_{\delta} \Bigr] $$ $$ \;\;\;\le\; \frac12\Bigl[ \textstyle\sum_{i=1}^{d}\bigl(\sigma_i - \log\sigma_i\bigr) - d + \dfrac{\sigma_{d}}{m^{2}}\, \bigl\|\mathbf{U}^{\!\top}\mathbf{e}\bigr\|_{2}^{2}\,\Delta^{2} \Bigr], $$ where \( \sigma_{1}\le\sigma_{2}\le\cdots\le\sigma_{d} \) are the eigenvalues of \( \Sigma_{\delta}^{-1} \).
A flatter minimum with a larger \(\Delta\) keeps the loss flat even for perturbed distributions \(\hat{p}(\mathbf{x})\) that are pushed farther away, i.e., have a larger divergence from \(p(\mathbf{x})\), which corresponds to a larger upper bound on \(\mathcal{E}\).
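Consistent with Theorem 2, the bracketed term in Corollary 2.1 is the closed form of the KL divergence \(D\bigl(\mathcal{N}(\mathbf{0},\mathbf{I})\,\|\,\mathcal{N}(\mu_{\delta},\Sigma_{\delta})\bigr)\), so it can be evaluated numerically. A minimal sketch with illustrative placeholder values for \(\mu_{\delta}\) and \(\Sigma_{\delta}\):

        # Sketch: the bracketed term in Corollary 2.1 equals KL( N(0, I) || N(mu_delta, Sigma_delta) ).
        # The values of mu_delta and Sigma_delta below are illustrative placeholders.
        import numpy as np

        def kl_std_normal_to(mu, Sigma):
            # KL( N(0, I) || N(mu, Sigma) )
            # = 1/2 [ log|Sigma| - d + tr(Sigma^{-1}) + mu^T Sigma^{-1} mu ]
            d = mu.shape[0]
            Sigma_inv = np.linalg.inv(Sigma)
            _, logdet = np.linalg.slogdet(Sigma)
            return 0.5 * (logdet - d + np.trace(Sigma_inv) + mu @ Sigma_inv @ mu)

        d = 4
        mu_delta = 0.05 * np.ones(d)                       # mild mean shift
        Sigma_delta = np.diag(1.0 + 0.1 * np.arange(d))    # mild anisotropic scaling
        print(kl_std_normal_to(mu_delta, Sigma_delta))     # small divergence -> small bound on E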

BibTeX


        @article{lee2025understanding,
          title={Understanding Flatness in Generative Models: Its Role and Benefits},
          author={Lee, Taehwan and Seo, Kyeongkook and Yoo, Jaejun and Yoon, Sung Whan},
          journal={arXiv preprint arXiv:2503.11078},
          year={2025}
        }