In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle to scale beyond their training resolutions, often producing structural distortions or content repetition. Reference-based methods address these issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality, while upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness.
Comparison of DemoFusion with different upsampling strategies. (a) Latent space bicubic upsampling causes manifold deviation, degrading output quality. (b) RGB space bicubic upsampling produces outputs with reduced detail and sharpness. (c) Our learned latent-space upsampling aligns the manifold, resulting in sharp and detailed outputs. Best viewed zoomed in.
(a) Existing latent upsampling frameworks rely on progressive upsampling to address manifold deviation. (b) Existing RGB upsampling frameworks can upsample directly (optionally progressively), but produce overly smooth outputs. (c) Our framework enables latent upsampling without progressive upscaling and with far fewer denoising steps (\( T_c < T \)) while producing detailed outputs (RNA omitted for simplicity). LR, MR, HR: low/mid/high resolution; DM: Diffusion Model.
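One common way to run fewer denoising steps from a reference is to diffuse the reference latent to an intermediate timestep and denoise from there. The sketch below illustrates that general idea only; it is not taken from the paper. The linear beta schedule, its endpoints, the start timestep of 600, and the helper names `make_alpha_bar`, `noise_to_step`, and `shortened_schedule` are all illustrative assumptions.

```python
import numpy as np


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def noise_to_step(z0, t, alpha_bar, rng):
    """Standard forward diffusion of a clean latent z0 to timestep t."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def shortened_schedule(start_t, num_steps):
    """Evenly spaced, descending timesteps from start_t down to 0."""
    return np.linspace(start_t, 0, num_steps).round().astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    reference = rng.standard_normal((4, 128, 128))  # stand-in for an upsampled latent
    start_t = 600                                   # hypothetical intermediate timestep
    z_start = noise_to_step(reference, start_t, alpha_bar, rng)
    timesteps = shortened_schedule(start_t, num_steps=30)
    print(z_start.shape, timesteps[:3], timesteps[-1])
```

A denoising loop would then iterate over `timesteps` with the diffusion model, starting from `z_start` rather than from pure noise.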
The proposed LSRNA enhances reference upsampling with Latent space Super-Resolution (LSR) and Region-wise Noise Addition (RNA). LSR directly maps the low-resolution reference latent onto the high-resolution manifold. RNA then injects region-adaptive noise into the mapped reference, guided by a Canny edge map, which facilitates detail generation in the higher-resolution generation stage.
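The region-adaptive noise injection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the edge map is already computed at image resolution (for example by a Canny detector), that it is average-pooled to the latent grid, and that the pooled edge density is linearly mapped to a per-location noise scale. The function name `region_wise_noise_addition` and the parameters `sigma_min` and `sigma_max` are hypothetical.

```python
import numpy as np


def region_wise_noise_addition(latent, edge_map, sigma_min=0.0, sigma_max=1.0, rng=None):
    """Add Gaussian noise to a latent, scaled per location by edge density.

    latent:   array of shape (C, h, w), e.g. an upsampled reference latent.
    edge_map: array of shape (H, W) with values in [0, 1]; H and W must be
              integer multiples of h and w (assumed precomputed edge map).
    Returns the noised latent and the (h, w) noise-scale map that was used.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = latent.shape
    H, W = edge_map.shape
    if H % h or W % w:
        raise ValueError("edge_map size must be an integer multiple of the latent size")
    fh, fw = H // h, W // w

    # Average-pool the edge map down to the latent grid (edge density per cell).
    density = edge_map.astype(np.float64).reshape(h, fh, w, fw).mean(axis=(1, 3))

    # Min-max normalise the density, then map it to [sigma_min, sigma_max].
    span = density.max() - density.min()
    norm = (density - density.min()) / span if span > 0 else np.zeros_like(density)
    sigma = sigma_min + (sigma_max - sigma_min) * norm

    # Broadcast the scale map over channels: more noise where edges are dense.
    noise = rng.standard_normal(latent.shape)
    return latent + sigma[None, :, :] * noise, sigma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = np.zeros((4, 8, 8))
    edges = np.zeros((64, 64))
    edges[:, :32] = 1.0  # toy edge map: edges only in the left half
    noised, sigma = region_wise_noise_addition(latent, edges, rng=rng)
    print(sigma[0])                     # noise scale along one latent row
    print(np.abs(noised[:, :, 4:]).max())  # right half receives no noise here
```

With `sigma_min=0`, flat regions are left untouched and only edge-dense regions are perturbed; raising `sigma_min` would add a baseline amount of noise everywhere.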
We present visual comparisons of the final outputs as the LSR and RNA modules are incorporated into the baseline (DemoFusion), under our default setting of 30 denoising steps without progressive upsampling. Although our LSRNA framework is designed to be compatible with any reference-based method, we adopt DemoFusion as the baseline, as it is a pioneering reference-based approach.
@article{jeong2025latent,
title={Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models},
author={Jeong, Jinho and Han, Sangmin and Kim, Jinwoo and Kim, Seon Joo},
journal={arXiv preprint arXiv:2503.18446},
year={2025}
}