Generating images from open-domain text prompts is difficult because the prompts can describe anything and often carry rich, complex semantics. Traditional conditional generation methods usually require large paired datasets and costly training or fine-tuning, making them expensive and inflexible.
We take a training-free approach by viewing conditional generation as a Bayesian inference problem. In this view, a pre-trained diffusion or flow-based model defines a prior distribution over images, capturing what the model has learned about natural image statistics. The text prompt, together with the generated image, is fed into a multimodal encoder, which produces a semantic compatibility score. We treat this score as an (unnormalized) likelihood, or energy, term to guide image generation. Conditional generation thus corresponds to sampling from a posterior that balances two forces: staying close to the image prior while increasing compatibility with the prompt. Crucially, this allows us to guide the sampling process directly, without paired training data or any fine-tuning of the generative model parameters.
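One compact way to express this balance (the symbols $p_\theta$, $s_\phi$, and $\lambda$ are illustrative notation introduced here, not fixed by the description above) is

$$p(x \mid c) \;\propto\; p_\theta(x)\,\exp\!\big(\lambda\, s_\phi(x, c)\big),$$

where $p_\theta$ is the image prior defined by the pre-trained diffusion or flow model, $s_\phi(x, c)$ is the compatibility score the multimodal encoder assigns to image $x$ and prompt $c$, and $\lambda$ is a guidance weight trading off prior fidelity against prompt compatibility.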
We focus on combining state-of-the-art diffusion and flow-based generative models with powerful multimodal large language models as guidance mechanisms. The goal is to develop a simple, general, and training-free framework for controlled generation—capable of producing high-quality images from text prompts with rich and complex semantics.
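To make the sampling mechanism concrete, the sketch below shows one standard way such guidance can be realized in a classifier-guidance style, where the gradient of the compatibility score (evaluated on a clean-image estimate) shifts the noise prediction of the pre-trained prior at every denoising step. The interfaces `eps_model` and `compat_score`, the DDIM-style update, and the assumption that the compatibility score is differentiable with respect to the image are all illustrative simplifications, not the framework's actual implementation.

```python
import torch

def guided_sample(eps_model, compat_score, prompt, shape, alphas_cumprod,
                  guidance_scale=2.0, device="cpu"):
    """Sample from a posterior proportional to prior * exp(scale * score).

    eps_model(x_t, t)        -> predicted noise (the pre-trained diffusion prior)
    compat_score(x0, prompt) -> scalar image-text compatibility (likelihood/energy)
    Both are assumed to be differentiable torch callables; names are illustrative.
    """
    T = alphas_cumprod.shape[0]
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)

        with torch.no_grad():
            eps = eps_model(x, t)  # prior's noise prediction at this step

        # Likelihood gradient: differentiate the compatibility score of the
        # current clean-image estimate with respect to the noisy sample x_t.
        x_req = x.detach().requires_grad_(True)
        x0_hat = (x_req - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        grad = torch.autograd.grad(compat_score(x0_hat, prompt).sum(), x_req)[0]

        # Shift the noise prediction by the guidance gradient (posterior score),
        # then take a deterministic DDIM-style step toward the previous timestep.
        eps_guided = eps - guidance_scale * (1 - a_bar).sqrt() * grad
        x0_pred = (x - (1 - a_bar).sqrt() * eps_guided) / a_bar.sqrt()
        x = a_bar_prev.sqrt() * x0_pred + (1 - a_bar_prev).sqrt() * eps_guided
    return x.detach()
```

Because the guidance enters only as a gradient added at sampling time, the prior's parameters are never updated, which is what makes the approach training-free.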