Generating images from open-domain text prompts is difficult because the prompts can describe anything and often carry rich, complex semantics. Traditional conditional generation methods usually require large paired datasets and costly training or fine-tuning, making them expensive and inflexible.
We take a training-free approach by viewing conditional generation as a Bayesian inference problem. In this view, a pre-trained diffusion or flow-based model defines a prior distribution over images, capturing what the model has learned about natural image statistics. The text prompt, together with the generated image, is fed into a multimodal encoder, which produces a semantic compatibility score. We treat this score as an (unnormalized) likelihood, or energy, term to guide image generation. Conditional generation thus corresponds to sampling from a posterior that balances two forces: staying close to the image prior while increasing compatibility with the prompt. Crucially, this allows us to guide the sampling process directly, without paired training data or any fine-tuning of the generative model parameters.
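One compact way to express this balance (the symbols $p_\theta$, $s_\phi$, and $\lambda$ are illustrative notation introduced here, not fixed by the description above) is

$$p(x \mid c) \;\propto\; p_\theta(x)\,\exp\!\big(\lambda\, s_\phi(x, c)\big),$$

where $p_\theta$ is the image prior defined by the pre-trained diffusion or flow model, $s_\phi(x, c)$ is the compatibility score the multimodal encoder assigns to image $x$ and prompt $c$, and $\lambda$ is a guidance weight trading off prior fidelity against prompt compatibility.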
We focus on combining state-of-the-art diffusion and flow-based generative models with powerful multimodal large language models as guidance mechanisms. The goal is to develop a simple, general, and training-free framework for controlled generation—capable of producing high-quality images from text prompts with rich and complex semantics.
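To make the sampling mechanism concrete, the sketch below shows one standard way such guidance can be realized in a classifier-guidance style, where the gradient of the compatibility score (evaluated on a clean-image estimate) shifts the noise prediction of the pre-trained prior at every denoising step. The interfaces `eps_model` and `compat_score`, the DDIM-style update, and the assumption that the compatibility score is differentiable with respect to the image are all illustrative simplifications, not the framework's actual implementation.

```python
import torch

def guided_sample(eps_model, compat_score, prompt, shape, alphas_cumprod,
                  guidance_scale=2.0, device="cpu"):
    """Sample from a posterior proportional to prior * exp(scale * score).

    eps_model(x_t, t)        -> predicted noise (the pre-trained diffusion prior)
    compat_score(x0, prompt) -> scalar image-text compatibility (likelihood/energy)
    Both are assumed to be differentiable torch callables; names are illustrative.
    """
    T = alphas_cumprod.shape[0]
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)

        with torch.no_grad():
            eps = eps_model(x, t)  # prior's noise prediction at this step

        # Likelihood gradient: differentiate the compatibility score of the
        # current clean-image estimate with respect to the noisy sample x_t.
        x_req = x.detach().requires_grad_(True)
        x0_hat = (x_req - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        grad = torch.autograd.grad(compat_score(x0_hat, prompt).sum(), x_req)[0]

        # Shift the noise prediction by the guidance gradient (posterior score),
        # then take a deterministic DDIM-style step toward the previous timestep.
        eps_guided = eps - guidance_scale * (1 - a_bar).sqrt() * grad
        x0_pred = (x - (1 - a_bar).sqrt() * eps_guided) / a_bar.sqrt()
        x = a_bar_prev.sqrt() * x0_pred + (1 - a_bar_prev).sqrt() * eps_guided
    return x.detach()
```

Because the guidance enters only as a gradient added at sampling time, the prior's parameters are never updated, which is what makes the approach training-free.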