Modern vision-language models map images and text into a shared embedding space, but standard approaches rely on deterministic point embeddings and cannot represent uncertainty or the one-to-many nature of cross-modal correspondence. This project develops GeoFlowVLM, a geometry-aware probabilistic framework that models the joint distribution of image and text embeddings directly on the product hypersphere using Riemannian Flow Matching. The method enables unified joint, conditional, and marginal sampling without retraining the underlying vision-language model. By explicitly modeling cross-modal uncertainty on the manifold, the project aims to improve uncertainty quantification, retrieval robustness, and probabilistic reasoning in vision-language systems.
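To make the core construction concrete: Riemannian flow matching on the unit sphere typically pairs a geodesic interpolant between a noise point and a data point with a log-map target velocity that the learned vector field regresses onto. The following is a minimal NumPy sketch of that idea, not the project's actual implementation; the endpoint names and the specific conditional target `u_t = log_{x_t}(x_1)/(1-t)` are illustrative assumptions.

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors x0 and x1."""
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if theta < 1e-8:
        return x0.copy()
    return (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

def log_map(p, q):
    """Riemannian log map on the sphere: tangent vector at p pointing toward q."""
    d = np.clip(np.dot(p, q), -1.0, 1.0)
    v = q - d * p                      # component of q orthogonal to p
    nv = np.linalg.norm(v)
    if nv < 1e-8:
        return np.zeros_like(p)
    return np.arccos(d) * v / nv       # scale direction by geodesic distance

rng = np.random.default_rng(0)
x0 = normalize(rng.normal(size=64))    # "noise" endpoint on the sphere (assumed prior)
x1 = normalize(rng.normal(size=64))    # "data" embedding on the sphere
t = 0.3
xt = slerp(x0, x1, t)                  # interpolant stays on the manifold
ut = log_map(xt, x1) / (1 - t)         # conditional target velocity for flow matching

# Sanity checks: the interpolant is on the sphere and the velocity is tangent to it.
assert abs(np.linalg.norm(xt) - 1.0) < 1e-6
assert abs(np.dot(xt, ut)) < 1e-6
```

A trained model would regress a time-conditioned vector field onto `ut` at sampled `(x0, x1, t)` triples; for the product hypersphere described above, the same construction applies per factor (one sphere for the image embedding, one for the text embedding), which is what makes joint, conditional, and marginal sampling variants of a single learned field.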